Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6939
George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Song Wang, Kim Kyungnam, Bedrich Benes, Kenneth Moreland, Christoph Borst, Stephen DiVerdi, Chiang Yi-Jen, Jiang Ming (Eds.)
Advances in Visual Computing 7th International Symposium, ISVC 2011 Las Vegas, NV, USA, September 26-28, 2011 Proceedings, Part II
Volume Editors
George Bebis, E-mail: [email protected]
Richard Boyle, E-mail: [email protected]
Bahram Parvin, E-mail: [email protected]
Darko Koracin, E-mail: [email protected]
Song Wang, E-mail: [email protected]
Kim Kyungnam, E-mail: [email protected]
Bedrich Benes, E-mail: [email protected]
Kenneth Moreland, E-mail: [email protected]
Christoph Borst, E-mail: [email protected]
Stephen DiVerdi, E-mail: [email protected]
Chiang Yi-Jen, E-mail: [email protected]
Jiang Ming, E-mail: [email protected]

ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-24030-0
e-ISBN 978-3-642-24031-7
DOI 10.1007/978-3-642-24031-7
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011935942
CR Subject Classification (1998): I.3-5, H.5.2, I.2.10, J.3, F.2.2, I.3.5
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
It is with great pleasure that we welcome you to the proceedings of the 7th International Symposium on Visual Computing (ISVC 2011), which was held in Las Vegas, Nevada. ISVC provides a common umbrella for the four main areas of visual computing: vision, graphics, visualization, and virtual reality. The goal is to provide a forum for researchers, scientists, engineers, and practitioners throughout the world to present their latest research findings, ideas, developments, and applications in the broader area of visual computing.

This year, the program consisted of 12 oral sessions, 1 poster session, 5 special tracks, and 6 keynote presentations. The response to the call for papers was very good; we received over 240 submissions for the main symposium, from which we accepted 68 papers for oral presentation and 46 papers for poster presentation. Special track papers were solicited separately through the Organizing and Program Committees of each track. A total of 30 papers were accepted for oral presentation in the special tracks.

All papers were reviewed with an emphasis on their potential to contribute to the state of the art in the field. Selection criteria included accuracy and originality of ideas, clarity and significance of results, and presentation quality. The review process was quite rigorous, involving two to three independent blind reviews followed by several days of discussion. During the discussion period we tried to correct anomalies and errors that might have existed in the initial reviews. Despite our efforts, we recognize that some papers worthy of inclusion may not have been included in the program. We offer our sincere apologies to authors whose contributions might have been overlooked.

We wish to thank everybody who submitted their work to ISVC 2011 for review. It was because of their contributions that we succeeded in having a technical program of high scientific quality. In particular, we would like to thank the ISVC 2011 Area Chairs, the organizing institutions (UNR, DRI, LBNL, and NASA Ames), the government and industrial sponsors (Intel, DigitalPersona, Ford, Hewlett Packard, Mitsubishi Electric Research Labs, Toyota, Delphi, General Electric, Microsoft MSDN, and Volt), the international Program Committee, the special track organizers and their Program Committees, the keynote speakers, the reviewers, and especially the authors who contributed their work to the symposium. We would also like to thank Mitsubishi Electric Research Labs for kindly sponsoring a “best paper award” this year.

We sincerely hope that the proceedings of ISVC 2011 will offer opportunities for professional growth.

July 2011
ISVC’11 Steering Committee and Area Chairs
Organization
ISVC 2011 Steering Committee
Bebis George – University of Nevada, Reno, USA and King Saud University, Saudi Arabia
Boyle Richard – NASA Ames Research Center, USA
Parvin Bahram – Lawrence Berkeley National Laboratory, USA
Koracin Darko – Desert Research Institute, USA
ISVC 2011 Area Chairs

Computer Vision
Wang Song – University of South Carolina, USA
Kim Kyungnam (Ken) – HRL Laboratories, USA
Computer Graphics
Benes Bedrich – Purdue University, USA
Moreland Kenneth – Sandia National Laboratory, USA
Virtual Reality
Borst Christoph – University of Louisiana at Lafayette, USA
DiVerdi Stephen – Adobe, USA

Visualization
Chiang Yi-Jen – Polytechnic Institute of New York University, USA
Jiang Ming – Lawrence Livermore National Lab, USA
Publicity
Albu Branzan Alexandra – University of Victoria, Canada
Pati Peeta Basa – CoreLogic, India
Local Arrangements
Regentova Emma – University of Nevada, Las Vegas, USA
Special Tracks
Sun Zehang – Apple, USA
ISVC 2011 Keynote Speakers
Comaniciu Dorin – Siemens Corporate Research, USA
Geist Robert – Clemson University, USA
Mueller Klaus – Stony Brook University, USA
Huang Thomas – University of Illinois at Urbana-Champaign, USA
Li Fei-Fei – Stanford University, USA
Lok Benjamin – University of Florida, USA
ISVC 2011 International Program Committee (Area 1) Computer Vision Abidi Besma Abou-Nasr Mahmoud Agaian Sos Aggarwal J.K. Albu Branzan Alexandra Amayeh Gholamreza Agouris Peggy Argyros Antonis Asari Vijayan Athitsos Vassilis Basu Anup Bekris Kostas Belyaev Alexander Bensrhair Abdelaziz Bhatia Sanjiv Bimber Oliver Bioucas Jose Birchfield Stan Bourbakis Nikolaos Brimkov Valentin Campadelli Paola Cavallaro Andrea Charalampidis Dimitrios Chellappa Rama Chen Yang Cheng Hui Chowdhury Amit K. Roy Cochran Steven Douglas Chung Chi-Kit Ronald Cremers Daniel
University of Tennessee at Knoxville, USA Ford Motor Company, USA University of Texas at San Antonio, USA University of Texas, Austin, USA University of Victoria, Canada Eyecom, USA George Mason University, USA University of Crete, Greece University of Dayton, USA University of Texas at Arlington, USA University of Alberta, Canada University of Nevada at Reno, USA Max-Planck-Institut für Informatik, Germany INSA-Rouen, France University of Missouri-St. Louis, USA Johannes Kepler University Linz, Austria Instituto Superior Tecnico, Lisbon, Portugal Clemson University, USA Wright State University, USA State University of New York, USA Università degli Studi di Milano, Italy Queen Mary, University of London, UK University of New Orleans, USA University of Maryland, USA HRL Laboratories, USA Sarnoff Corporation, USA University of California at Riverside, USA University of Pittsburgh, USA The Chinese University of Hong Kong, Hong Kong University of Bonn, Germany
Cui Jinshi Darbon Jerome Davis James W. Debrunner Christian Demirdjian David Duan Ye Doulamis Anastasios Dowdall Jonathan El-Ansari Mohamed El-Gammal Ahmed Eng How Lung Erol Ali Fan Guoliang Ferri Francesc Ferryman James Foresti GianLuca Fowlkes Charless Fukui Kazuhiro Galata Aphrodite Georgescu Bogdan Gleason Shaun Goh Wooi-Boon Guerra-Filho Gutemberg Guevara Angel Miguel Gustafson David Hammoud Riad Harville Michael He Xiangjian Heikkilä Janne Heyden Anders Hongbin Zha Hou Zujun Hua Gang Imiya Atsushi Jia Kevin Kamberov George Kampel Martin Kamberova Gerda Kakadiaris Ioannis Kettebekov Sanzhar Khan Hameed Ullah Kim Tae-Kyun Kimia Benjamin Kisacanin Branislav
Peking University, China CNRS-Ecole Normale Superieure de Cachan, France Ohio State University, USA Colorado School of Mines, USA Vecna Robotics, USA University of Missouri-Columbia, USA National Technical University of Athens, Greece 510 Systems, USA Ibn Zohr University, Morocco University of New Jersey, USA Institute for Infocomm Research, Singapore Ocali Information Technology, Turkey Oklahoma State University, USA Universitat de Valencia, Spain University of Reading, UK University of Udine, Italy University of California, Irvine, USA The University of Tsukuba, Japan The University of Manchester, UK Siemens, USA Oak Ridge National Laboratory, USA Nanyang Technological University, Singapore University of Texas Arlington, USA University of Porto, Portugal Kansas State University, USA DynaVox Systems, USA Hewlett Packard Labs, USA University of Technology, Sydney, Australia University of Oulu, Finland Lund University, Sweden Peking University, China Institute for Infocomm Research, Singapore IBM T.J. Watson Research Center, USA Chiba University, Japan IGT, USA Stevens Institute of Technology, USA Vienna University of Technology, Austria Hofstra University, USA University of Houston, USA Keane Inc., USA King Saud University, Saudi Arabia Imperial College London, UK Brown University, USA Texas Instruments, USA
Klette Reinhard Kokkinos Iasonas Kollias Stefanos Komodakis Nikos Kozintsev Igor Kuno Yoshinori Latecki Longin Jan Lee D.J. Li Chunming Li Fei-Fei Li Xiaowei Lim Ser N Lin Zhe Lisin Dima Lee Seong-Whan Leung Valerie Leykin Alex Li Shuo Li Wenjing Liu Jianzhuang Loss Leandro Luo Gang Ma Yunqian Maeder Anthony Maltoni Davide Mauer Georg Maybank Steve McGraw Tim Medioni Gerard Melenchón Javier Metaxas Dimitris Miller Ron Ming Wei Mirmehdi Majid Monekosso Dorothy Mueller Klaus Mulligan Jeff Murray Don Nait-Charif Hammadi Nefian Ara Nicolescu Mircea Nixon Mark Nolle Lars
Auckland University, New Zealand Ecole Centrale Paris, France National Technical University of Athens, Greece Ecole Centrale de Paris, France Intel, USA Saitama University, Japan Temple University, USA Brigham Young University, USA Vanderbilt University, USA Stanford University, USA Google Inc., USA GE Research, USA Adobe, USA VideoIQ, USA Korea University, Korea ONERA, France Indiana University, USA GE Healthcare, Canada STI Medical Systems, USA The Chinese University of Hong Kong, Hong Kong Lawrence Berkeley National Lab, USA Harvard University, USA Honeywell Labs, USA University of Western Sydney, Australia University of Bologna, Italy University of Nevada, Las Vegas, USA Birkbeck College, UK West Virginia University, USA University of Southern California, USA Universitat Oberta de Catalunya, Spain Rutgers University, USA Wright Patterson Air Force Base, USA Konica Minolta Laboratory U.S.A., Inc., USA Bristol University, UK University of Ulster, UK Stony Brook University, USA NASA Ames Research Center, USA Point Grey Research, Canada Bournemouth University, UK NASA Ames Research Center, USA University of Nevada, Reno, USA University of Southampton, UK The Nottingham Trent University, UK
Ntalianis Klimis Or Siu Hang Papadourakis George Papanikolopoulos Nikolaos Pati Peeta Basa Patras Ioannis Petrakis Euripides Peyronnet Sylvain Pinhanez Claudio Piccardi Massimo Pietikäinen Matti Porikli Fatih Prabhakar Salil Prati Andrea Prokhorov Danil Pylvanainen Timo Qi Hairong Qian Gang Raftopoulos Kostas Regazzoni Carlo Regentova Emma Remagnino Paolo Ribeiro Eraldo Robles-Kelly Antonio Ross Arun Samal Ashok Samir Tamer Sandberg Kristian Sarti Augusto Savakis Andreas Schaefer Gerald Scalzo Fabien Scharcanski Jacob Shah Mubarak Shi Pengcheng Shimada Nobutaka Singh Meghna Singh Rahul Skurikhin Alexei Souvenir Richard Su Chung-Yen
National Technical University of Athens, Greece The Chinese University of Hong Kong, Hong Kong Technological Education Institute, Greece University of Minnesota, USA CoreLogic, India Queen Mary University, London, UK Technical University of Crete, Greece LRDE/EPITA, France IBM Research, Brazil University of Technology, Australia University of Oulu, Finland Mitsubishi Electric Research Labs, USA DigitalPersona Inc., USA University of Modena and Reggio Emilia, Italy Toyota Research Institute, USA Nokia, Finland University of Tennessee at Knoxville, USA Arizona State University, USA National Technical University of Athens, Greece University of Genoa, Italy University of Nevada, Las Vegas, USA Kingston University, UK Florida Institute of Technology, USA National ICT Australia (NICTA), Australia West Virginia University, USA University of Nebraska, USA Ingersoll Rand Security Technologies, USA Computational Solutions, USA DEI Politecnico di Milano, Italy Rochester Institute of Technology, USA Loughborough University, UK University of California at Los Angeles, USA UFRGS, Brazil University of Central Florida, USA The Hong Kong University of Science and Technology, Hong Kong Ritsumeikan University, Japan University of Alberta, Canada San Francisco State University, USA Los Alamos National Laboratory, USA University of North Carolina - Charlotte, USA National Taiwan Normal University, Taiwan
Sugihara Kokichi Sun Zehang Syeda-Mahmood Tanveer Tan Kar Han Tan Tieniu Tavakkoli Alireza Tavares Joao Teoh Eam Khwang Thiran Jean-Philippe Tistarelli Massimo Tong Yan Tsechpenakis Gabriel Tsui T.J. Trucco Emanuele Tubaro Stefano Uhl Andreas Velastin Sergio Verri Alessandro Wang C.L. Charlie Wang Junxian Wang Yunhong Webster Michael Wolff Larry Wong Kenneth Xiang Tao Xue Xinwei Xu Meihe Yang Ming-Hsuan Yang Ruigang Yin Lijun Yu Ting Yu Zeyun Yuan Chunrong Zabulis Xenophon Zhang Yan Cheng Shinko Zhou Huiyu
University of Tokyo, Japan Apple, USA IBM Almaden, USA Hewlett Packard, USA Chinese Academy of Sciences, China University of Houston - Victoria, USA Universidade do Porto, Portugal Nanyang Technological University, Singapore Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland University of Sassari, Italy University of South Carolina, USA University of Miami, USA Chinese University of Hong Kong, Hong Kong University of Dundee, UK DEI, Politecnico di Milano, Italy Salzburg University, Austria Kingston University London, UK Università di Genova, Italy The Chinese University of Hong Kong, Hong Kong Microsoft, USA Beihang University, China University of Nevada, Reno, USA Equinox Corporation, USA The University of Hong Kong, Hong Kong Queen Mary, University of London, UK Fair Isaac Corporation, USA University of California at Los Angeles, USA University of California at Merced, USA University of Kentucky, USA SUNY at Binghamton, USA GE Global Research, USA University of Wisconsin-Milwaukee, USA University of Tübingen, Germany Foundation for Research and Technology - Hellas (FORTH), Greece Delphi Corporation, USA HRL Labs, USA Queen's University Belfast, UK
(Area 2) Computer Graphics Abd Rahni Mt Piah Abram Greg Adamo-Villani Nicoletta Agu Emmanuel
Universiti Sains Malaysia, Malaysia Texas Advanced Computing Center, USA Purdue University, USA Worcester Polytechnic Institute, USA
Andres Eric Artusi Alessandro Baciu George Balcisoy Selim Saffet Barneva Reneta Belyaev Alexander Berberich Eric Bilalis Nicholas Bimber Oliver Bohez Erik Bouatouch Kadi Brimkov Valentin Brown Ross Bruckner Stefan Callahan Steven Chen Min Cheng Irene Choi Min Comba Joao Crawfis Roger Cremer Jim Crossno Patricia Culbertson Bruce Debattista Kurt Deng Zhigang Dick Christian Dingliana John El-Sana Jihad Entezari Alireza Fabian Nathan Fiorio Christophe De Floriani Leila Gaither Kelly Gao Chunyu Geist Robert Gelb Dan Gotz David Gooch Amy Gu David Guerra-Filho Gutemberg Habib Zulfiqar Hadwiger Markus
Laboratory XLIM-SIC, University of Poitiers, France CaSToRC Cyprus Institute, Cyprus Hong Kong PolyU, Hong Kong Sabanci University, Turkey State University of New York, USA Max-Planck-Institut für Informatik, Germany Max Planck Institute, Germany Technical University of Crete, Greece Johannes Kepler University Linz, Austria Asian Institute of Technology, Thailand University of Rennes I, IRISA, France State University of New York, USA Queensland University of Technology, Australia Vienna University of Technology, Austria University of Utah, USA University of Wales Swansea, UK University of Alberta, Canada University of Colorado at Denver, USA Universidade Federal do Rio Grande do Sul, Brazil Ohio State University, USA University of Iowa, USA Sandia National Laboratories, USA HP Labs, USA University of Warwick, UK University of Houston, USA Technical University of Munich, Germany Trinity College, Ireland Ben Gurion University of The Negev, Israel University of Florida, USA Sandia National Laboratories, USA Université Montpellier 2, LIRMM, France University of Genoa, Italy University of Texas at Austin, USA Epson Research and Development, USA Clemson University, USA Hewlett Packard Labs, USA IBM, USA University of Victoria, Canada State University of New York at Stony Brook, USA University of Texas Arlington, USA COMSATS Institute of Information Technology, Lahore, Pakistan KAUST, Saudi Arabia
Haller Michael Hamza-Lup Felix Han JungHyun Hand Randall Hao Xuejun Hernandez Jose Tiberio Huang Jian Huang Mao Lin Huang Zhiyong Hussain Muhammad Joaquim Jorge Jones Michael Ju Tao Julier Simon J. Kakadiaris Ioannis Kamberov George Klosowski James Kobbelt Leif Kolingerova Ivana Kuan Hwee Lee Lai Shuhua Lee Chang Ha Lee Tong-Yee Levine Martin Lewis R. Robert Li Frederick Lindstrom Peter Linsen Lars Loviscach Joern Magnor Marcus Majumder Aditi Mantler Stephan Martin Ralph McGraw Tim Meenakshisundaram Gopi Mendoza Cesar Metaxas Dimitris Myles Ashish Nait-Charif Hammadi Nasri Ahmad Noma Tsukasa Okada Yoshihiro Olague Gustavo
Upper Austria University of Applied Sciences, Austria Armstrong Atlantic State University, USA Korea University, Korea Lockheed Martin Corporation, USA Columbia University and NYSPI, USA Universidad de los Andes, Colombia University of Tennessee at Knoxville, USA University of Technology, Australia Institute for Infocomm Research, Singapore King Saud University, Saudi Arabia Instituto Superior Tecnico, Portugal Brigham Young University, USA Washington University, USA University College London, UK University of Houston, USA Stevens Institute of Technology, USA AT&T Labs, USA RWTH Aachen, Germany University of West Bohemia, Czech Republic Bioinformatics Institute, A*STAR, Singapore Virginia State University, USA Chung-Ang University, Korea National Cheng-Kung University, Taiwan McGill University, Canada Washington State University, USA University of Durham, UK Lawrence Livermore National Laboratory, USA Jacobs University, Germany Fachhochschule Bielefeld (University of Applied Sciences), Germany TU Braunschweig, Germany University of California, Irvine, USA VRVis Research Center, Austria Cardiff University, UK West Virginia University, USA University of California-Irvine, USA NaturalMotion Ltd., USA Rutgers University, USA University of Florida, USA University of Dundee, UK American University of Beirut, Lebanon Kyushu Institute of Technology, Japan Kyushu University, Japan CICESE Research Center, Mexico
Oliveira Manuel M. Ostromoukhov Victor M. Pascucci Valerio Patchett John Peterka Tom Peters Jorg Qin Hong Rautek Peter Razdan Anshuman Renner Gabor Rosen Paul Rosenbaum Rene Rudomin Isaac Rushmeier Holly Sander Pedro Sapidis Nickolas Sarfraz Muhammad Scateni Riccardo Schaefer Scott Sequin Carlo Shead Timothy Sourin Alexei Stamminger Marc Su Wen-Poh Szumilas Lech Tan Kar Han Tarini Marco Teschner Matthias Tsong Ng Tian Umlauf Georg Vanegas Carlos Wald Ingo Wang Sen Wimmer Michael Woodring Jon Wylie Brian Wyman Chris Wyvill Brian Yang Qing-Xiong Yang Ruigang
Universidade Federal do Rio Grande do Sul, Brazil University of Montreal, Canada University of Utah, USA Los Alamos National Lab, USA Argonne National Laboratory, USA University of Florida, USA State University of New York at Stony Brook, USA Vienna University of Technology, Austria Arizona State University, USA Computer and Automation Research Institute, Hungary University of Utah, USA University of California at Davis, USA ITESM-CEM, Mexico Yale University, USA The Hong Kong University of Science and Technology, Hong Kong University of Western Macedonia, Greece Kuwait University, Kuwait University of Cagliari, Italy Texas A&M University, USA University of California-Berkeley, USA Sandia National Laboratories, USA Nanyang Technological University, Singapore REVES/INRIA, France Griffith University, Australia Research Institute for Automation and Measurements, Poland Hewlett Packard, USA Università dell'Insubria (Varese), Italy University of Freiburg, Germany Institute for Infocomm Research, Singapore HTWG Constance, Germany Purdue University, USA University of Utah, USA Kodak, USA Technical University of Vienna, Austria Los Alamos National Laboratory, USA Sandia National Laboratory, USA University of Calgary, Canada University of Iowa, USA University of Illinois at Urbana-Champaign, USA University of Kentucky, USA
Ye Duan Yi Beifang Yin Lijun Yoo Terry Yuan Xiaoru Zhang Jian Jun Zara Jiri Zordan Victor
University of Missouri-Columbia, USA Salem State College, USA Binghamton University, USA National Institutes of Health, USA Peking University, China Bournemouth University, UK Czech Technical University in Prague, Czech Republic University of California at Riverside, USA
(Area 3) Virtual Reality Alcañiz Mariano Arns Laura Azuma Robert Balcisoy Selim Behringer Reinhold Bilalis Nicholas Blach Roland Blom Kristopher Boulic Ronan Brady Rachael Brega Jose Remo Ferreira Brown Ross Bruce Thomas Bues Matthias Chen Jian Cheng Irene Coquillart Sabine Craig Alan Cremer Jim Egges Arjan Encarnacao L. Miguel Figueroa Pablo Fox Jesse Friedman Doron Gregory Michelle Gupta Satyandra K. Haller Michael Hamza-Lup Felix Hinkenjann Andre Hollerer Tobias Huang Jian Julier Simon J. Kiyokawa Kiyoshi
Technical University of Valencia, Spain Purdue University, USA Nokia, USA Sabanci University, Turkey Leeds Metropolitan University UK Technical University of Crete, Greece Fraunhofer Institute for Industrial Engineering, Germany University of Barcelona, Spain EPFL, Switzerland Duke University, USA Universidade Estadual Paulista, Brazil Queensland University of Technology, Australia The University of South Australia, Australia Fraunhofer IAO in Stuttgart, Germany Brown University, USA University of Alberta, Canada INRIA, France NCSA University of Illinois at Urbana-Champaign, USA University of Iowa, USA Universiteit Utrecht, The Netherlands University of Louisville, USA Universidad de los Andes, Colombia Stanford University, USA IDC, Israel Pacific Northwest National Lab, USA University of Maryland, USA FH Hagenberg, Austria Armstrong Atlantic State University, USA Bonn-Rhein-Sieg University of Applied Sciences, Germany University of California at Santa Barbara, USA University of Tennessee at Knoxville, USA University College London, UK Osaka University, Japan
Klosowski James Kozintsev Igor Kuhlen Torsten Lee Cha Liere Robert van Livingston A. Mark Majumder Aditi Malzbender Tom Mantler Stephan Molineros Jose Muller Stefan Olwal Alex Paelke Volker Papka Michael Peli Eli Pettifer Steve Piekarski Wayne Pugmire Dave Qian Gang Raffin Bruno Raij Andrew Reiners Dirk Richir Simon Rodello Ildeberto Sandor Christian Santhanam Anand Sapidis Nickolas Schulze Jurgen Sherman Bill Slavik Pavel Sourin Alexei Steinicke Frank Su Simon Suma Evan Stamminger Marc Srikanth Manohar Stefani Oliver Sun Hanqiu Varsamidis Thomas Vercher Jean-Louis Wald Ingo Wither Jason
AT&T Labs, USA Intel, USA RWTH Aachen University, Germany University of California, Santa Barbara, USA CWI, The Netherlands Naval Research Laboratory, USA University of California, Irvine, USA Hewlett Packard Labs, USA VRVis Research Center, Austria Teledyne Scientific and Imaging, USA University of Koblenz, Germany MIT, USA Institut de Geomàtica, Spain Argonne National Laboratory, USA Harvard University, USA The University of Manchester, UK Qualcomm Bay Area R&D, USA Los Alamos National Lab, USA Arizona State University, USA INRIA, France University of South Florida, USA University of Louisiana, USA Arts et Metiers ParisTech, France University of Sao Paulo, Brazil University of South Australia, Australia University of California at Los Angeles, USA University of Western Macedonia, Greece University of California - San Diego, USA Indiana University, USA Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore University of Münster, Germany Geophysical Fluid Dynamics Laboratory, NOAA, USA University of Southern California, USA REVES/INRIA, France Indian Institute of Science, India COAT-Basel, Switzerland The Chinese University of Hong Kong, Hong Kong Bangor University, UK Université de la Méditerranée, France University of Utah, USA University of California, Santa Barbara, USA
Yu Ka Chun Yuan Chunrong Zachmann Gabriel Zara Jiri Zhang Hui Zhao Ye
Denver Museum of Nature and Science, USA University of Tübingen, Germany Clausthal University, Germany Czech Technical University in Prague, Czech Republic Indiana University, USA Kent State University, USA
(Area 4) Visualization Andrienko Gennady Avila Lisa Apperley Mark Balázs Csébfalvi Brady Rachael Benes Bedrich Bilalis Nicholas Bonneau Georges-Pierre Brown Ross Bühler Katja Callahan Steven Chen Jian Chen Min Cheng Irene Chourasia Amit Coming Daniel Dana Kristin Daniels Joel Dick Christian Doleisch Helmut Duan Ye Dwyer Tim Ebert David Entezari Alireza Ertl Thomas De Floriani Leila Fujishiro Issei Geist Robert Goebel Randy Gotz David Grinstein Georges Goebel Randy Gregory Michelle Hadwiger Helmut Markus Hagen Hans
Fraunhofer Institute IAIS, Germany Kitware, USA University of Waikato, New Zealand Budapest University of Technology and Economics, Hungary Duke University, USA Purdue University, USA Technical University of Crete, Greece Grenoble Université, France Queensland University of Technology, Australia VRVIS, Austria University of Utah, USA Brown University, USA University of Wales Swansea, UK University of Alberta, Canada University of California - San Diego, USA Desert Research Institute, USA Rutgers University, USA University of Utah, USA Technical University of Munich, Germany VRVis Research Center, Austria University of Missouri-Columbia, USA Monash University, Australia Purdue University, USA University of Florida, USA University of Stuttgart, Germany University of Maryland, USA Keio University, Japan Clemson University, USA University of Alberta, Canada IBM, USA University of Massachusetts Lowell, USA University of Alberta, Canada Pacific Northwest National Lab, USA VRVis Research Center, Austria Technical University of Kaiserslautern, Germany
Hamza-Lup Felix Heer Jeffrey Hege Hans-Christian Hochheiser Harry Hollerer Tobias Hong Lichan Hotz Ingrid Joshi Alark Julier Simon J. Kao David Kohlhammer Jörn Kosara Robert Laramee Robert Lee Chang Ha Lewis R. Robert Liere Robert van Lim Ik Soo Linsen Lars Liu Zhanping Ma Kwan-Liu Maeder Anthony Majumder Aditi Malpica Jose Masutani Yoshitaka Matkovic Kresimir McCaffrey James McGraw Tim Melançon Guy Miksch Silvia Monroe Laura Morie Jacki Mueller Klaus Museth Ken Paelke Volker Papka Michael Pettifer Steve Pugmire Dave Rabin Robert Raffin Bruno Razdan Anshuman Rhyne Theresa-Marie Rosenbaum Rene Santhanam Anand Scheuermann Gerik
Armstrong Atlantic State University, USA University of California at Berkeley, USA Zuse Institute Berlin, Germany University of Pittsburgh, USA University of California at Santa Barbara, USA Palo Alto Research Center, USA Zuse Institute Berlin, Germany Yale University, USA University College London, UK NASA Ames Research Center, USA Fraunhofer Institut, Germany University of North Carolina at Charlotte, USA Swansea University, UK Chung-Ang University, Korea Washington State University, USA CWI, The Netherlands Bangor University, UK Jacobs University, Germany University of Pennsylvania, USA University of California-Davis, USA University of Western Sydney, Australia University of California, Irvine, USA Alcala University, Spain The University of Tokyo Hospital, Japan VRVis Forschungs-GmbH, Austria Microsoft Research / Volt VTE, USA West Virginia University, USA CNRS UMR 5800 LaBRI and INRIA Bordeaux Sud-Ouest, France Vienna University of Technology, Austria Los Alamos National Labs, USA University of Southern California, USA Stony Brook University, USA Linköping University, Sweden Institut de Geomàtica, Spain Argonne National Laboratory, USA The University of Manchester, UK Los Alamos National Lab, USA University of Wisconsin at Madison, USA INRIA, France Arizona State University, USA North Carolina State University, USA University of California at Davis, USA University of California at Los Angeles, USA University of Leipzig, Germany
Shead Timothy Shen Han-Wei Sips Mike Slavik Pavel Sourin Alexei Thakur Sidharth Theisel Holger Thiele Olaf Toledo de Rodrigo Tricoche Xavier Umlauf Georg Viegas Fernanda Wald Ingo Wan Ming Weinkauf Tino Weiskopf Daniel Wischgoll Thomas Wylie Brian Yeasin Mohammed Yuan Xiaoru Zachmann Gabriel Zhang Hui Zhao Ye Zhukov Leonid
Sandia National Laboratories, USA Ohio State University, USA Stanford University, USA Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore Renaissance Computing Institute (RENCI), USA University of Magdeburg, Germany University of Mannheim, Germany Petrobras PUC-RIO, Brazil Purdue University, USA HTWG Constance, Germany IBM, USA University of Utah, USA Boeing Phantom Works, USA Courant Institute, New York University, USA University of Stuttgart, Germany Wright State University, USA Sandia National Laboratory, USA Memphis University, USA Peking University, China Clausthal University, Germany Indiana University, USA Kent State University, USA Caltech, USA
ISVC 2011 Special Tracks

1. 3D Mapping, Modeling and Surface Reconstruction

Organizers
Nefian Ara – Carnegie Mellon University/NASA Ames Research Center, USA
Edwards Laurence – NASA Ames Research Center, USA
Huertas Andres – NASA Jet Propulsion Lab, USA
Program Committee
Bradski Gary – Willow Garage, USA
Zakhor Avideh – University of California at Berkeley, USA
Cavallaro Andrea – University Queen Mary, London, UK
Bouguet Jean-Yves – Google, USA
2. Best Practices in Teaching Visual Computing

Organizers
Albu Alexandra Branzan – University of Victoria, Canada
Bebis George – University of Nevada, Reno, USA and King Saud University, Saudi Arabia
Program Committee
Antonacopoulos Apostolos – University of Salford, UK
Bellon Olga Regina Pereira – Universidade Federal do Parana, Brazil
Bowyer Kevin – University of Notre Dame, USA
Crawfis Roger – Ohio State University, USA
Hammoud Riad – DynaVox Systems, USA
Kakadiaris Ioannis – University of Houston, USA
Lladós Josep – Universitat Autonoma de Barcelona, Spain
Sarkar Sudeep – University of South Florida, USA
3. Immersive Visualization

Organizers
Sherman Bill – Indiana University, USA
Wernert Eric – Indiana University, USA
O'Leary Patrick – University of Calgary, Canada
Coming Daniel – Desert Research Institute, USA
Program Committee
Su Simon – Princeton University, USA
Folcomer Samuel – Brown University, USA
Brady Rachael – Duke University, USA
Johnson Andy – University of Illinois at Chicago, USA
Kreylos Oliver – University of California at Davis, USA
Will Jeffrey – Valparaiso University, USA
Moreland John – Purdue University, Calumet, USA
Leigh Jason – University of Illinois, Chicago, USA
Schulze Jurgen – University of California, San Diego, USA
Sanyal Jibonananda – Mississippi State University, USA
Stone John – University of Illinois, Urbana-Champaign, USA
Kuhlen Torsten – Aachen University, Germany
4. Computational Bioimaging

Organizers
Tavares João Manuel R.S. – University of Porto, Portugal
Natal Jorge Renato – University of Porto, Portugal
Cunha Alexandre – Caltech, USA
Program Committee Santis De Alberto Reis Ana Mafalda Barrutia Arrate Muñoz Calvo Begoña Constantinou Christos Iacoviello Daniela Ushizima Daniela Ziou Djemel Pires Eduardo Borges Sgallari Fiorella Perales Francisco Qiu Guoping Hanchuan Peng Pistori Hemerson Yanovsky Igor Corso Jason Maldonado Javier Melenchón Marques Jorge S. Aznar Jose M. García Vese Luminita Reis Luís Paulo Thiriet Marc Mahmoud El-Sakka Hidalgo Manuel González Gurcan Metin N. Dubois Patrick Barneva Reneta P. Bellotti Roberto Tangaro Sabina Silva Susana Branco Brimkov Valentin Zhan Yongjie
Università degli Studi di Roma “La Sapienza”, Italy Instituto de Ciências Biomédicas Abel Salazar, Portugal University of Navarra, Spain University of Zaragoza, Spain Stanford University, USA Università degli Studi di Roma “La Sapienza”, Italy Lawrence Berkeley National Lab, USA University of Sherbrooke, Canada Instituto Superior Técnico, Portugal University of Bologna, Italy Balearic Islands University, Spain University of Nottingham, UK Howard Hughes Medical Institute, USA Dom Bosco Catholic University, Brazil Jet Propulsion Laboratory, USA SUNY at Buffalo, USA Open University of Catalonia, Spain Instituto Superior Técnico, Portugal University of Zaragoza, Spain University of California at Los Angeles, USA University of Porto, Portugal Université Pierre et Marie Curie (Paris VI), France The University of Western Ontario London, Canada Balearic Islands University, Spain Ohio State University, USA Institut de Technologie Médicale, France State University of New York, USA University of Bari, Italy University of Bari, Italy University of Lisbon, Portugal State University of New York, USA Carnegie Mellon University, USA
5. Interactive Visualization in Novel and Heterogeneous Display Environments

Organizers
Rosenbaum Rene – University of California, Davis, USA
Tominski Christian – University of Rostock, Germany
Program Committee
Isenberg Petra – INRIA, France
Isenberg Tobias – University of Groningen, The Netherlands and CNRS/INRIA, France
Kerren Andreas – Linnaeus University, Sweden
Majumder Aditi – University of California, Irvine, USA
Quigley Aaron – University of St. Andrews, UK
Schumann Heidrun – University of Rostock, Germany
Sips Mike – GFZ Potsdam, Germany
Slavik Pavel – Czech Technical University in Prague, Czech Republic
Weiskopf Daniel – University of Stuttgart, Germany
Additional Reviewers
Payet Nadia – Hewlett Packard Labs, USA
Hong Wei – Hewlett Packard Labs, USA
Organizing Institutions and Sponsors
Table of Contents – Part II
ST: Immersive Visualization Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . John E. Stone, Kirby L. Vandivort, and Klaus Schulten The OmegaDesk: Towards a Hybrid 2D and 3D Work Desk . . . . . . . . . . . Alessandro Febretti, Victor A. Mateevitsi, Dennis Chau, Arthur Nishimoto, Brad McGinnis, Jakub Misterka, Andrew Johnson, and Jason Leigh Disambiguation of Horizontal Direction for Video Conference Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mabel Mengzi Zhang, Seth Rotkin, and J¨ urgen P. Schulze Immersive Visualization and Interactive Analysis of Ground Penetrating Radar Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthew R. Sgambati, Steven Koepnick, Daniel S. Coming, Nicholas Lancaster, and Frederick C. Harris Jr.
1 13
24
33
Handymap: A Selection Interface for Cluttered VR Environments Using a Tracked Hand-Held Touch Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mores Prachyabrued, David L. Ducrest, and Christoph W. Borst
45
Virtual Interrupted Suturing Exercise with the Endo Stitch Suturing Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sukitti Punak, Sergei Kurenov, and William Cance
55
Applications New Image Steganography via Secret-Fragment-Visible Mosaic Images by Nearly-Reversible Color Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . Ya-Lin Li and Wen-Hsiang Tsai
64
Adaptive and Nonlinear Techniques for Visibility Improvement of Hazy Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Saibabu Arigela and Vijayan K. Asari
75
Linear Clutter Removal from Urban Panoramas . . . . . . . . . . . . . . . . . . . . . . Mahsa Kamali, Eyal Ofek, Forrest Iandola, Ido Omer, and John C. Hart
85
Efficient Starting Point Decision for Enhanced Hexagonal Search . . . . . . . Do-Kyung Lee and Je-Chang Jeong
95
Multiview 3D Pose Estimation of a Wand for Human-Computer Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X. Zabulis, P. Koutlemanis, H. Baltzakis, and D. Grammenos
104
Object Detection and Recognition II Material Information Acquisition Using a ToF Range Sensor for Interactive Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Abdul Mannan, Hisato Fukuda, Yoshinori Kobayashi, and Yoshinori Kuno A Neuromorphic Approach to Object Detection and Recognition in Airborne Videos with Stabilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yang Chen, Deepak Khosla, David Huber, Kyungnam Kim, and Shinko Y. Cheng
116
126
Retrieval of 3D Polygonal Objects Based on Multiresolution Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto Lam and J.M. Hans du Buf
136
3D Facial Feature Detection Using Iso-Geodesic Stripes and Shape-Index Based Integral Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . James Allen, Nikhil Karkera, and Lijun Yin
148
Hybrid Face Recognition Based on Real-Time Multi-camera Stereo-Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. Hensler, K. Denker, M. Franz, and G. Umlauf
158
Learning Image Transformations without Training Examples . . . . . . . . . . Sergey Pankov
168
Virtual Reality Investigation of Secondary Views in a Multimodal VR Environment: 3D Lenses, Windows, and Mirrors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Phanidhar Bezawada Raghupathy and Christoph W. Borst
180
Synthesizing Physics-Based Vortex and Collision Sound in Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Damon Shing-Min Liu, Ting-Wei Cheng, and Yu-Cheng Hsieh
190
BlenSor: Blender Sensor Simulation Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . Michael Gschwandtner, Roland Kwitt, Andreas Uhl, and Wolfgang Pree
199
Fuzzy Logic Based Sensor Fusion for Accurate Tracking . . . . . . . . . . . . . . . Ujwal Koneru, Sangram Redkar, and Anshuman Razdan
209
A Flight Tested Wake Turbulence Aware Altimeter . . . . . . . . . . . . . . . . . . . Scott Nykl, Chad Mourning, Nikhil Ghandi, and David Chelberg A Virtual Excavation: Combining 3D Immersive Virtual Reality and Geophysical Surveying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Albert Yu-Min Lin, Alexandre Novo, Philip P. Weber, Gianfranco Morelli, Dean Goodman, and J¨ urgen P. Schulze
219
229
ST: Best Practices in Teaching Visual Computing Experiences in Disseminating Educational Visualizations . . . . . . . . . . . . . . Nathan Andrysco, Paul Rosen, Voicu Popescu, Bedˇrich Beneˇs, and Kevin Robert Gurney Branches and Roots: Project Selection in Graphics Courses for Fourth Year Computer Science Undergraduates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.D. Jones Raydiance: A Tangible Interface for Teaching Computer Vision . . . . . . . . Paul Reimer, Alexandra Branzan Albu, and George Tzanetakis
239
249
259
Poster Session Subvoxel Super-Resolution of Volumetric Motion Field Using General Order Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Koji Kashu, Atsushi Imiya, and Tomoya Sakai Architectural Style Classification of Building Facade Windows . . . . . . . . . Gayane Shalunts, Yll Haxhimusa, and Robert Sablatnig Are Current Monocular Computer Vision Systems for Human Action Recognition Suitable for Visual Surveillance Applications? . . . . . . . . . . . . Jean-Christophe Nebel, Michal Lewandowski, J´erˆ ome Th´evenon, Francisco Mart´ınez, and Sergio Velastin Near-Optimal Time Function for Secure Dynamic Visual Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V. Petrauskiene, J. Ragulskiene, E. Sakyte, and M. Ragulskis Vision-Based Horizon Detection and Target Tracking for UAVs . . . . . . . . Yingju Chen, Ahmad Abushakra, and Jeongkyu Lee Bag-of-Visual-Words Approach to Abnormal Image Detection In Wireless Capsule Endoscopy Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sae Hwang
270
280
290
300
310
320
A Relevance Feedback Framework for Image Retrieval Based on Ant Colony Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guang-Peng Chen, Yu-Bin Yang, Yao Zhang, Ling-Yan Pan, Yang Gao, and Lin Shang A Closed Form Algorithm for Superresolution . . . . . . . . . . . . . . . . . . . . . . . Marcelo O. Camponez, Evandro O.T. Salles, and M´ ario Sarcinelli-Filho A Parallel Hybrid Video Coding Method Based on Noncausal Prediction with Multimode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cui Wang and Yoshinori Hatori Color-Based Extensions to MSERs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Aaron Chavez and David Gustafson 3D Model Retrieval Using the Histogram of Orientation of Suggestive Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sang Min Yoon and Arjan Kuijper Adaptive Discrete Laplace Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christophe Fiorio, Christian Mercat, and Fr´ed´eric Rieux Stereo Vision-Based Improving Cascade Classifier Learning for Vehicle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonghwan Kim, Chung-Hee Lee, Young-Chul Lim, and Soon Kwon Towards a Universal and Limited Visual Vocabulary . . . . . . . . . . . . . . . . . . Jian Hou, Zhan-Shen Feng, Yong Yang, and Nai-Ming Qi
328
338
348 358
367 377
387 398
Human Body Shape and Motion Tracking by Hierarchical Weighted ICP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jia Chen, Xiaojun Wu, Michael Yu Wang, and Fuqin Deng
408
Multi-view Head Detection and Tracking with Long Range Capability for Social Navigation Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Razali Tomari, Yoshinori Kobayashi, and Yoshinori Kuno
418
A Fast Video Stabilization System Based on Speeded-up Robust Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Minqi Zhou and Vijayan K. Asari
428
Detection of Defect in Textile Fabrics Using Optimal Gabor Wavelet Network and Two-Dimensional PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Srikaew, K. Attakitmongcol, P. Kumsawat, and W. Kidsang
436
Introducing Confidence Maps to Increase the Performance of Person Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andreas Zweng and Martin Kampel
446
Monocular Online Learning for Road Region Labeling and Object Detection from a Moving Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chung-Ching Lin and Marilyn Wolf
456
Detection and Tracking Faces in Unconstrained Color Video Streams . . . Corn´elia Janayna P. Passarinho, Evandro Ottoni T. Salles, and M´ ario Sarcinelli-Filho
466
Model-Based Chart Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ales Mishchenko and Natalia Vassilieva
476
Kernel-Based Motion-Blurred Target Tracking . . . . . . . . . . . . . . . . . . . . . . . Yi Wu, Jing Hu, Feng Li, Erkang Cheng, Jingyi Yu, and Haibin Ling
486
Robust Foreground Detection in Videos Using Adaptive Color Histogram Thresholding and Shadow Removal . . . . . . . . . . . . . . . . . . . . . . . Akintola Kolawole and Alireza Tavakkoli
496
Deformable Object Shape Refinement and Tracking Using Graph Cuts and Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mehmet Kemal Kocamaz, Yan Lu, and Christopher Rasmussen
506
A Non-intrusive Method for Copy-Move Forgery Detection . . . . . . . . . . . . Najah Muhammad, Muhammad Hussain, Ghulam Muhamad, and George Bebis An Investigation into the Use of Partial Face in the Mobile Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G. Mallikarjuna Rao, Praveen Kumar, G. Vijaya Kumari, Amit Pande, and G.R. Babu
516
526
Optimal Multiclass Classifier Threshold Estimation with Particle Swarm Optimization for Visual Object Recognition . . . . . . . . . . . . . . . . . . Shinko Y. Cheng, Yang Chen, Deepak Khosla, and Kyungnam Kim
536
A Parameter-Free Locality Sensitive Discriminant Analysis and Its Application to Coarse 3D Head Pose Estimation . . . . . . . . . . . . . . . . . . . . . A. Bosaghzadeh and F. Dornaika
545
Image Set-Based Hand Shape Recognition Using Camera Selection Driven by Multi-class AdaBoosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Ohkawa, Chendra Hadi Suryanto, and Kazuhiro Fukui
555
Image Segmentation Based on k -Means Clustering and Energy-Transfer Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Gaura, Eduard Sojka, and Michal Krumnikl
567
SERP: SURF Enhancer for Repeated Pattern . . . . . . . . . . . . . . . . . . . . . . . Seung Jun Mok, Kyungboo Jung, Dong Wook Ko, Sang Hwa Lee, and Byung-Uk Choi
578
Shape Abstraction through Multiple Optimal Solutions . . . . . . . . . . . . . . . Marlen Akimaliev and M. Fatih Demirci
588
Evaluating Feature Combination in Object Classification . . . . . . . . . . . . . . Jian Hou, Bo-Ping Zhang, Nai-Ming Qi, and Yong Yang
597
Solving Geometric Co-registration Problem of Multi-spectral Remote Sensing Imagery Using SIFT-Based Features toward Precise Change Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mostafa Abdelrahman, Asem Ali, Shireen Elhabian, and Aly A. Farag Color Compensation Using Nonlinear Luminance-RGB Component Curve of a Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sejung Yang, Yoon-Ah Kim, Chaerin Kang, and Byung-Uk Lee Augmenting Heteronanostructure Visualization with Haptic Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michel Abdul-Massih, Bedˇrich Beneˇs, Tong Zhang, Christopher Platzer, William Leavenworth, Huilong Zhuo, Edwin R. Garc´ıa, and Zhiwen Liang An Analysis of Impostor Based Level of Detail Approximations for LIDAR Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chad Mourning, Scott Nykl, and David Chelberg UI Generation for Data Visualisation in Heterogenous Environment . . . . Miroslav Macik, Martin Klima, and Pavel Slavik An Open-Source Medical Image Processing and Visualization Tool to Analyze Cardiac SPECT Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luis Roberto Pereira de Paula, Carlos da Silva dos Santos, Marco Antonio Gutierrez, and Roberto Hirata Jr. CollisionExplorer: A Tool for Visualizing Droplet Collisions in a Turbulent Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.V. Rohith, Hossein Parishani, Orlando Ayala, Lian-Ping Wang, and Chandra Kambhamettu A Multi Level Time Model for Interactive Multiple Dataset Visualization: The Dataset Sequencer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Beer, Gerrit Garbereder, Tobias Meisen, Rudolf Reinhard, and Torsten Kuhlen Automatic Generation of Aesthetic Patterns with the Use of Dynamical Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Krzysztof Gdawiec, Wieslaw Kotarski, and Agnieszka Lisowska
607
617
627
637
647
659
669
681
691
A Comparative Evaluation of Feature Detectors on Historic Repeat Photography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher Gat, Alexandra Branzan Albu, Daniel German, and Eric Higgs Controllable Simulation of Particle System . . . . . . . . . . . . . . . . . . . . . . . . . . Muhammad Rusdi Syamsuddin and Jinwook Kim
701
715
3D-City Modeling: A Semi-Automatic Framework for Integrating Different Terrain Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mattias Roup´e and Mikael Johansson
725
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
735
Table of Contents – Part I
ST: Computational Bioimaging EM+TV Based Reconstruction for Cone-Beam CT with Reduced Radiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming Yan, Jianwen Chen, Luminita A. Vese, John Villasenor, Alex Bui, and Jason Cong A Localization Framework under Non-rigid Deformation for Robotic Surgery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiang Xiang Global Image Registration by Fast Random Projection . . . . . . . . . . . . . . . Hayato Itoh, Shuang Lu, Tomoya Sakai, and Atsushi Imiya
1
11
23
EM-Type Algorithms for Image Reconstruction with Background Emission and Poisson Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming Yan
33
Region-Based Segmentation of Parasites for High-throughput Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Asher Moody-Davis, Laurent Mennillo, and Rahul Singh
43
Computer Graphics I Adaptive Coded Aperture Photography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Oliver Bimber, Haroon Qureshi, Anselm Grundh¨ ofer, Max Grosse, and Daniel Danch
54
Display Pixel Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Clemens Birklbauer, Max Grosse, Anselm Grundh¨ ofer, Tianlun Liu, and Oliver Bimber
66
Image Relighting by Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiao Teng and Tat-Jen Cham
78
Generating EPI Representations of 4D Light Fields with a Single Lens Focused Plenoptic Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sven Wanner, Janis Fehr, and Bernd J¨ ahne
90
MethMorph: Simulating Facial Deformation Due to Methamphatamine Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mahsa Kamali, Forrest N. Iandola, Hui Fang, and John C. Hart
102
Motion and Tracking I Segmentation-Free, Area-Based Articulated Object Tracking . . . . . . . . . . . Daniel Mohr and Gabriel Zachmann
112
An Attempt to Segment Foreground in Dynamic Scenes . . . . . . . . . . . . . . . Xiang Xiang
124
From Saliency to Eye Gaze: Embodied Visual Selection for a Pan-Tilt-Based Robotic Head . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matei Mancas, Fiora Pirri, and Matia Pizzoli
135
Adaptive Two-Step Adjustable Partial Distortion Search Algorithm for Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yonghoon Kim, Dokyung Lee, and Jechang Jeong
147
Feature Trajectory Retrieval with Application to Accurate Structure and Motion Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kai Cordes, Oliver M¨ uller, Bodo Rosenhahn, and J¨ orn Ostermann
156
Distortion Compensation for Movement Detection Based on Dense Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Josef Maier and Kristian Ambrosch
168
Segmentation Free Boundary Conditions Active Contours with Applications for Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michal Shemesh and Ohad Ben-Shahar
180
Evolving Content-Driven Superpixels for Accurate Image Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard J. Lowe and Mark S. Nixon
192
A Parametric Active Polygon for Leaf Segmentation and Shape Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guillaume Cerutti, Laure Tougne, Antoine Vacavant, and Didier Coquin
202
Avoiding Mesh Folding in 3D Optimal Surface Segmentation . . . . . . . . . . Christian Bauer, Shanhui Sun, and Reinhard Beichel
214
High Level Video Temporal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . Ruxandra Tapu and Titus Zaharia
224
Embedding Gestalt Laws on Conditional Random Field for Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Olfa Besbes, Nozha Boujemaa, and Ziad Belhadj
236
Higher Order Markov Networks for Model Estimation . . . . . . . . . . . . . . . . Toufiq Parag and Ahmed Elgammal
246
Visualization I Interactive Object Graphs for Debuggers with Improved Visualization, Inspection and Configuration Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anthony Savidis and Nikos Koutsopoulos
259
GPU-Based Ray Casting of Stacked Out-of-Core Height Fields . . . . . . . . . Christopher Lux and Bernd Fr¨ ohlich
269
Multi-View Stereo Point Clouds Visualization . . . . . . . . . . . . . . . . . . . . . . . Yi Gong and Yuan-Fang Wang
281
Depth Map Enhancement Using Adaptive Steering Kernel Regression Based on Distance Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sung-Yeol Kim, Woon Cho, Andreas Koschan, and Mongi A. Abidi Indented Pixel Tree Browser for Exploring Huge Hierarchies . . . . . . . . . . . Michael Burch, Hansj¨ org Schmauder, and Daniel Weiskopf
291
301
ST: 3D Mapping, Modeling and Surface Reconstruction I
Towards Realtime Handheld MonoSLAM in Dynamic Environments . . . Samunda Perera and Ajith Pasqual
313
Registration of 3D Geometric Model and Color Images Using SIFT and Range Intensity Images . . . Ryo Inomata, Kenji Terabayashi, Kazunori Umeda, and Guy Godin
325
Denoising Time-Of-Flight Data with Adaptive Total Variation . . . Frank Lenzen, Henrik Schäfer, and Christoph Garbe
337
Efficient City-Sized 3D Reconstruction from Ultra-High Resolution Aerial and Ground Video Imagery . . . Alexandru N. Vasile, Luke J. Skelly, Karl Ni, Richard Heinrichs, and Octavia Camps
347
Non-Parametric Sequential Frame Decimation for Scene Reconstruction in Low-Memory Streaming Environments . . . Daniel Knoblauch, Mauricio Hess-Flores, Mark A. Duchaineau, Kenneth I. Joy, and Falko Kuester
359
Biomedical Imaging
Ground Truth Estimation by Maximizing Topological Agreements in Electron Microscopy Data . . . Huei-Fang Yang and Yoonsuck Choe
371
Segmentation and Cell Tracking of Breast Cancer Cells . . . Adele P. Peskin, Daniel J. Hoeppner, and Christina H. Stuelten
381
Registration for 3D Morphological Comparison of Brain Aneurysm Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Carl Lederman, Luminita Vese, and Aichi Chien
392
An Interactive Editing Framework for Electron Microscopy Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huei-Fang Yang and Yoonsuck Choe
400
Retinal Vessel Extraction Using First-Order Derivative of Gaussian and Morphological Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, Christopher G. Owen, Alicja R. Rudnicka, and S.A. Barman
410
Computer Graphics II
High-Quality Shadows with Improved Paraboloid Mapping . . . Juraj Vanek, Jan Navrátil, Adam Herout, and Pavel Zemčík
421
Terramechanics Based Terrain Deformation for Real-Time Off-Road Vehicle Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Zhu, Xiao Chen, and G. Scott Owen
431
An Approach to Point Based Approximate Color Bleeding with Volumes . . . Christopher J. Gibson and Zoë J. Wood
441
3D Reconstruction of Buildings with Automatic Facade Refinement . . . C. Larsen and T.B. Moeslund
451
Surface Reconstruction of Maltese Cisterns Using ROV Sonar Data for Archeological Study . . . C. Forney, J. Forrester, B. Bagley, W. McVicker, J. White, T. Smith, J. Batryn, A. Gonzalez, J. Lehr, T. Gambin, C.M. Clark, and Z.J. Wood
461
ST: Interactive Visualization in Novel and Heterogeneous Display Environments Supporting Display Scalability by Redundant Mapping . . . . . . . . . . . . . . . Axel Radloff, Martin Luboschik, Mike Sips, and Heidrun Schumann
472
A New 3D Imaging System Using a Portable Two-Camera Omni-Imaging Device for Construction and Browsing of Human-Reachable Environments . . . Yu-Tung Kuo and Wen-Hsiang Tsai
484
Physical Navigation to Support Graph Exploration on a Large High-Resolution Display . . . Anke Lehmann, Heidrun Schumann, Oliver Staadt, and Christian Tominski
496
An Extensible Interactive 3D Visualization Framework for N-Dimensional Datasets Used in Heterogeneous Software Display Environments . . . Nathaniel Rossol, Irene Cheng, John Berezowski, and Iqbal Jamal
508
Improving Collaborative Visualization of Structural Biology . . . Aaron Bryden, George N. Phillips Jr., Yoram Griguer, Jordan Moxon, and Michael Gleicher
518
Involve Me and I Will Understand!–Abstract Data Visualization in Immersive Environments . . . René Rosenbaum, Jeremy Bottleson, Zhuiguang Liu, and Bernd Hamann
530
Object Detection and Recognition I Automated Fish Taxonomy Using Evolution-COnstructed Features . . . . . Kirt Lillywhite and Dah-Jye Lee
541
A Monocular Human Detection System Based on EOH and Oriented LBP Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yingdong Ma, Xiankai Chen, Liu Jin, and George Chen
551
Using the Shadow as a Single Feature for Real-Time Monocular Vehicle Pose Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dennis Rosebrock, Markus Rilk, Jens Spehr, and Friedrich M. Wahl
563
Multi-class Object Layout with Unsupervised Image Classification and Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ser-Nam Lim, Gianfranco Doretto, and Jens Rittscher
573
Efficient Detection of Consecutive Facial Expression Apices Using Biologically Based Log-Normal Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zakia Hammal
586
DTTM: A Discriminative Temporal Topic Model for Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lifeng Shang, Kwok-Ping Chan, and Guodong Pan
596
Visualization II Direct Spherical Parameterization of 3D Triangular Meshes Using Local Flattening Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Mocanu and Titus Zaharia
607
Segmentation and Visualization of Multivariate Features Using Feature-Local Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kenny Gruchalla, Mark Rast, Elizabeth Bradley, and Pablo Mininni
619
Magic Marker: A Color Analytics Interface for Image Annotation . . . Supriya Garg, Kshitij Padalkar, and Klaus Mueller
629
BiCluster Viewer: A Visualization Tool for Analyzing Gene Expression Data . . . Julian Heinrich, Robert Seifert, Michael Burch, and Daniel Weiskopf
641
Visualizing Translation Variation: Shakespeare’s Othello . . . Zhao Geng, Robert S. Laramee, Tom Cheesman, Alison Ehrmann, and David M. Berry
653
ST: 3D Mapping, Modeling and Surface Reconstruction II
3D Object Modeling with Graphics Hardware Acceleration and Unsupervised Neural Networks . . . Felipe Montoya–Franco, Andrés F. Serna–Morales, and Flavio Prieto
664
Event-Based Stereo Matching Approaches for Frameless Address Event Stereo Data . . . Jürgen Kogler, Martin Humenberger, and Christoph Sulzbachner
674
A Variational Model for the Restoration of MR Images Corrupted by Blur and Rician Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pascal Getreuer, Melissa Tong, and Luminita A. Vese
686
Robust Classification of Curvilinear and Surface-Like Structures in 3d Point Cloud Data . . . Mahsa Kamali, Matei Stroila, Jason Cho, Eric Shaffer, and John C. Hart
699
Orthographic Stereo Correlator on the Terrain Model for Apollo Metric Images . . . Taemin Kim, Kyle Husmann, Zachary Moratto, and Ara V. Nefian
709
Motion and Tracking II Collaborative Track Analysis, Data Cleansing, and Labeling . . . . . . . . . . . George Kamberov, Gerda Kamberova, Matt Burlick, Lazaros Karydas, and Bart Luczynski
718
Time to Collision and Collision Risk Estimation from Local Scale and Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shrinivas Pundlik, Eli Peli, and Gang Luo
728
Visual Tracking Based on Log-Euclidean Riemannian Sparse Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi Wu, Haibin Ling, Erik Blasch, Li Bai, and Genshe Chen
738
Panoramic Background Generation and Abnormal Behavior Detection in PTZ Camera Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sang-Hyun Cho and Hang-Bong Kang
748
Computing Range Flow from Multi-modal Kinect Data . . . . . . . . . . . . . . . Jens-Malte Gottfried, Janis Fehr, and Christoph S. Garbe
758
Real-Time Object Tracking on iPhone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amin Heidari and Parham Aarabi
768
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
779
Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories
John E. Stone (1), Kirby L. Vandivort (1), and Klaus Schulten (1,2)
(1) Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign
(2) Department of Physics, University of Illinois at Urbana-Champaign
Abstract. Atomistic molecular dynamics (MD) simulations of biomolecules provide insight into their physical mechanisms and potential as drug targets. Unfortunately, such simulations are extremely demanding in terms of computation, storage, and visualization. Immersive visualization environments permit fast, intuitive exploration of the pharmacological potential, but add further demands on resources. We describe the design and application of out-of-core visualization techniques for large-size and long-timescale MD simulations involving many terabytes of data, including in particular: fast regeneration of molecular representations, atom selection mechanisms, out-of-core optimized MD trajectory file formats, and multithreaded programming techniques. Our approach leverages technological advances in commodity solid state disk (SSD) devices, to enable trajectory animation rates for large structures that were previously unachievable except by in-core approaches, while maintaining full visualization flexibility. The out-of-core visualization techniques are implemented and evaluated in VMD, a widely used molecular visualization tool.
1
Introduction
Biomedically-relevant cellular processes take place in molecular assemblies made of millions to hundreds of millions of atoms. Atomistic molecular dynamics (MD) simulations of these structures provide insight into their physical mechanisms and potential as drug targets. Unfortunately, such simulations are extremely demanding in terms of computation, storage, and visualization. Biomedical researchers wish to understand the structure and function of cellular machinery including individual protein subunits made of thousands of atoms, as well as the entire machine made of millions of atoms. By observing the details of three-dimensional molecular structures and their dynamics through the MD “computational microscope”, researchers can gain insight into biological processes that are too fast to observe first hand, or that occur in the dense environment of living cells that cannot be seen with even the most advanced experimental microscopes. Immersive visualization environments that combine high-framerate stereoscopic display and six-degree-of-freedom motion input permit views of complex
molecular structures and their dynamics, but the application of these techniques to large molecular models and to long-timescale simulation trajectories remains challenging due to the sheer size of the data. State-of-the-art petascale MD simulations produce terabytes of data and are far too large to visualize using in-core approaches, even on high-end hardware. Although it is possible to load a small subset of an MD trajectory by culling out portions of the simulated structure or skipping hundreds or thousands of trajectory frames, this can lead to detailed or rare events going unseen. It is thus preferable that immersive molecular visualization tools enable the user to make these judgments entirely interactively, without resorting to off-line preprocessing techniques. The most recent generation of commodity solid state disks (SSDs) provide sequential read I/O bandwidths that are up to five times faster than traditional magnetic hard drives – up to 500MB/sec for a single SSD drive. These SSD I/O rates are fast enough to enable out-of-core trajectory streaming to cross a key performance threshold, enabling the computational microscope to migrate from use only for small molecules or batch mode analysis work into the realm of large scale immersive visualization. The raw performance of SSD technology is insufficient to enable interactive immersive visualization by itself. Many details must be taken into consideration in the design of the trajectory I/O system, for example, the on-disk and in-memory organization of MD trajectory data, and the way that trajectory data is processed within the visualization code, leading to specific optimizations for interactive out-of-core MD trajectory animation. Out-of-core immersive visualization techniques have previously been used for visualization of LiDAR data [1], particle traces [2], and large-scale static scenes [3]. Our effort differs from the out-of-core techniques used in other domains in that we do not perform off-line preprocessing of datasets prior to visualization, and our techniques apply both to large systems and to long MD simulation trajectories. Out-of-core MD trajectory loading techniques have been implemented in both published and unpublished works in the past [4], but to our knowledge these efforts have not attempted to achieve the I/O and rendering rates required for smooth playback of biomolecular structures containing millions of atoms within an immersive environment, and they did not support interactive user selection of atoms or concurrent display of multiple graphical representations for different atom selections. We describe the design and application of out-of-core visualization techniques for immersive display of large-size and long-timescale MD simulations, using fast algorithms for generating graphical representations of molecular geometry, outof-core optimized MD trajectory file formats, high performance I/O approaches, and multithreaded programming techniques. Our approach also leverages technological advances in commodity solid state disk (SSD) devices, to enable trajectory animation rates for large structures previously unachievable except by in-core approaches. These improvements effectively eliminate disk I/O as a visualization bottleneck for all but the largest petascale simulations, enabling users to work with large simulations, unhindered by the need to limit datasets to the capacity of physical memory or to use off-line preprocessing to reduce the size
of trajectories. These algorithms, optimization approaches, and trajectory file formats, have been implemented in a specially modified version of the widely used molecular visualization tool VMD [5, 6], and have been evaluated on a commodity PC hardware platform running Linux.
2
Out-of-Core Visualization of MD Trajectories
Scientists studying the structure and function of biomolecules have long found immersive visualization techniques useful for elucidating the details of the threedimensional structure, leading to better understanding of the function of the molecular machinery of living cells. Immersive visualization techniques require high-framerate stereoscopic rendering, typically with head-tracking, and responsive six-degree-of-freedom motion input. The most significant challenge for immersive visualization of large biomolecular complexes is to maintain stereoscopic rendering rates above 30 Hz, and, ideally, 48 Hz or more. When displaying static molecular structures, a molecular visualization tool can afford to spend time optimizing the graphics operations for best performance by pre-processing meshes to create efficient sequences of triangle strips, by sorting geometry so that a minimal number of rendering passes are required, and by sorting the geometry in each rendering pass into sequences that minimize graphics state changes. All of these optimizations together can enable a molecular visualization tool to achieve performance levels that may be five to ten times faster than an unoptimized scene. The high stereoscopic display rate requirements of immersive visualization are particularly challenging for the display of time-varying biomolecular structures. When animating MD trajectories to explore the dynamics of biomolecules, a molecular scientist wants to experience smooth, movie-like playback, free from any perceivable stuttering. This means that useful MD trajectory visualization must play trajectories at a rate of at least 10 to 15 frames per second, but ideally 30 frames per second or more. This creates a difficult task for a molecular visualization tool, since every atom moves at every timestep, and the brief interval between the display of trajectory timesteps offers little opportunity for the kinds of mesh and scene graph optimizations that are used for static structures. Biomedical researchers typically make heavy use of simplified graphical representations of proteins and nucleic acids. These simplified representations replace all-atom atomic detail with helical backbone ribbons and glyphs for nucleotides, sugars, and bases. These so-called “ribbon” or “cartoon” representations of secondary structure involve dynamic computation of space curves and extrusions, with user-specified material properties and colors that vary atom-byatom, or residue-by-residue. These computations are both arithmetic-intensive and memory-access intensive, since they must traverse the molecular topology for the components of the atomic structure that are selected for display. In order to minimize the impact of these computations for classical mechanics simulations where molecular structures retain a fixed bond topology, our implementation preanalyzes the molecular structure upon initial loading, and builds data structures
for fast traversal of key sub-components (e.g. protein and nucleic acid fragments) of the full molecular model, allowing the algorithms that generate per-timestep graphical representations to examine only the atomic structure data required by the atom selections and graphical representations that the user has specified. These optimizations often yield a 10× performance boost for large simulations such as the BAR domain visualization test case presented below. The remaining calculations involved in generating graphical representations are mostly arithmetic intensive, as a result of the optimizations alluded to above. While some of these calculations are well-suited for GPU computing techniques [4, 7–11], others involve divergent branches and irregular memory access patterns that make them impractical to migrate onto current-generation GPUs, so they remain a computational burden for animation of large MD trajectories. The final challenges that have prevented the use of immersive visualization techniques for large MD simulation trajectories were that in-core solutions lacked the necessary memory capacity, while out-of-core solutions lacked the necessary I/O bandwidth and did not provide jitter-free streaming performance. As described below, the I/O performance limitations that have hampered out-of-core approaches have been addressed through the combined use of improved MD trajectory file structures and I/O approaches, and state-of-the-art SSD storage devices. With the I/O performance problem mitigated, jitter-free interactive rendering is ensured by performing I/O asynchronously in a separate thread. While the main visualization thread consumes a trajectory timestep and performs the necessary geometric computations and rendering work, the I/O thread reads ahead to the next timestep.
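The read-ahead scheme described in the previous paragraph can be sketched as a small producer-consumer arrangement. The following is an illustrative C++ outline rather than VMD's actual implementation; readNextTimestep() is a stand-in for whatever trajectory decoder is in use, and is stubbed here with synthetic frames so the sketch is self-contained:

// Minimal sketch of double-buffered asynchronous trajectory read-ahead: an
// I/O thread decodes the next timestep while the visualization thread renders
// the current one. Illustration only; not VMD's actual reader.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Timestep = std::vector<float>;            // x,y,z per atom, interleaved

class ReadAheadStream {
public:
  explicit ReadAheadStream(std::size_t natoms)
      : natoms_(natoms), io_(&ReadAheadStream::ioLoop, this) {}
  ~ReadAheadStream() { io_.join(); }

  // Visualization thread: blocks only if the I/O thread has fallen behind.
  bool next(Timestep &out) {
    std::unique_lock<std::mutex> lk(m_);
    notEmpty_.wait(lk, [this] { return !q_.empty() || eof_; });
    if (q_.empty()) return false;               // end of trajectory
    out = std::move(q_.front());
    q_.pop();
    notFull_.notify_one();
    return true;
  }

private:
  // Stand-in for a real file-format decoder: emits 100 synthetic timesteps.
  bool readNextTimestep(Timestep &ts) {
    if (frame_ >= 100) return false;
    ts.assign(3 * natoms_, static_cast<float>(frame_++));
    return true;
  }

  void ioLoop() {
    Timestep ts;
    while (readNextTimestep(ts)) {
      std::unique_lock<std::mutex> lk(m_);
      notFull_.wait(lk, [this] { return q_.size() < 2; });  // keep at most two frames queued
      q_.push(std::move(ts));
      notEmpty_.notify_one();
    }
    std::lock_guard<std::mutex> lk(m_);
    eof_ = true;
    notEmpty_.notify_one();
  }

  std::size_t natoms_;
  int frame_ = 0;
  std::queue<Timestep> q_;
  std::mutex m_;
  std::condition_variable notEmpty_, notFull_;
  bool eof_ = false;
  std::thread io_;                              // declared last: started after other members
};

int main() {
  ReadAheadStream stream(955226);               // an STMV-sized timestep
  Timestep ts;
  int frames = 0;
  while (stream.next(ts)) ++frames;             // per-frame geometry and rendering would go here
  return frames == 100 ? 0 : 1;
}

Keeping the queue depth at two is enough to hide decode latency during steady playback while bounding the extra memory footprint to a pair of timesteps.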
3
High Performance MD Trajectory I/O
The key I/O requirements for effective out-of-core immersive visualization of MD trajectories include extremely high bandwidth disk I/O rates, jitter-free streaming of timesteps with mostly-sequential I/O, and minimization of impact on system memory bandwidth available for host-GPU transfers by rendering threads. Historically it has been difficult to achieve the bandwidth requirements for effective out-of-core visualization for MD simulations of large biomolecular complexes, but recent advances in the state-of-the-art for commodity solid-state disks (SSDs) have created an unforeseen opportunity, enabling out-of-core approaches to become practical for large MD simulations for the first time. Although SSDs can provide the required I/O rates at the hardware level, capitalizing on these rates within immersive molecular visualization software requires special design and implementation considerations. Molecular dynamics trajectories are typically stored using custom-designed file formats that are space efficient, and are time efficient for the simulation software to write during the course of simulation, often in a simple array-of-structures format (see Fig. 1). Trajectory files contain, at a minimum, per-timestep atom coordinates and periodic cell dimensions. Many MD packages can also optionally store velocities, forces, energies or other properties.
Fig. 1. Memory representation of a typical timestep, stored in array-of-structures format, where each atom’s complete set of heterogeneous data fields is stored contiguously, followed by subsequent atoms.
Fig. 2. Timestep stored in structure-of-arrays format, where each field of atomic data is stored sequentially for all atoms, followed by the next field of atomic data, for all atoms.
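To make Figs. 1 and 2 concrete, the two layouts can be written down as follows. This is an illustrative C++ sketch; the field names and the presence of a velocity field are assumptions for the example, not a definition of any particular trajectory format:

#include <vector>

// Fig. 1: array-of-structures -- all fields of atom i are adjacent, so a
// reader that only wants positions must still stream past the other fields.
struct AtomAoS { float x, y, z, vx, vy, vz; };
using TimestepAoS = std::vector<AtomAoS>;

// Fig. 2: structure-of-arrays -- each field is one long contiguous run, so
// positions can be fetched with a single sequential read and the velocity
// block can be skipped entirely with one seek.
struct TimestepSoA {
  std::vector<float> xyz;   // 3*N floats, x/y/z interleaved as VMD expects in memory
  std::vector<float> vel;   // 3*N floats, optional; skipped if not selected
};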
In consideration of the need for efficient per-timestep data fields composed of different data types, and the ability to skip loading optional data fields, the structure-of-arrays (see Fig. 2) organization begins to present efficiency advantages. The structure-of-arrays organization enables long sequential reads of individual fields, which encourage performance from hardware and operating system disk block read-ahead mechanisms and minimize the number of operating system calls. We note that the storage for data within individual fields may still benefit from an array-of-structures, if that allows the field to be read into visualization and analysis tools in their native in-memory arrangement, e.g. storing atomic positions with interleaved x, y, and z coordinates as used in VMD. The MD trajectory file layout and software design principles outlined above enable excellent disk I/O bandwidths, but several issues still remain. Mainstream operating systems provide automatic buffering and caching of filesystem I/O operations to increase I/O performance for typical application workloads, but these strategies are inappropriate for applications that need to stream tens or hundreds of gigabytes of data. This is far too much data to be cached effectively on today’s machines; indeed, these strategies actually reduce performance for streaming access to large MD simulation trajectories. The negative performance impact of operating system caching is most clearly demonstrated with high performance RAID arrays of SSDs, where buffered I/O achieves only 53-58% of the performance of low-overhead, direct unbuffered I/O methods. Direct unbuffered I/O interfaces typically require all read, write, and seek operations to operate on integer multiples of a particular filesystem- or operating system-defined block size, with all accesses aligned to block boundaries [12]. Direct I/O interfaces also require block-aligned source and destination buffers in host memory. The requirement for block-based I/O is a major source of added programming complexity associated with direct unbuffered I/O. It is often impractical to support legacy trajectory file formats efficiently with block-based I/O unless they have been designed with a structure-of-arrays style timestep data
Fig. 3. Memory representation of Timestep. a) Packed representation. Minimum memory usage. b) Blocked representation. Uses more memory overall, but allows efficient data transfer. c) Further gains can be achieved by intelligently grouping atomic data of importance into contiguous segments.
organization, as shown in Fig 2. The structure-of-arrays organization scheme automatically lends itself to block-based I/O because each per-timestep data field can be padded out to a multiple of the block size, while leaving the overall file structure intact. Application code must then be modified to use block-aligned and block-multiple-sized memory buffers, which, while conceptually straightforward, can be a significant software engineering exercise. The block size required by the direct I/O approach is OS-dependent, and in some cases depends on the filesystem in use on the storage system. The most common block sizes in widespread use are 512-bytes (disk hardware sector size), and 4 KB (filesystem block size). For a trajectory timestep coordinate storage format based on single-precision floating point, a 512-byte block can hold 42.6 atoms, and a 4 KB block can hold 341.3 atoms. The requirement that all I/O operations be padded to multiples of the block size causes some fragmentation at the end of each per-timestep field (see Fig. 3b). Common biomolecular simulations contain well over 30,000 atoms, so space lost to fragmentation at the end of each per-timestep field is inconsequential.
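The following sketch shows what a block-based direct I/O read of one padded coordinate field looks like on Linux. It is an illustration of the technique rather than the actual VMD reader: the fixed single-field record layout, the 4 KB block size, and the file offsets are assumptions made for the example.

// Read one block-padded structure-of-arrays coordinate field with unbuffered
// direct I/O (Linux). Layout follows the spirit of Figs. 2 and 3b.
#define _GNU_SOURCE 1            // exposes O_DIRECT in <fcntl.h> on glibc
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

static const size_t kBlock = 4096;                // assumed filesystem block size
static size_t padToBlock(size_t n) { return (n + kBlock - 1) / kBlock * kBlock; }

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  int fd = open(argv[1], O_RDONLY | O_DIRECT);    // bypass the OS page cache
  if (fd < 0) { perror("open"); return 1; }

  const size_t natoms = 955226;                   // e.g. the STMV test case
  const size_t rawLen = natoms * 3 * sizeof(float);
  const size_t padLen = padToBlock(rawLen);       // field is padded on disk (Fig. 3b)

  void *buf = nullptr;                            // O_DIRECT also needs an aligned buffer
  if (posix_memalign(&buf, kBlock, padLen) != 0) { close(fd); return 1; }

  // Offset of timestep t, assuming a fixed-size record of one padded field each.
  const long t = 0;
  const off_t offset = static_cast<off_t>(t) * static_cast<off_t>(padLen);
  ssize_t got = pread(fd, buf, padLen, offset);   // block-aligned, block-multiple read
  if (got != static_cast<ssize_t>(padLen)) perror("pread");

  const float *coords = static_cast<const float *>(buf);  // only rawLen bytes are real data
  printf("first atom: %g %g %g\n", coords[0], coords[1], coords[2]);
  free(buf);
  close(fd);
  return 0;
}

The padding arithmetic matches the discussion above: 955,226 atoms × 12 bytes = 11,462,712 bytes, which rounds up to 2,799 blocks (11,464,704 bytes), so less than 2 KB per field is lost to fragmentation.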
4
Representation-Dependent I/O Optimization
In MD simulations of biological molecules (e.g. proteins and nucleic acids), water often fills half or more of the overall simulation volume. Simulations of structures that unfold or greatly extend in length can have even larger solvent fractions. Explicit-solvent MD simulations model water in full atomic detail rather than using continuum approximations, potentially resulting in a large fraction of MD simulation trajectory data and its associated I/O bandwidth being devoted to these water molecules. In most visualization scenarios, bulk solvent does not need to be displayed dynamically, providing a significant opportunity for optimization of the I/O operations required for visualization, achieved by skipping past data (coordinates, velocities, etc.) associated with bulk solvent when reading in timesteps (see Fig. 3c). Although a user could remove bulk solvent and create a
reduced trajectory in an off-line preprocessing step, this approach consumes even more disk space and can be avoided by skipping unneeded data on-the-fly during visualization. When it is necessary or beneficial to display water molecules near the solute (i.e. protein or nucleic acid) structure, blocks containing only the selected solvent molecules can be read in individually as a special case, rather than being read or skipped in an all-or-nothing fashion. Selective reading approaches perform best when molecular structure building tools coalesce bulk solvent atoms, ions, and solute into one or very few contiguous ranges of atom indices, resulting in similar coalescing of per-timestep data fields both on-disk and in-memory. In cases where the structure building tools are not designed to optimize for selective reading functionality, a visualization tool could incur a large number of scattered read operations rather than a small number of sequential read operations, resulting in a large decrease in performance. By designing trajectory file formats so that they encode knowledge of which atom index ranges correspond to solute atoms, bulk solvent atoms, and other valuable atom data groupings, the I/O system can automatically skip loading the unneeded timestep coordinates, providing at least a factor of two increase in trajectory animation performance for most explicit-solvent MD simulations. Atom-granularity selective reading approaches that can read in individual atoms at a time are not usually beneficial for disk-based I/O because disks only perform I/O in block-sized transactions; furthermore, the filesystems in use on a particular disk may demand an even larger block size for I/O operations. Although the minimum block size has negligible impact on sequential I/O performance for long reads, it can have a drastic effect on the performance of atom-granularity approaches, making them completely ineffective for cases with an average stride between selected atoms that is less than the number of atoms stored in the required I/O block size. Atom-granularity selective reading approaches are most appropriate for cases where I/O bandwidth is severely constrained, such as when streaming trajectory data over a network connection from a remote supercomputer.
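A sketch of the selective-read logic is shown below. It assumes the trajectory header encodes contiguous atom-index ranges tagged as solute or bulk solvent, as suggested above; the range structure and block size are illustrative, and the routine is meant to be used together with the block-aligned buffers and direct I/O shown earlier.

// Read only the wanted (e.g. solute) index ranges of one coordinate field,
// skipping solvent ranges with a seek instead of a read (Fig. 3c). Sketch only.
#include <unistd.h>
#include <cstdint>
#include <vector>

struct AtomRange { uint64_t first, count; bool wanted; };  // contiguous atom indices

static const uint64_t kBlock = 4096;
static uint64_t floorBlock(uint64_t b) { return b / kBlock * kBlock; }
static uint64_t ceilBlock(uint64_t b)  { return (b + kBlock - 1) / kBlock * kBlock; }

// 'out' mirrors the on-disk field layout: block-aligned, sized to the padded
// field length. Coordinates in unwanted ranges are simply left stale.
bool readSelected(int fd, off_t fieldOffset,
                  const std::vector<AtomRange> &ranges, float *out) {
  for (const AtomRange &r : ranges) {
    if (!r.wanted) continue;                       // skipped: no I/O at all
    uint64_t byteBeg = r.first * 3 * sizeof(float);
    uint64_t byteEnd = (r.first + r.count) * 3 * sizeof(float);
    // Round outward to whole blocks so the request stays valid for direct I/O.
    uint64_t blkBeg = floorBlock(byteBeg), blkEnd = ceilBlock(byteEnd);
    off_t pos = fieldOffset + static_cast<off_t>(blkBeg);
    ssize_t got = pread(fd, reinterpret_cast<char *>(out) + blkBeg,
                        blkEnd - blkBeg, pos);
    if (got != static_cast<ssize_t>(blkEnd - blkBeg)) return false;
  }
  return true;
}

Because whole solvent ranges are skipped with a single seek, the number of read calls stays proportional to the number of selected ranges rather than to the number of selected atoms, which is why coalescing solute atoms into a few contiguous index ranges matters so much in practice.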
5
Performance Evaluation
We have measured the performance of prototype implementations of the techniques described above in a modified version of VMD [5], using several biomolecular simulation trajectories of varying sizes, listed in Table 1. All benchmarks were conducted on a test system containing two quad-core 2.67 GHz Intel Xeon X5550 CPUs with 72 GB RAM and an NVIDIA Quadro 7000 GPU. Disk-based trajectory I/O performance tests were performed using a single 500 GB SATA II hard drive, a single 120 GB SATA III 6 Gbit/s SSD, and an 8-way SSD-based RAID-0 volume, with a RAID stripe width of 32 KB, driven by an Areca 1880ix RAID controller installed in a PCI-express x8 expansion slot. All tests were run with CentOS Linux version 5.6.
Table 1. Molecular Dynamics Simulation Test Cases. The number of atoms is listed for each test case, both for the complete simulation (Full) and for just the non-solvent (NoSolv) atoms.
Test Case | Atoms | Description
STMV | Full: 955,226; NoSolv: 178,306 | Satellite Tobacco Mosaic Virus is a small plant virus containing a viral capsid (60 identical proteins), and a 1,063 nucleotide single-stranded RNA genome.
Ribosome | Full: 2,935,347; NoSolv: 1,550,904 | The ribosome is a large molecular machine responsible for translating genetic material into functional proteins.
Membrane | Full: 22,769,085; NoSolv: 2,833,950 | Membrane patch containing light harvesting proteins as found in photosynthetic bacteria.
BAR | Full: 116,110,965; NoSolv: 1,975,386 | BAR domains are found in many organisms and drive the formation of tubular and vesicular membrane structures in a variety of cellular processes.
5.1
Molecular Dynamics Trajectory I/O Performance
We evaluated the performance of multiple trajectory file structures and I/O approaches over a range of problem sizes including sizes suitable only for petascale supercomputers (see Table 1). The I/O approaches that we evaluated included:
– DCD, Normal I/O: The highest performance trajectory file reader for the legacy DCD binary trajectory format used by CHARMM, NAMD, X-PLOR, and other popular simulation packages, using traditional buffered I/O.
– OOC, Normal I/O: A newly designed trajectory file reader for a performance-optimized out-of-core trajectory format, using traditional buffered I/O.
– OOC, Direct I/O: A newly designed trajectory file reader for a performance-optimized out-of-core trajectory format, using a zero-copy unbuffered direct I/O implementation that bypasses the OS kernel filesystem buffer cache.
– OOC, Direct I/O, NoSolv: A hybrid approach combining the block-based direct I/O approach above with algorithms that skip past atom coordinates associated with bulk solvent.
In order to measure streaming read performance reliably, large trajectories were used (30 GB or larger), and the Linux filesystem buffer cache was cleared before each test run. The Normal cases use the Unix readv() system call to perform all read operations for a timestep with a single system call. The Direct cases open files with the O_DIRECT flag for unbuffered I/O, and read trajectory file timesteps composed of 4 KB blocks, also using readv() to minimize system call overhead. Test cases using the NoSolv method read in protein, nucleic acid, and ions within the per-timestep atom coordinates, but skip past bulk solvent. Performance results are presented in Table 2. The performance results for the single-SSD and SSD RAID-0 cases demonstrate a significant performance advantage obtained from the block-based unbuffered direct I/O method (Direct), with performance improvements ranging from a factor of 1.6× up to 2.7× faster for
Table 2. Trajectory I/O performance is presented in terms of the rate of trajectory timesteps (TS) read per second and the associated I/O bandwidth, for several simulation sizes and I/O methods. Tests were performed on three types of storage hardware: a traditional rotating magnetic hard drive (HD), a solid-state disk (SSD), and an 8-way SSD-based RAID-0 volume (RAID). We do not report RAID speedups vs. HD or SSD for the largest cases as the test files were too large to be stored on the single drives.
Hardware | Test Case | I/O Method | Atoms Loaded | Rate (TS/s) | Bandwidth (MB/s) | Speed vs. DCD (RAID / SSD / HD)
HD | STMV | DCD, Normal | 0.955M | 9.3 | 102 | 0.14 / 0.39 / 1.0
HD | Ribosome | DCD, Normal | 2.94M | 3.0 | 105 | 0.12 / 0.38 / 1.0
SSD | STMV | DCD, Normal | 0.955M | 23.7 | 259 | 0.35 / 1.0 / 2.5
SSD | STMV | OOC, Normal | 0.955M | 29.6 | 323 | 0.30 / 1.2 / 3.2
SSD | STMV | OOC, Direct | 0.955M | 37.2 | 406 | 0.20 / 1.6 / 4.0
SSD | STMV | OOC, Direct, NoSolv | 0.178M | 174.3 | 355 | 0.45 / 7.3 / 18.7
SSD | Ribosome | DCD, Normal | 2.94M | 7.8 | 262 | 0.32 / 1.0 / 2.6
SSD | Ribosome | OOC, Normal | 2.94M | 9.5 | 319 | 0.28 / 1.2 / 3.2
SSD | Ribosome | OOC, Direct | 2.94M | 12.2 | 409 | 0.20 / 1.6 / 4.1
SSD | Ribosome | OOC, Direct, NoSolv | 1.55M | 23.0 | 408 | 0.24 / 2.9 / 7.7
RAID | STMV | DCD, Normal | 0.955M | 67 | 754 | 1.0 / 2.83 / 7.2
RAID | STMV | OOC, Normal | 0.955M | 98 | 1,075 | 1.5 / 3.31 / 10.5
RAID | STMV | OOC, Direct | 0.955M | 182 | 1,998 | 2.7 / 4.89 / 19.5
RAID | STMV | OOC, Direct, NoSolv | 0.178M | 386 | 787 | 5.8 / 2.21 / 41.5
RAID | Ribosome | DCD, Normal | 2.94M | 24.3 | 815 | 1.0 / 3.12 / 8.1
RAID | Ribosome | OOC, Normal | 2.94M | 33.7 | 1,133 | 1.4 / 3.55 / 11.2
RAID | Ribosome | OOC, Direct | 2.94M | 60.7 | 2,037 | 2.5 / 4.98 / 20.2
RAID | Ribosome | OOC, Direct, NoSolv | 1.55M | 96.4 | 1,711 | 4.0 / 4.12 / 32.1
RAID | Membrane | DCD, Normal | 22.8M | 3.0 | 781 | 1.0 / – / –
RAID | Membrane | OOC, Normal | 22.8M | 4.6 | 1,207 | 1.5 / – / –
RAID | Membrane | OOC, Direct | 22.8M | 8.0 | 2,087 | 2.6 / – / –
RAID | Membrane | OOC, Direct, NoSolv | 2.83M | 46.5 | 1,508 | 15.5 / – / –
RAID | BAR | DCD, Normal | 116M | 0.6 | 708 | 1.0 / – / –
RAID | BAR | OOC, Normal | 116M | 0.9 | 1,189 | 1.5 / – / –
RAID | BAR | OOC, Direct | 116M | 1.6 | 2,130 | 2.6 / – / –
RAID | BAR | OOC, Direct, NoSolv | 1.98M | 76.3 | 1,725 | 127.2 / – / –
the larger simulations. The performance gain demonstrates the benefit of avoiding multiple memory buffer copies that occur with traditional buffered I/O. We expect the performance benefit from the Direct approach to be even greater under circumstances of concurrent high-bandwidth host-GPU memory transfers, where the reduced consumption of finite host memory bandwidth will leave more bandwidth available for the GPUs. The Direct method also avoids deleterious effects of memory starvation that can occur on operating systems such as Linux, where heavy sustained streaming I/O (with traditional OS buffering) can cause application memory to get paged out, causing unwanted stuttering and intermittent freezing during immersive display. The two largest cases, Membrane and BAR, did not obtain trajectory timestep streaming rates fast enough to support smooth interactive trajectory animation
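As a rough consistency check on Table 2, a per-timestep coordinate payload is approximately natoms × 3 × 4 bytes. For STMV this is 955,226 × 12 bytes ≈ 10.9 MB (counting 2^20 bytes per MB), so the 37.2 timesteps/s achieved by the OOC Direct reader on a single SSD corresponds to roughly 37.2 × 10.9 MB ≈ 406 MB/s, in agreement with the bandwidth column; the same arithmetic reproduces the 102 MB/s of the hard drive case at 9.3 timesteps/s.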
Fig. 4. A left-eye screen shot from a running out-of-core STMV MD trajectory visualization, achieving a stereoscopic display rate of 44 frames/s
Fig. 5. A left-eye screen shot from a running out-of-core BAR domain MD trajectory visualization, achieving a stereoscopic display rate of 67 frames/s
Table 3. Comparison of visualization redraw and trajectory animation rates for display of a static molecular scene, in-core MD trajectory playback (entirely from pre-loaded data in RAM), and out-of-core MD trajectory playback from a RAID-0 SSD array
Test Case | Visualization Mode | Atoms Loaded | Display Rate (frames/s)
STMV | Stereo, static scene | 0.955M | 105
STMV | Stereo, in-core trajectory animation | 0.955M | 48
STMV | Stereo, out-of-core trajectory animation | 0.955M | 44
BAR domain | Stereo, static scene | 116M | 116
BAR domain | Stereo, in-core trajectory animation | 116M | 70
BAR domain | Stereo, out-of-core trajectory animation | 1.98M | 67
when loading all atoms, even when using the Direct I/O approach. For these cases, we evaluated a selective-read optimization (NoSolv) that skips reading atom coordinates associated with bulk solvent in cases where they are not needed by the user’s graphical representations. We found that the NoSolv optimization boosts performance significantly for the two largest test cases, by a factor of 5.8× for the Membrane test, and by a factor 47.6× for the BAR test. Although the I/O work is significantly decreased by the NoSolv approach, the seek operations that move the file pointer past the bulk solvent incur some operating system overhead and reduce the ability of the RAID controller to perform sequential read-ahead. The disk I/O bandwidth loss associated with the NoSolv optimization is minor for the large test cases where it is truly needed, and is easily offset by the overall trajectory streaming performance gain achieved.
5.2
Out-of-Core Immersive Visualization Performance
The visualization test results shown in Table 3 and Figs. 4 and 5 evaluate the immersive visualization performance for static structure display and for two MD trajectory animation cases comparing a traditional in-core approach vs. the best performing out-of-core MD trajectory I/O methods described above. The out-ofcore trajectory files were read on-the-fly from an SSD RAID. The visualization performance results clearly show that the rendering portion of the visualization workload is insignificant and that the main source of runtime is the per-timestep recomputation of time-varying geometry for the molecular representations. For STMV, the interior molecular representations were continually regenerated on-the-fly from several atom selections totaling 105,524 protein and nucleic acid atoms. Both STMV trajectory tests regenerated the displayed molecular geometry on every frame redraw, representing a lower-bound immersive display rate in each case. The STMV out-of-core test achieved 91% of the in-core performance, a very usable 44 frames/s. The BAR domain test used a very fast OpenGL point-sprite sphere shader for visualization of the solute portions of the model, loading the out-of-core trajectory timesteps using the NoSolv I/O approach, and reaching 95% of the in-core performance. Although the BAR domain used a simpler visual representation than the STMV case, 202,168 particles were displayed (roughly twice as many), requiring a faster approach. We also tested a standard cartoon representation of the BAR domain, but this reduced the display frame rate to 30 frames/s – at the bottom end of the immersion threshold.
6
Future Direction
The performance results above demonstrate that it is possible to achieve the stereoscopic display rates required for effective immersive visualization while smoothly animating large-size and long-timescale all-atom MD simulation trajectories. We plan to extend the selective-read (NoSolv) approach described in this paper to support block-granularity selective loading of trajectory data, enabling higher performance for cases where the user makes sparse atom selections within large biomolecular complexes. Petascale MD codes have begun using parallel I/O to write trajectory data to multiple files, creating an opportunity to use multiple SSD RAID arrays concurrently within a single VMD instance to achieve I/O rates beyond the ability of single PCIe x8 RAID controller. Preliminary tests using multiple RAID controllers achieved I/O rates of up to 4,081 MB/s, indicating that a new multi-file trajectory format should enable performance limited only by the host machine’s PCIe bus and operating system overhead. We also plan to build on our prior work with network-connected interactive MD simulation [6, 13] to explore the use of atomic coordinate compression and atom-granularity selective-read approaches to support immersive visualization of extremely large MD trajectories stored on remote supercomputers, accessed using a client-server version of VMD over high bandwidth networks.
Acknowledgments. This work was supported by National Institutes of Health grant P41-RR005969.
References 1. Kreylos, O., Bawden, G.W., Kellogg, L.H.: Immersive visualization and analysis of LiDAR data. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 846–855. Springer, Heidelberg (2008) 2. Kuester, F., Bruckschen, R., Hamann, B., Joy, K.I.: Visualization of particle traces in virtual environments. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST 2001, pp. 151–157. ACM, New York (2001) 3. Gao, Y., Deng, B., Wu, L.: Efficient view-dependent out-of-core rendering of largescale and complex scenes. In: Proceedings of the 2006 ACM International Conference on Virtual Reality Continuum and its Applications. VRCIA 2006, pp. 297–303. ACM, New York (2006) 4. Grottel, S., Reina, G., Dachsbacher, C., Ertl, T.: Coherent culling and shading for large molecular dynamics visualization. Computer Graphics Forum (Proceedings of EUROVIS 2010) 29, 953–962 (2010) 5. Humphrey, W., Dalke, A., Schulten, K.: VMD – Visual Molecular Dynamics. J. Mol. Graphics 14, 33–38 (1996) 6. Stone, J.E., Kohlmeyer, A., Vandivort, K.L., Schulten, K.: Immersive molecular visualization and interactive modeling with commodity hardware. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Chung, R., Hammound, R., Hussain, M., KarHan, T., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6454, pp. 382–393. Springer, Heidelberg (2010) 7. Tarini, M., Cignoni, P., Montani, C.: Ambient occlusion and edge cueing for enhancing real time molecular visualization. IEEE Transactions on Visualization and Computer Graphics 12, 1237–1244 (2006) 8. Chavent, M., Levy, B., Maigret, B.: MetaMol: High-quality visualization of molecular skin surface. J. Mol. Graph. Model. 27, 209–216 (2008) 9. Stone, J.E., Saam, J., Hardy, D.J., Vandivort, K.L., Hwu, W.W., Schulten, K.: High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs. In: Proceedings of the 2nd Workshop on GeneralPurpose Processing on Graphics Processing Units. ACM International Conference Proceeding Series, vol. 383, pp. 9–18. ACM, New York (2009) 10. Krone, M., Bidmon, K., Ertl, T.: Interactive visualization of molecular surface dynamics. IEEE Transactions on Visualization and Computer Graphics 15, 1391– 1398 (2009) 11. Chavent, M., Levy, B., Krone, M., Bidmon, K., Nomine, J.P., Ertl, T., Baaden, M.: GPU-powered tools boost molecular visualization. Briefings in Bioinformatics (2011) 12. Arcangeli, A.: O DIRECT. In: Proceedings of the UKUUG Linux 2001 Linux Developer’s Conference (2001) 13. Stone, J.E., Gullingsrud, J., Grayson, P., Schulten, K.: A system for interactive molecular dynamics simulation. In: Hughes, J.F., S´equin, C.H. (eds.) 2001 ACM Symposium on Interactive 3D Graphics, ACM SIGGRAPH, New York, pp. 191– 194 (2001)
The OmegaDesk: Towards a Hybrid 2D and 3D Work Desk
Alessandro Febretti, Victor A. Mateevitsi, Dennis Chau, Arthur Nishimoto, Brad McGinnis, Jakub Misterka, Andrew Johnson, and Jason Leigh
Electronic Visualization Laboratory, University of Illinois at Chicago
Abstract. OmegaDesk is a device that allows for seamless interaction between 2D and 3D content. In order to develop this hybrid device, a new form of Operating System is needed to manage and display heterogeneous content. In this paper we address the hardware and software requirements for such a system, as well as challenges. A set of heterogeneous applications has been successfully developed on OmegaDesk. They allowed us to develop a set of guidelines to drive future investigations into 2D/3D hybridized viewing and interaction.
1 Introduction Historically, Virtual Reality (VR) systems have been thought of entirely for the purposes of supporting virtual world interactions. In 1999 the Electronic Visualization Laboratory (EVL) conceived of a new type of work desk that would blend 2D and 3D display and interaction capabilities to enable users to work seamlessly with 2D content (such as text documents and web browsers), as well as 3D content (such as 3D geometry and volume visualizations). We believed that for VR to emerge out of a small niche community, it had to become a seamless part of the computing continuum. At the time, the state of the art in hardware did not make such a system practical. However, today, minimally encumbering and reliable stereoscopic displays and tetherless tracking systems are becoming highly affordable. Also, numerous vendors are emerging to provide multi-touch overlays that are easy to incorporate into existing display systems. It is therefore possible now to develop our hybrid 2D/3D work desk, which we call OmegaDesk. What is still missing, however, is a new form of Operating System that enables the effortless and intuitive manipulation of both 2D content (such as spreadsheets, word processing documents, web browsers) and 3D content (such as CAD or scientific visualizations). In this paper we report on our first steps toward addressing this problem, which resulted in the development of an API and exemplary applications for examining issues relating to 2D/3D hybridized viewing and interaction. 1.1 Vision The effectiveness of presenting data in different modalities has been the subject of previous research. 2D views have been found to be better when used to establish precise relationships between data, and for visual search [1] and [2], while 3D is very
effective for approximate 3D manipulation and navigation, especially with the use of appropriate cues, like shadows. In [3] it is suggested that combining both views leads to good or better analysis and navigation performance than using 2D or 3D alone. These findings are confirmed in [4], where in an air traffic control simulation 2D displays proved to be better for checking aircraft speed and altitudes while 3D was best used to perform collision avoidance.
Fig. 1. This figure illustrates the initial concept of OmegaDesk as envisioned in 1999
Our vision for OmegaDesk is of an integrated hardware and software system that allows for rapid development and deployment of tools that make use of this hybrid visualization capability. Also, we envision OmegaDesk not specifically as a VR device or a Workstation, but as a Work Desk, i.e. computer-enhanced furniture. Application scenarios range from scientific visualization of complex scientific datasets ([5], [6]), interaction with dynamic geospatial information (e.g. air traffic control, [4]), and analysis of medical data for research or surgery planning ([7], [8]), to general scenarios where a 3D, qualitative display of information can be enriched by a separate or overlaid 2D, quantitative view of the same information. We will first describe the implementation of the OmegaDesk, and the middleware to drive it. Along the way we will describe some of the challenging issues we have encountered in building the system. Then we will describe the applications that we have built to test the system, and the lessons learned. Lastly, we will conclude with an evaluation of the developed case studies and our plans for future investigation and development of the system.
2 Related Work Considered as a purely hardware system, the OmegaDesk structure is comparable to other designs. The sliceWIM system presented in [5] offers two separate views of the
data, with interaction done exclusively through a touch interface. While effective, the system has been designed around a very specific task (exploration of volume datasets) and, while it supports an overview and detail view of the data, it is not really designed to support the overlapping of 3D and 2D information. The IQ-Station [9] is a low-cost immersive system based on a 3D display and a set of OptiTrack motion capture cameras. Although there are some technical similarities between the IQ-Station and the OmegaDesk, the former focuses less on the hybrid 2D and 3D aspect that is central to our design. In the introduction we also stated how OmegaDesk needed an operating system or middleware that would enable the development of applications on a hybrid 2D/3D system. This middleware would allow for both high performance scientific visualization and interaction with higher level, rapid development toolsets. This gives application programmers the ability to rapidly develop on platforms such as Unity3D and Processing [10]. Additionally, it was important that there was a layer of abstraction between input devices and the developer. A variety of libraries (such as trackD [11] and VRPN [12]) offer an abstraction layer to handle virtual reality input devices. Others, like freeVR [13] and the Vrui [14] toolkit, take this a step further, integrating display management for 3D rendering. Products like getReal3D [15] allow users to design virtual reality environments using high-level toolsets (Unity in this case).
3 OmegaDesk Hardware The OmegaDesk concept is illustrated in Fig. 2. OmegaDesk consists of two stereo displays, one positioned horizontally at a 45-degree angle and another positioned vertically in front of the user. The PC that drives the displays is a Windows 7 64-bit machine with an Intel Core2 2.93 GHz CPU, 4 GB of RAM, and two NVIDIA GeForce GTX 480 GPU cards. For OmegaDesk, two Panasonic Viera TC-P65VT25 displays have been used.
Fig. 2. This figure shows the various commercial technologies that make up OmegaDesk
The use of commercially available displays allows the flexibility of using any high-resolution 3D consumer display system and enables the low-cost construction of such
work-desks. While the cost of high-resolution 3D displays has dropped significantly in the past 5 years, it is our belief that it will drop further, making it affordable to build future OmegaDesk-like work desks. Table 1. Operational modes of OmegaDesk
Operational Mode | Potential Application Usage
Top 3D, Bottom 3D | Fully immersive mode. Ideal for applications that require navigation through a virtual space or bringing 3D objects close up for manipulation, etc.
Top 3D, Bottom 2D | 3D Viewer mode. The vertical display is used to visualize 3D objects and worlds, while the horizontal display can be used to control aspects of the visualization.
Top 2D, Bottom 3D | ‘Bathtub’ mode. The horizontal display is used to look at 3D data top-down, like looking at a fish tank from the top, and the vertical display is used to look at 2D projections or slices of the data.
Top 2D, Bottom 2D | Touch augmented desktop / cubicle mode. The vertical display is the wall of the cubicle while the horizontal display is like a giant iPad where document editing and manipulation can be performed.
3.1 Input Interfaces For manipulation of objects in 2D, the bottom display is overlaid with the MultiTouch G3 Plus overlay manufactured by PQLabs, which can detect up to 32 simultaneous touches. For head tracking and 3D object manipulation, OmegaDesk can use either the five OptiTrack FLEX:V100R2-FS cameras positioned around OmegaDesk or a Microsoft Kinect. Kinect user tracking is performed through the OpenNI library [16]. While the Kinect can perform tetherless multi-body tracking, it lacks the accuracy of OptiTrack and does not provide orientation for all the tracked body parts. On the other hand, the coverage area of OptiTrack is reduced in comparison with the Kinect's (Fig. 3).
Fig. 3. This diagram shows the area of coverage of both the Optitrack and the Kinect
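For reference, polling a user's head position through OpenNI looks roughly like the following. This is a generic sketch against the OpenNI 1.x C++ wrapper, not OmegaDesk's actual tracking service, and the user-calibration callbacks that skeleton tracking requires (pose detection, calibration requests) are omitted for brevity:

// Poll tracked users and report head positions via OpenNI (Kinect).
// Note: without the omitted calibration step, IsTracking() stays false.
#include <XnCppWrapper.h>
#include <cstdio>

int main() {
  xn::Context ctx;
  if (ctx.Init() != XN_STATUS_OK) return 1;

  xn::UserGenerator users;
  if (users.Create(ctx) != XN_STATUS_OK) return 1;
  users.GetSkeletonCap().SetSkeletonProfile(XN_SKEL_PROFILE_ALL);

  ctx.StartGeneratingAll();
  for (int frame = 0; frame < 300; ++frame) {     // bounded loop for the example
    ctx.WaitAndUpdateAll();                       // block until a new depth frame

    XnUserID ids[4];
    XnUInt16 n = 4;
    users.GetUsers(ids, n);                       // currently detected users
    for (XnUInt16 i = 0; i < n; ++i) {
      if (!users.GetSkeletonCap().IsTracking(ids[i])) continue;
      XnSkeletonJointPosition head;
      users.GetSkeletonCap().GetSkeletonJointPosition(ids[i], XN_SKEL_HEAD, head);
      if (head.fConfidence > 0.5f)                // would feed head-tracked stereo views
        printf("user %u head: %.0f %.0f %.0f (mm)\n", ids[i],
               head.position.X, head.position.Y, head.position.Z);
    }
  }
  ctx.Release();
  return 0;
}

In a complete tracking service, the reported head position would be transformed into the display coordinate frame and fed to the stereo view matrices on every frame.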
Immersive navigation is accomplished with the use of game controllers. With the wide adoption of game consoles like the Wii, Xbox 360 and PlayStation 3, users are accustomed to navigating virtual worlds using a game controller. Both the PlayStation 3 and Xbox 360 wireless controllers can be used as props when developing applications for OmegaDesk.
4 Omegalib The final software development objective for OmegaDesk would be the creation of a 2D-3D-aware Operating System. A first step towards that objective is the implementation of a middleware system that would ease the development of applications on hybrid work desks and increase their portability across hardware changes or device configurations. We explained how none of the existing libraries covered our full set of requirements in an easy, out-of-the-box way. This led us to build our own software development kit, called Omegalib.
Fig. 4. This diagram shows the overall outline of the Omegalib architecture
4.1 Hardware Abstraction Inside Omegalib, hardware abstraction is implemented through two concepts: display system abstraction and input system abstraction. Display System Abstraction. Omegalib manages rendering using the concept of display systems: A display system takes care of setting up the graphical hardware system, creating windows and viewports, setting up transformations and rendering pipelines and calling the appropriate application-level rendering functions. Currently, two display systems have been implemented: a simple GLUT based display system used mainly for debug purposes, and an Equalizer based display system. Equalizer is a toolkit for scalable parallel rendering based on OpenGL. It allows users to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics
systems to single-processor single-pipe desktop machines [17]. In the near future, we are considering the introduction of a new display system to support autostereoscopic displays based on active parallax barriers, like the Dynallax [18]. The separation between rendering management and the actual application rendering code allowed us to support the concept of rendering layers. Layers represent conceptually separate sets of graphical primitives (for instance a 3D scene and a 2D interface) that can be enabled or disabled for specific output channels of the display system. In this way, it is very easy to implement separate 3D views for the same application, or create a management window running on a secondary display, showing an administration UI or a debug-mode scene rendering. It is also possible to perform rendering of layers on separate threads, and compose them in the target channel frame buffer: this can be used to make the rendering performance of 2D elements of the application independent from the complexity of the 3D scene, in order to maintain a good frame rate and responsiveness on the UI as the visualized scene grows in complexity.
Input Device Abstraction. Omegalib gives applications access to input devices through the concept of event services: an event service manages one physical or logical event source in the system. For instance it can:
• offer access to events from a real input device, like a touch display or a motion capture system;
• receive events from a remote source through a network connection;
• generate input from a logical source, like a user interface button or slider;
• process events from other sources to act as a background utility service (for example, a service can get position data for the user's head from a tracking or motion capture service, update the observer head matrices for a scene, and send the application updates on the user tracking status).
Event services allow for a great deal of flexibility. They abstract the physical input devices available to the system. Also, they make it possible to modularize several common components of a virtual reality application (like user tracking or network message passing), so that they can easily be reused in applications. Omegalib also supports the streaming of events to external applications, acting as a display-less input server. This simplifies the development of OmegaDesk applications using different toolsets (such as Unity or Processing) and streamlines the integration of input support into legacy applications that treat the device displays as normal screens, but want to use the motion capture, tracking or multitouch capabilities of OmegaDesk.
Configuration. Similar to other VR libraries, Omegalib allows applications to be reconfigured using system description files: display system, event service and application parameters are all stored in configuration files; the same application can run on OmegaDesk with head and hand tracking, on a multitouch tiled display without stereo support, or on a developer laptop using just mouse and keyboard interaction.
4.2 Interaction Through the use of tracker-based mocap, Kinect user tracking and touch screens, OmegaDesk offers a wide range of possibilities in terms of user interaction. Different
applications may request subsets of the available input devices and implement an interaction scheme that works best for the specific application scenario: in some instances, the motion capture system may be used just for head tracking, while interaction with the application's 3D objects is realized through the touch screen. In other scenarios we may need a full mocap-based interaction scheme, with direct hand manipulation of the 3D objects. We expect that a small, predefined set of interaction metaphors will satisfy most of the interaction needs of end applications. In this case, it makes sense to modularize them and make them available to application developers as packaged interaction schemes that can easily be turned on, turned off, or switched inside an application, allowing both consistency and reuse of interaction schemes and fast prototyping of applications using different interaction techniques. To implement this, Omegalib offers support for a simple scene graph system based on Ogre [19] that can be controlled through interaction objects. These objects implement interaction policies by getting input from the event services and controlling nodes and objects in the scene graph.

4.3 Integration with Scientific Visualization Tools

One of the purposes of OmegaDesk is to serve as a scientific visualization tool: it is therefore necessary to integrate it with standard tools and libraries, like the Visualization Toolkit (VTK) [20]. Through Omegalib, OmegaDesk is able to load VTK pipelines as Python scripts, render them through the Omegalib display system, and interact with VTK actors and 3D models using the interaction schemes presented in the previous section. VTK Python scripts can also create user interface widgets that modify the visualization pipeline and can be controlled through the touch screen. It is also possible to create VTK programs for OmegaDesk natively, using the C++ VTK API directly. This makes it easy to build VTK programs for OmegaDesk or to port legacy pipelines to the system.
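To give an idea of what such a script contains, the following is a minimal, self-contained VTK pipeline in Python. This is our own illustrative sketch, not code from Omegalib or OmegaDesk: the input file name and iso-value are made up, and the Omegalib-specific loading, encapsulation, and display steps (cf. Fig. 5) are omitted; a plain VTK render window is used instead so that the example runs on its own.

```python
# Minimal VTK pipeline: read a volume, extract an isosurface, wrap it in an actor.
import vtk

reader = vtk.vtkStructuredPointsReader()
reader.SetFileName("gpr_volume.vtk")      # hypothetical input file

contour = vtk.vtkContourFilter()
contour.SetInputConnection(reader.GetOutputPort())
contour.SetValue(0, 128.0)                # iso-value chosen purely for illustration

mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(contour.GetOutputPort())
mapper.ScalarVisibilityOff()

actor = vtk.vtkActor()
actor.SetMapper(mapper)
actor.GetProperty().SetColor(0.8, 0.8, 1.0)

# In a stand-alone VTK program the actor is added to a renderer as below;
# in the OmegaDesk setting the actor would instead be handed to the support
# module that encapsulates VTK actors for the Omegalib display system.
renderer = vtk.vtkRenderer()
renderer.AddActor(actor)
window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)
interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(window)
window.Render()
interactor.Start()
```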
Fig. 5. The integration of VTK pipelines inside Omegalib is done through a support module that performs VTK actor encapsulation and feeds back user actions to the pipeline
5 Application Case Studies

A set of heterogeneous applications has been developed on OmegaDesk so far. Some are built to test the interaction and display capabilities of the system, while others are designed to solve domain-specific problems in areas as diverse as rehabilitation therapy, histology, and fluid dynamics.
5.1 Mesh Viewer / VTK Viewer

The mesh viewer application has been developed to test 3D object manipulation via hand gestures. It allows the user to drop one or more objects into a 3D scene by selecting them through the touch display. Interaction takes place using both hands to intuitively perform rotation, scaling, and translation. Head and hand tracking can be provided by the OptiTrack system or by the Kinect alone. The VTK viewer application takes the mesh viewer concept a step further: it supports loading VTK pipelines through Python scripts and rendering multiple VTK actors. These actors can then be manipulated using the same interaction techniques offered by the mesh viewer. Additionally, selected parameters of the VTK pipeline can be configured at runtime through a touch interface created dynamically on the bottom display.

5.2 Physical Therapy Simulation

The Physical Therapy Simulation is a rehabilitation exercise created using Unity3D and Omegalib through a collaborative effort with the Kinesiology department at UIC. It is used to test the efficacy of physical therapy through the use of VR. The scene consists of a simple room where a virtual ball is tossed to the patient. This has the effect of strengthening feedforward postural control in the patient, which helps maintain balance during daily movements. This application will help determine whether visual stereoscopy provides enough visual cues to the brain to enhance current physical therapy methods. It utilizes Omegalib's data streaming capability from an OptiTrack motion capture system and a Kinect.

5.3 Histology Viewer

With the development of powerful microscope optics and the latest advances in high-resolution image sensors, scientists are able to delve into the micro and nano scale and explore structures that are invisible to the naked eye under normal conditions. In particular, in the medical laser research field, physicians study 1 cm by 0.5 cm blocks of laser-damaged skin. Using specialized hardware, the block is cut into 4-micron-thick slices and digitized with a powerful microscope equipped with a medical imaging device. Typically the physicians use a standard image viewer to browse through the histology images and identify the damaged parts. To leverage the OmegaDesk capabilities, a prototype Histology Viewer was developed. The skin block is reconstructed by stacking the slices into a data volume, which is rendered using ray-casting algorithms. The top display visualizes the 3D reconstruction and gives physicians the ability to look at the data with a high level of detail. The bottom multi-touch display controls the visualization and is used to select which slices of the block will be shown. The physicians can browse back and forth through the data by touching and sliding, and can also select slices of interest to investigate further. Zooming and rotating are supported through pinch and rotate gestures.

5.4 Flow Visualization

FlowViz is a generic 3D flow visualization for OmegaDesk. The application has been built using Processing and has been designed to be easily portable to devices offering
a subset of the capabilities of OmegaDesk. The goal of the project was to create a tool that would enable the viewer to better understand the complex nature of flow data. It is thought that viewing complex 3D flow in a native 3D environment allows the viewer to better understand its behavior. Also, the multi-touch interface lets the viewer interact with the simulation in an intuitive way: users can touch a 2D representation of the 3D view, causing a stream source to be spawned at the point touched. This source can be either a dynamic particle generator or a static streamline. Particles flow through the vector field, exposing its behavior. In addition, the user may spawn multiple plot windows showing different representations of the model. Users can brush over and select portions of these plots, which outlines the corresponding regions of the 3D data.
6 Evaluation and Future Work

This paper presented OmegaDesk, a prototype 2D and 3D work desk. We described the requirements for such a system to be effective, and how we addressed them at the software and hardware level. The development of several heterogeneous applications on the system allowed us to assess its efficacy in very different domains. The presented applications made use of different device modalities. The mesh viewer used both displays as 3D viewports to create a more immersive experience, overlaying a 2D user interface on the touch-enabled screen, and used hand gestures to interact with the data. The histology and flow visualizations treated the bottom screen as a 2D data presentation display, with the entire interaction driven by the touch surface (no hand gestures). Finally, the physical therapy simulation made use of the top 3D screen only. In this case the interaction was based on hand and head tracking, without the need for touch support. Even though the current set of applications does not cover all of the possible OmegaDesk configurations, it allowed us to develop an initial set of considerations and guidelines for future development on this platform. It is clear that 3D hand gestures can be used for approximate object manipulation, or for applications that do not need precise control. In these instances they can be a very effective and intuitive way of interacting with the system. When more control is needed, though, the precision offered by a touch screen and a 2D or 2.5D interface is unmatched. In this case, it is very important to link the information displayed on the 2D and 3D portions of the application, so that changes in one view of the data influence all the others. These changes should be propagated as quickly as possible and, most importantly, each view of the data should be able to update itself on the displays without depending on the refresh speed of other views. This is similar in concept to the separation of processes running in an operating system: even if they can exchange data with each other, none of them should be allowed to slow down the entire system. Our future work will involve not only building new applications leveraging OmegaDesk capabilities, but also continuing the development of Omegalib, to make it a complete, operating-system-like middleware supporting complex multimodal development on our evolving hardware system.
Fig. 6. Photos of several applications running on OmegaDesk. (a) A user interacting with 2D graphs of water flow in a specific area of Corpus Christi Bay while comparing them to the vector field of the surrounding areas. (b) Reviewing 2D histology slides while comparing them to the 3D volume rendering. (c) A user rotating and translating an object within the mesh viewer. (d) A user using OmegaDesk to simulate catching a ball as part of physical therapy.
Acknowledgements. This publication is based on work supported in part by Award Nos. FA7014-09-2-0003, FA7014-09-2-0002, made by the US Air Force, and Award CNS-0935919, made by the National Science Foundation.
References
1. Smallman, H.S., St John, M., Oonk, H.M., Cowen, M.B.: Information availability in 2D and 3D displays. IEEE Computer Graphics and Applications 21, 51–57 (2001)
2. Springmeyer, R.R., Blattner, M.M., Max, N.L.: A characterization of the scientific data analysis process. In: Proceedings of IEEE Conference on Visualization 1992, pp. 235–242 (1992)
3. Tory, M., Kirkpatrick, A., Atkins, M., Moller, T.: Visualization task performance with 2D, 3D, and combination displays. IEEE Transactions on Visualization and Computer Graphics 12, 2–13 (2006)
4. Van Orden, K., Broyles, J.: Visuospatial task performance as a function of two- and three-dimensional display presentation techniques. Displays 21, 17–24 (2000)
5. Coffey, D., Malbraaten, N., Le, T., Borazjani, I., Sotiropoulos, F., Keefe, D.F.: Slice WIM: a multi-surface, multi-touch interface for overview+detail exploration of volume datasets in virtual reality. In: I3D 2011: Symposium on Interactive 3D Graphics and Games (2011)
6. Kreylos, O., Bethel, E.W., Ligocki, T.J., Hamann, B.: Virtual-Reality Based Interactive Exploration of Multiresolution Data, pp. 205–224. Springer, Heidelberg (2001)
7. Hemminger, B.M., Molina, P.L., Egan, T.M., Detterbeck, F.C., Muller, K.E., Coffey, C.S., Lee, J.K.T.: Assessment of real-time 3D visualization for cardio-thoracic diagnostic evaluation and surgery planning. J. Digit Imaging 18, 145–153 (2005)
8. Pechlivanis, I., Schmieder, K., Scholz, M., König, M.: 3-Dimensional computed tomographic angiography for use of surgery planning in patients with intracranial aneurysms. Acta ... (2005)
9. Sherman, W.R., O'Leary, P., Whiting, E.T., Grover, S., Wernert, E.A.: IQ-station: A low cost portable immersive environment. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Chung, R., Hammound, R., Hussain, M., Kar-Han, T., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6454, pp. 361–372. Springer, Heidelberg (2010)
10. Processing (2011), http://processing.org/
11. trackd, Mechdyne Corporation (2011), http://www.mechdyne.com/trackd.aspx
12. Taylor II, R.M., Hudson, T.C., Seeger, A., Weber, H., Juliano, J., Helser, A.T.: VRPN: A Device-Independent, Network-Transparent VR Peripheral System (2001)
13. FreeVR (2011), http://www.freevr.org/
14. Vrui VR Toolkit (2011), http://idav.ucdavis.edu/~okreylos/ResDev/Vrui/index.html
15. getReal3D, Mechdyne Corporation (2011), http://www.mechdyne.com/getreal3d.aspx
16. OpenNI (2011), http://www.openni.org/
17. Eilemann, S., Makhinya, M., Pajarola, R.: Equalizer: A Scalable Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics 15, 436–452 (2009)
18. Peterka, T., Kooima, R., Sandin, D., Johnson, A., Leigh, J., DeFanti, T.: Advances in the Dynallax Solid-State Dynamic Parallax Barrier Autostereoscopic Visualization Display System. IEEE Transactions on Visualization and Computer Graphics 14, 487–499 (2008)
19. OGRE (2011), http://www.ogre3d.org/
20. Schroeder, W.J., Avila, L.S., Hoffman, W.: Visualizing with VTK: A Tutorial. IEEE, 1–8 (2000)
Disambiguation of Horizontal Direction for Video Conference Systems

Mabel Mengzi Zhang, Seth Rotkin, and Jürgen P. Schulze
University of California San Diego, 9500 Gilman Dr, La Jolla, CA 92093
{mabel.m.zhang,sethrotkin}@gmail.com,
[email protected]
Abstract. All existing video conferencing systems that support more than two sites and more than one user at each site suffer from directional ambiguity: not only is it generally impossible for two remote users to look each other in the eyes, but even just horizontal directionality is not preserved, i.e., the direction of a user's gaze or pointing finger does not match what the other participants perceive. We present a video tele-conferencing concept which, by combining existing software and hardware technologies, achieves horizontal directionality for multiple sites and multiple participants at each site. Our solution involves multiple cameras, as well as large stereo or multi-view display walls at each site. Because building a physical prototype of our proposed system would have been fiscally impossible for us, we instead built a prototype for our virtual reality CAVE. In this publication we report on our experiences and findings with this prototype.
1 Introduction

Teleconferencing using distributed virtual reality (VR), as opposed to traditional 2D video-based tele-conferencing, has repeatedly been shown to have the potential to be more realistic because of the more natural interactions 3D environments allow [7]. The reason why VR can work better than 2D video is that it can allow realistic eye contact and directionality, which means that when a person turns to the image of another on the display device, that other person correctly perceives that he or she has been turned to, and everybody else in the tele-conference can see that those two participants are facing each other. In practice, none of these VR-based approaches has been commercially successful; we hypothesize that this is because of the high level of software complexity involved, the level of achievable visual accuracy, and the inherent latency such approaches introduce into an already latency-prone application due to long-distance network transfers. Our approach started with the concept of the Cisco TelePresence systems, which are among the most sophisticated commercial tele-conferencing systems. We directly use camera images, which allows for realistic imagery at the remote site, creating a stronger sense that the participants share the same physical space, which is the primary goal of our system and many prior tele-conferencing systems. In this publication, we first summarize prior work, then describe our approach, present our implementation, and finally discuss the insight we gained with our VR prototype.
2 Related Work

To our knowledge, this is the first virtual reality simulator of a multi-site videoconferencing system. It is modeled after the Cisco TelePresence 3000 system, which, along with the HP Halo system (now purchased by Polycom), could be considered the state of the art of video teleconferencing systems [14]. Both systems are steps in the direction we are exploring in this paper. They utilize multiple displays and cameras, specifically placing the cameras so that some viewing directionality is maintained. However, neither system can provide correct directionality for all participants, because each participant is only captured by one camera, so that all participants see the same view of each participant. For example, if a participant looks directly at his or her dedicated camera, it will appear to all remote participants as if that person were looking directly at them. One of the most advanced approaches to video tele-conferencing is that of Maimone and Fuchs [8], which utilizes five Microsoft Kinect devices in order to achieve one-on-one video conferencing with a high level of realism, including eye contact. This work shows that re-creating eye contact is still a hot topic for video-conferencing, and that multiple cameras can be employed to achieve this effect. Our approach goes beyond this work in that it outlines a concept which scales to many participating sites with multiple users at each site. Also, we do not require a 3D reconstruction of the participants, which is at the core of related work by Wu et al. [17] and Chu et al. [2] and which adds latency; and by using regular cameras instead of the Kinect, the range of distances the user can be from the camera is less limited. Of course, Maimone's approach has the benefit of creating a textured 3D model of each participant, which can be used for more than just the simulation of eye contact. Probably the closest to our proposed system's video camera setup is HP's Coliseum system [1]. They propose installing multiple cameras around the user to be able to re-create multiple views of the user from different viewing angles. We achieve this by mounting the cameras in a linear array, which permits us to support more than one user, and our system adds multi-site directional consistency and a concept of how to display the imagery on a screen for multiple viewers. Part of our work is based on ideas in a patent by Fields [3], which proposes that for multi-site video conferencing, the sites should virtually be arranged as the edges of an equilateral polygon with n edges, an n-gon. Fields also proposes using an array of cameras, an array of displays, and camera interpolation (view synthesis, for instance Seitz and Dyer [15]) to create views of the participants from arbitrary directions. Our approach differs from Fields' in that we support multiple viewers at each site with correct viewing angles by utilizing multi-view displays such as 3D stereo or autostereoscopic displays. Also, Fields did not simulate or implement his approach but only described the idea. Another related approach which uses VR technology is MASSIVE [5]. It uses a spatial model of interactions which extracts the "focus" and "nimbus" of the conferees, making the perception of conferees at the remote site relative to the local site's conferee positions and orientations. It also allows users to use audio, graphics, and text media through the network. Our system focuses on directionally correct viewing, which MASSIVE does not address.
Fig. 1. Teleconferencing scenarios with (a) two and (b) three participating sites. The circles indicate users, the rectangles indicate the tables the users sit at. In front of each table is a screen. The dashed lines are the lines of sight for User A, the dotted lines are those for User B. The numbered squares are the intersections of the lines of sight for User A, the numbered triangles are those for User B. Those squares and triangles indicate where on the screen the users at Site 1 have to see the other participants for consistent lines of sight between any two users; these are also the locations where cameras would have to be installed.
A different family of approaches to achieve correct eye contact directions uses the real-time generation of 3D human models. Ohya [9,10] proposes the Virtual Space TELeconferencing (VISTEL) system. By generating a 3D model of the participants and modeling their motion on a screen, this approach is able to achieve motion parallax and correct eye contact. Even though the interaction may be smooth with real-time generation, the 3D images look artificial compared to images captured by cameras. Similarly, Kishino's three-site virtual teleconference system reconstructs an entire human body in 3D and uses two large screens to create the sensation of a common space [6]. Notably, it discusses virtual object manipulation, which is simulated in our system as 3D datasets that float in the middle of the conference space and may be manipulated with a navigation pointer. Kishino's and Yoshida's work uses gesture and speech recognition to detect the user's intent in building virtual objects, whereas we propose a system where the participants interact with datasets using a pointing device [18]. In the domain of head-mounted displays, there are also approaches that construct the virtual environment relative to the head position and orientation [4]. However, the interaction in such systems suffers from the disadvantages of head-mounted displays, such as limited resolution and field of view, noticeable lag on head motion, and the inability to directly see local conference participants.
3 The VR Tele-conferencing System Mock-up

The purpose of our VR prototype application is to verify the feasibility and effectiveness of our proposed tele-conferencing system. Our prototype is built as a C++ plugin for the COllaborative VIsualization and Simulation Environment (COVISE) [13], which is based on OpenSceneGraph [11] as the underlying graphics library. It runs at interactive frame rates (20-30 frames per second) in our StarCAVE, a 5-sided, 15 HD screen, rear-projected CAVE (Cave Automatic Virtual Environment)-like system with an optical tracking system. All parameters and user interactions can be controlled from within the immersive environment with a wireless 3D wand. Our application allows the user to interactively study the impact of the various camera and display parameters and to accurately try out different vantage points, viewing angles, and pointing angles.
Fig. 2. Left: Our virtual conference room with only the static elements. Right: The full system with the dynamic objects, which includes movable screens, cameras, and participants.
Our virtual teleconferencing room draws ideas from the Cisco TelePresence 3000 system. That system consists of a room with three screens at the front and half of an oval table, with room for up to six participants [16]. On top of the middle screen are three cameras with fixed focus. They each point to one pair of users at the table, and when viewed side by side they create a continuous image of the participants. In our virtual model, we replaced the three screens with a larger high-resolution screen, which could in practice be constructed out of an array of narrow-bezel LCD panels, and we replaced the cameras with an array of six cameras along a line in front of the wall, each focusing on one participant (instead of two). We modeled the static components of our conferencing room and the users in Autodesk 3ds Max and exported them to an OSG file for COVISE to read. The dynamic elements, such as the cameras and the screens, are created by our plug-in on the fly using OpenSceneGraph primitives. Figure 2 shows the static and dynamic elements of our virtual conference room. To mock up the concept of a multi-site conferencing system, our prototype simulates a two-site conference by displaying two conference rooms back-to-back, which helped us in debugging what each conference participant should see on the screen. In such a setup, a straight line drawn from one participant to another in the other conference
room illustrates the line of sight between those two participants, indicated by the dotted and dashed lines in Figure 1. This line intersects the display wall in front of the users in each room. These intersections are the locations in each of the two rooms where cameras should be placed and where the other participant should show up on the screen, in order to simulate correct viewing directionality. If we draw a line for every pair of participants, the number of intersection points with the screen equals the number of cameras and images needed. One can then introduce a threshold distance below which two neighboring cameras and images are merged into one, in order to reduce cost and increase available screen real estate. Our prototype uses two types of cameras: one simulates the cameras on top of the screens that look at the participants, and the other simulates what the participants see. In each room, there are six cameras of each type, all of which can be moved, panned, and tilted to adjust the view. These camera views are projected onto our virtual screen at the front of the conferencing room. The images from the cameras pointed at the remote participants are displayed with a render-to-texture approach just above the table, at the same height the Cisco system displays them, to show the users' heads at their natural height. Above those images we display two of the six images from the cameras of the local viewers, to show what the six participants see. The latter images would not be displayed in a physical setup; they are only used to verify that the virtual users see the correct images. The operator of our teleconference simulator can use the 3D wand to adjust the position of the screens (with alpha blending, so that overlapping images can be blended together to form a continuous image), the position of the cameras, and the pan and tilt of the cameras. The user can also select which user's views to display.

3.1 Automatic Camera Selection

In a typical usage scenario, the user chooses a set of active speakers from the array of six in each room: one speaker in one room, and two speakers in the other room. Thus, there are two pairs of active participants, where the single active participant in the first room can look at either or both of the active participants in the second room. In presentation mode, two cameras in each room are automatically chosen, each camera looking at one speaker, such that each chosen camera is the one closest to the line of sight between the pair of speakers (Fig. 3); a sketch of this placement and selection logic is given at the end of Section 3. The images from these four chosen cameras are then projected onto the screens in each room.

3.2 Viewing 3D Models

Viewing datasets as a group is a task that often helps to make discussions clearer. Our system demonstrates how participants of the conference can view datasets in the middle of the conference space together as a group. Our concept of the n-gon naturally lends itself to displaying the data model that is being discussed by the group in the center of the virtual n-gon. Our prototype can load a 3D dataset into the virtual space between the conference participants and the display wall, and the dataset can be manipulated by moving and scaling it with the 3D wand.
Fig. 3. Local (left) and remote (right) rooms with conference in session. For clarity, boxes are superimposed onto the photos: non-bold and bold small boxes denote inactive and active cameras, respectively; non-bold and bold large boxes denote inactive and active participants, respectively. After three active (bold) participants are chosen, one local (left) and two remote (right), four mounted cameras (bold) are automatically chosen, such that they are each the camera with the smallest possible distance to the line of sight between the pair of active participants.
When viewing a data set in the middle of the virtual conference space, there are two ways in which the data can be displayed: either the data set is shown the way it would be seen if it were a physical object viewed by the people around it, so that everybody sees a different side of the object; or, alternatively, the object is displayed so that every user sees the same side of it, which is similar to showing every participant the view of the same camera pointed at the object. Our prototype supports both concepts.
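To make the line-of-sight placement and automatic camera selection described above concrete, the following is a minimal 2D sketch of the underlying geometry. This is our own illustration, not code from the prototype: the top-down coordinates, threshold, and camera positions are invented, the screen is modeled as the line y = 0, and the remote participants are placed at mirrored positions behind it, as in the back-to-back room arrangement.

```python
# Top-down 2D sketch: local users at y > 0, screen on the line y = 0,
# remote users mirrored to y < 0. All coordinates are made up.

def screen_intersection(local, remote):
    """x-coordinate where the local-remote sight line crosses the screen (y = 0)."""
    (x1, y1), (x2, y2) = local, remote
    t = y1 / (y1 - y2)                  # parameter of the crossing point
    return x1 + t * (x2 - x1)

def camera_positions(local_users, remote_users, threshold):
    """Screen positions for cameras/images, merging near-coincident intersections."""
    xs = sorted(screen_intersection(l, r) for l in local_users for r in remote_users)
    merged = [xs[0]]
    for x in xs[1:]:
        if x - merged[-1] < threshold:
            merged[-1] = (merged[-1] + x) / 2.0   # merge neighboring positions
        else:
            merged.append(x)
    return merged

def nearest_camera(cameras, local, remote):
    """Index of the mounted camera closest to the pair's line of sight."""
    x_hit = screen_intersection(local, remote)
    return min(range(len(cameras)), key=lambda i: abs(cameras[i] - x_hit))

local_users = [(-1.0, 2.0), (1.0, 2.0)]       # two local participants
remote_users = [(-1.0, -2.0), (1.0, -2.0)]    # two mirrored remote participants
print(camera_positions(local_users, remote_users, threshold=0.3))   # three merged spots
cameras = [-1.5, -0.5, 0.5, 1.5]              # x-positions of mounted cameras
print(nearest_camera(cameras, local_users[0], remote_users[1]))     # -> 1 (x = -0.5)
```

The merged positions correspond to the numbered squares and triangles in Figure 1, and the per-pair nearest-camera choice is the selection illustrated in Fig. 3.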
4 Discussion

In this section we discuss various topics we came across while developing and using the VR teleconferencing simulator. We strongly believe that the simulator acted as a catalyst that let us more rapidly gain insight into the complicated setup of cameras and displays our prototype proposes, but this claim is hard to quantify.

4.1 View Interpolation

The more local and remote users there are, the more cameras are needed to capture correct lines of sight for the users. The number of cameras C required for L local and R remote users is C = R × L, resulting in rapidly growing numbers as the number of users increases (O(n²)). For example, in Figure 1(a) this number comes out to four for each of the sites; in Figure 1(b) it is six for each of Sites 1 and 3, and four for Site 2. In order to reduce the number of cameras required by the system for correct lines of sight, view synthesis approaches could be used. Seitz and Dyer [15] have shown that this can be done using just two camera images; we hypothesize that more cameras are needed for higher-quality images the more remote participants there are.
4.2 User View Separation

In our simulator we experimented with displaying the different views for the different users in separate windows side by side. Each user gets to see a correct camera view of all remote participants, but their locations can overlap with those of the other local participants. Hence, it is natural to consider display technology which can separate the views for the different users and, for each user, hide those views generated for the other local user(s). Stereo displays would allow for two separate views if the stereo glasses were modified to show either two left-eye images or two right-eye images. This is easier to accomplish with polarizing glasses than with active stereo shutter glasses. This approach is intriguing, but would require the users to wear glasses, which would defeat the system's purpose of allowing direct eye contact and correct gaze direction. Auto-stereoscopic displays can remedy this issue [12], and their quality has increased significantly over the past few years. Many of these displays can generate eight or more views. Most of them, however, would require the users to sit in very specific locations so that they see their dedicated views. This constraint might not be too hard to satisfy, given that in the current Cisco system the users also need to sit in relatively specific locations in order to show up in the right location on the screen.

4.3 Directional Correctness

In our approach, we only consider horizontal directional correctness, but not vertical. Our assumption is that every user is located at the same "height", so that only horizontal directionality matters. Vertical correctness could be achieved if the cameras were installed at eye level, behind or in front of the displays. Alternatively, view interpolation could solve this problem without obstructing the screen: double the number of cameras, install the additional ones below the screen, and interpolate between every vertical pair of cameras.

4.4 View Sharing for 3D Models

Our video conferencing simulator implements the visualization of 3D models in the middle of the virtual conference space. Since no physical cameras are needed for 3D models, it is very easy to superimpose the rendering of such a 3D model onto the video streams from the cameras. In this way it is possible to allow directionally correct viewing of 3D models in the midst of the participants, if the 3D model is virtually placed inside the n-gon, which is where the Network cloud is located in Figure 1. The limitation of this approach is that it only works well if the object is smaller than the virtual space between the conference participants. Ideally, the object is displayed below the sight lines between the participants, similar to the projection approach in Cisco's teleconferencing system. This could be accomplished by extending the display walls down to floor level. Another strategy for view sharing would be for the system to automatically detect which participant is speaking and then show every participant that speaker's view of the 3D object, so that everybody has the same view as the speaker.
5 Conclusion and Future Work

We presented a virtual reality simulation tool for the development of future video conferencing systems, and discussed some of the unique features of our system and its limitations. Based on these simulations, we were able to confirm our hypothesis about directional disambiguation of multi-user, multi-site video conferencing systems: by using a virtual site arrangement as an equilateral polygon, we are able to convey gaze and pointing direction correctly between all conference participants. The simulator itself proved to be very capable of giving the user the impression of a real system, which gave us a much better basis for discussions and insight than sketches and descriptions would have. The next steps towards a future video conferencing system are to implement some of the proposed technology in the simulator: for instance, the algorithmic interpolation between camera views, and a simulation of multi-viewer display systems. In addition, we would like to verify our findings with live video and human participants to see whether human perception matches the results from the simulator. Eventually, we would like to build a physical prototype system to confirm the findings from our simulator.

Acknowledgements. This research was made possible in part by the support of a Cisco Research Center Grant.
References
1. Baker, H.H., Tanguay, D., Sobel, I., Gelb, D., Goss, M.E., Culbertson, W.B., Malzbender, T.: The Coliseum Immersive Teleconferencing System. Technical Report, Hewlett-Packard Laboratories (2002)
2. Chu, R., Tenedorio, D., Schulze, J., Date, S., Kuwabara, S., Nakazawa, A., Takemura, H., Lin, F.-P.: Optimized Rendering for a Three-Dimensional Videoconferencing System. In: Proceedings of PRAGMA Workshop on e-Science Highlights, IEEE International Conference on e-Science, Indianapolis, IN, December 8-12 (2008)
3. Fields, C.I.: Virtual space teleconference system. US Patent 4,400,724 (August 1983)
4. Fuchs, H., Bishop, G., Arthur, K., McMillan, L., Bajcsy, R., Lee, S.W., Farid, H., Kanade, T.: Virtual space teleconferencing using a sea of cameras. In: Proc. First International Conference on Medical Robotics and Computer Assisted Surgery (1994)
5. Greenhalgh, C., Benford, S.: MASSIVE: a collaborative virtual environment for teleconferencing. ACM Transactions on Computer-Human Interaction (TOCHI) 2 (September 1995)
6. Kishino, F., Miyasato, T., Terashima, N.: Virtual space teleconferencing communication with realistic sensations. In: Proc. 4th IEEE International Workshop on Robot and Human Communication (1995)
7. Loeffler, C.E.: Distributed virtual reality: Applications for education, entertainment, and industry. Telektronikk (1993)
8. Maimone, A., Fuchs, H.: Encumbrance-free Telepresence System with Real-time 3D Capture and Display using Commodity Depth Cameras (2011), http://www.cs.unc.edu/maimone/KinectPaper/kinect.html
9. Ohya, J., Kitamura, Y., Takemura, H., Kishino, F., Terashima, N.: Real-time reproduction of 3d human images in virtual space teleconferencing. In: Proc. Virtual Reality Annual International Symposium, VRAIS 1993, pp. 408–414 (1993)
10. Ohya, J., Kitamura, Y., Takemura, H., Kishino, F., Terashima, N.: Virtual space teleconferencing: Real-time reproduction of 3d human images. Journal of Visual Communication and Image Representation 6, 1–25 (1995)
11. OpenSceneGraph. Scenegraph based graphics library (2004), http://www.openscenegraph.org
12. Peterka, T., Sandin, D.J., Ge, J., Girado, J., Kooima, R., Leigh, J., Johnson, A., Thiebaux, M., DeFanti, T.A.: Personal Varrier: Autostereoscopic virtual reality display for distributed scientific visualization. Future Generation Computer Systems 22(8), 976–983 (2006)
13. Rantzau, D., Frank, K., Lang, U., Rainer, D., Wössner, U.: COVISE in the CUBE: An Environment for Analyzing Large and Complex Simulation Data. In: Proceedings of 2nd Workshop on Immersive Projection Technology, IPTW 1998, Ames, Iowa (1998)
14. Sandow, D., Allen, A.M.: The Nature of Social Collaboration: How work really gets done. Reflections 6(2/3) (2005)
15. Seitz, S., Dyer, C.: Physically-valid view synthesis by image interpolation. In: Proceedings IEEE Workshop on Representation of Visual Scenes (in conjunction with ICCV 1995), pp. 18–25 (June 1995)
16. Szigeti, T., McMenamy, K., Saville, R., Glowacki, A.: Cisco TelePresence Fundamentals. Cisco Press, Indianapolis (2009)
17. Wu, W., Yang, Z., Nahrstedt, K., Kurillo, G., Bajcsy, R.: Towards Multi-Site Collaboration in Tele-Immersive Environments. In: Proceedings of the 15th International Conference on Multimedia (2007)
18. Yoshida, M., Tijerino, Y.A., Abe, S., Kishino, F.: A virtual space teleconferencing system that supports intuitive interaction for creative and cooperative work. In: Proceedings of the 1995 Symposium on Interactive 3D Graphics, SI3D (1995)
Immersive Visualization and Interactive Analysis of Ground Penetrating Radar Data

Matthew R. Sgambati1, Steven Koepnick1, Daniel S. Coming1,*, Nicholas Lancaster1, and Frederick C. Harris Jr.2
1 Desert Research Institute
2 Department of Computer Science and Engineering, University of Nevada, Reno
* Corresponding author.
{sgambati,koepnick,dcoming,nick}@dri.edu,
[email protected]
Abstract. Ground Penetrating Radar is a geophysical technique for obtaining information about sub-surface earth materials. Geologists use the collected data to obtain a view of the terrain underground. This data is typically viewed using a desktop interface, where the user usually interacts with a keyboard and mouse. Data visualized in a slice-by-slice 2D format can be difficult to interpret. Instead, we created a program for an immersive visualization environment that uses tracked input devices. This is done using real-time, stereoscopic, perspective-corrected, slice-based volume rendering. To aid the visualization, the user can modify the display of the volume using integrated tools, such as transfer functions, lighting, and color maps. Users are also given data analysis tools to take application-specific measurements such as dip, strike, other angles, and distances in 3D. Compared to typical desktop interactions, the 6-degree-of-freedom user interface provided by the immersive visualization environment makes it notably easier to perform the application-specific measurements.
1 Introduction
Ground Penetrating Radar (GPR) [1] is a geophysical technique used in such fields as archaeology, environmental site characterization, hydrology, sedimentology, and glaciology to obtain 3-D information about subsurface earth materials without the expense and difficulty of excavation or drilling [2]. The data gathered by GPR requires special software in order to be visualized as 2D slice data or a 3D volume. There are many software programs that visualize GPR data [3–8]; however, these applications have not been developed to display GPR data in an immersive visualization environment (IVE) with tracked input devices. IVEs allow the user to view the data in ways that desktop displays are not capable of, such as being able to view around the data and behind it without moving the data. Geologists also need to analyze it, including taking measurements of the thickness and orientation (dip and strike) of sedimentary units. Desktop tools are typically limited to the interactions provided by input
devices designed for 2D interfaces, while tools created for an IVE can use its tracking abilities, providing a different way to interact with the data. We present an immersive application for visualizing GPR and other seismic data with topographic correction and a surface for context, and we introduce interactive analysis tools for exploring and measuring this data and visualizing the dip and strike (surface orientation) of structures. Further, we integrate these tools and other improvements into an open-source immersive volume visualization application called Toirt-Samhlaigh [9].
2 Related Work
Tools exist for analyzing and viewing GPR data on desktop computers. GPRSLICE [5] creates 2D and 3D displays of GPR data and includes many tools, such as isosurfaces and topographic correction. Ground Vision and Easy 3D [6] support acquisition and visualization of GPR data. Another program called Easy 3D [7] visualizes data in 3D from a single-channel or multi-channel GPR system and provides viewing tools, such as filtering. Geoprobe [8] provides many tools to aid viewing and analysis of 3D GPR data, like multi-threaded volume attribute calculations and dynamic calculation and display of horizon-based attributes. These tools do not leverage immersive displays or 3D interaction. There are, however, immersive tools that visualize volumetric data, some of which support geological data [10–16]. Ropinski et al. [10] used a table display to explore seismic volume datasets using context and focus visualization metaphors. Visualizer [11] supports isosurface and streamline creation, slice visualization, and custom tools. Chopra et al. [12] visualized large-scale seismic simulations by simplifying, sampling, and filtering, and supported surface rendering. Other immersive volumetric tools use geometric representations of data, like isosurfaces and slices. Winkler et al. [13] extended a standard geoscience program to an immersive environment and displayed the desktop interface on a virtual surface. Fröhlich et al. [14] let users immersively examine geo-scientific data (seismic volumes and well logs) using a prop-based interaction device and a sonification technique. Dorn et al. [15] presented a system for platform and well planning in an immersive environment, which imported and displayed surface and subsurface data. LaFayette et al. [16] visualized a GPR scan of an anthill by constructing an isosurface on the boundary between soil and air. CoRSAIRe [17] provides analysis of a fluid dataset using isosurface rendering of a simple surface, with haptic feedback according to an invisible isosurface. Other immersive systems have dealt with measurement tools. Kreylos et al. described their immersive LiDAR tool [18], which could measure distances and angles in 3D space. Hagedorn et al. let users measure objects in an IVE [19], using line, cylinder, and ellipsoid tools. Our system visualizes large volumetric datasets, including topographically corrected GPR data, in an immersive environment. We provide interaction tools to make generic and GPR-specific measurements, such as dip, strike, and distance.
Fig. 1. (a) Researchers use a GPR unit on a sand dune to send (b) GPR pulses through the ground, reflecting off surfaces [21] and, after generating many samples, create (c) 2D subsurface profiles [22]
Existing tools lack the means to measure dip and strike of features in subsurface geophysical data such as GPR. We also visualize these measurements, as well as the ground surface.
3 Background

3.1 Ground Penetrating Radar
GPR uses the propagation of electromagnetic waves that respond to changes in the electromagnetic properties of subsurface materials. A GPR unit typically consists of a transmitting and a receiving antenna that send and receive pulses into the ground, as shown in Fig. 1(a). The ways in which this energy is reflected and scattered off surfaces and objects are determined by the relative permittivity contrast between the target and the background. GPR surveys can provide information on the stratigraphic architecture, geometry, and correlation and quantification of sedimentary structures [1]. As seen in Fig. 1(b), as the waves leave the transmitter and travel through the ground, they reflect off subsurface structures. The receiver then detects the reflected waves and records the information. These pulses not only reflect off objects in the subsurface, including cracks and voids, but also reflect off materials with different dielectric properties. This means that GPR can detect subsurface features along with changes in the type of subsurface material, providing a map of the variation in ground electrical properties [20]. A field GPR survey to gather data is typically performed in a grid pattern. A researcher moves along this grid with GPR equipment, taking readings at each grid point. Fig. 1(c) shows an example GPR profile gathered on a sand dune.
3.2 Application of GPR to Studies of Sand Dunes
Sand dunes provide a favorable target for GPR studies because they have a high resistivity, which allows for good penetration of electromagnetic energy, and because they contain large-scale sedimentary structures that can be resolved by GPR [22]. Documenting and analyzing these sedimentary structures is important for understanding sand dune development and provides information on past
Fig. 2. (a) A Brunton Compass [24] measures dip and strike, illustrated in (b) [25]
climates and wind directions. Deposits of ancient sand dunes also occur in the rock record. Many of these ancient aeolian sandstones are important reservoirs for hydrocarbons [23]. Characterizing the sediments of modern sand dunes in order to understand the conditions in which they formed requires measuring the angle and direction of dip of primary and secondary sedimentary structures, which in turn reveal the wind directions that formed them. Measurements of the thickness of sedimentary units are also frequently needed. In field studies, dip and strike of beds are measured using a Brunton Compass. These measurements are, however, hard to make using existing GPR visualization software packages.
3.3 Brunton Compass
A Brunton Compass, shown in Fig. 2(a), is a tool used by geologists to determine the dip and strike angles of surfaces. The angle of dip is a measure of the surface’s steepness relative to horizontal along its gradient. The angle of strike describes the orientation of the surface relative to North. The strike is measured using the strike line, a line on the surface that represents the intersection of a horizontal plane with the surface. The dip is measured using the steepest gradient from the strike line. Another way to represent strike is to use the dip direction, which is the gradient used for the dip measurement projected onto the horizontal surface that is used to create the strike line. This means that the dip direction is always 90 degrees off the strike line (Fig. 2(b)).
4 Application Design
Our goal was to improve available tools for geologists to visualize and analyze data from GPR studies of sand dunes. With a geologist, we identified several formal requirements for the tool: (a) visualize a 3D “data cube” comprising a stack of 2D profiles in Society of Exploration Geophysicists (SEG) Y [26] data format; (b) correct data alignment for topography; (c) collect dip/strike and distance measurements from the volume; (d) visualize dip/strike measurements. We kept a few special considerations in mind, regarding GPR measurements of sand dune subsurfaces. The subsurface is fairly homogeneous, with slight variations mostly due to moisture content. These slight variations are what geologists
Immersive Visualization and Interactive Analysis of GPR Data
37
are interested in. Therefore, we do not expect to find easily segmentable features, but rather want to leverage the domain knowledge that geologists have already developed in looking at 2D transects of sand dunes. We designed our application for use on several systems, from CAVE [27] style displays to 3DTV-based displays. We assume stereoscopic perspective-corrected rendering based on head-tracking and a six degree-of-freedom input device with several buttons. We built our application on Toirt-Samhlaigh [9, 28], a volume rendering application that performs slice-based volume rendering [29] on 3D textures with bricking and an octree acceleration structure [30]. It has 1D and 2D transfer functions to map data values to colors and opacities. It has good support for both immersive displays and desktops and was built on the Virtual Reality User Interface (Vrui) [31] VR toolkit. Because Vrui abstracts the displays and input devices, our efforts focused on processing GPR data, on analysis tools for GPR data, and on ensuring that interface design decisions would work well on each system. Vrui provides an integrated user interface API with widgets and menus which work on both immersive and desktop systems. While designing our application, we strove to follow these design principles: (a) target the interface to the user domain; (b) provide methods to explore data, extract measurements, and see focus in context; (c) minimize the physical fatigue of the user interface; (d) design for a wide variety of immersive displays and, when possible, for non-immersive displays.
5 Visualizing GPR Data
To visualize GPR volume data we load a series of 2D slice files in SEG-Y format and stack them into a 3D volume. Then, we apply topographic correction to this volume, given a topography data file. We use Toirt-Samhlaigh's volume rendering (Fig. 3(a)), and we can visualize the topography as a surface with adjustable transparency, on its own or together with the volume. Topographic correction (Fig. 3) accounts for the slope of the terrain surface on which the GPR data was collected. Without it, raw GPR data appears as if it had been collected on flat terrain. Topographic correction is necessary before measuring dip and strike, and it makes layers easier to visualize. Topography data is stored as a height map of elevations surveyed at regular intervals, plus samples at peaks and ridges that do not fit the regular intervals so that they are not missed
Fig. 3. GPR data of a sand dune (a) without topographic correction applied and (b) with it applied
Fig. 4. Surface visualization for a topographically correct GPR dataset with (a) no transparency and (b) 50% transparency (with a different transfer function)
by interpolation. We vertically shift each column of data in the volume by the elevation found through bilinear interpolation of the topography data and fill the space above the topography with null data. The surface visualization (Fig. 4) is a triangle mesh height field with a vertex for every column of the volume, with heights linearly interpolated from the topography data. Rendering-order issues arise when rendering semi-transparent surfaces that intersect semi-transparent volumes with slice-based volume rendering. To resolve this, we apply multiple rendering passes. We first render the surface with front-face culling so that the back faces are rendered to the depth buffer. Then, we render the volume and finally re-render the surface with back-face culling. The surface still incorrectly occludes data above it, but this is null data after topographic correction.
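The column-shifting step can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (volume indexed as (x, y, z) with z pointing up, a regularly gridded height map, and a vertical sample size dz); it is not the Toirt-Samhlaigh implementation.

```python
import numpy as np

def bilinear(height_map, fx, fy):
    """Bilinearly interpolate a regular elevation grid at fractional indices."""
    x0, y0 = int(np.floor(fx)), int(np.floor(fy))
    x1 = min(x0 + 1, height_map.shape[0] - 1)
    y1 = min(y0 + 1, height_map.shape[1] - 1)
    tx, ty = fx - x0, fy - y0
    return ((1 - tx) * (1 - ty) * height_map[x0, y0] +
            tx * (1 - ty) * height_map[x1, y0] +
            (1 - tx) * ty * height_map[x0, y1] +
            tx * ty * height_map[x1, y1])

def topo_correct(volume, height_map, dz, null_value=0):
    """Shift each (x, y) column upward by the local surface elevation
    (converted to whole samples via dz) and pad the rest with null data."""
    nx, ny, nz = volume.shape
    max_shift = int(np.ceil(height_map.max() / dz))
    out = np.full((nx, ny, nz + max_shift), null_value, dtype=volume.dtype)
    for x in range(nx):
        for y in range(ny):
            # fractional position of this column within the height map
            fx = x * (height_map.shape[0] - 1) / max(nx - 1, 1)
            fy = y * (height_map.shape[1] - 1) / max(ny - 1, 1)
            shift = int(round(bilinear(height_map, fx, fy) / dz))
            out[x, y, shift:shift + nz] = volume[x, y, :]
    return out
```

A production implementation would process the data per brick rather than looping over individual columns, but the per-column shift and the bilinear lookup are the essential steps.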
6 Interactive Analysis Tools
In this section we describe our GPR data analysis tools. Our Brunton Compass Tool simulates its real-life equivalent, allowing the user to take non-invasive dip and strike measurements of subsurface features, which can be visualized as a non-uniform vector field of gradients. Our Distance Measurement Tool is specialized for collecting distance measurements of interest to geologists. We also discuss tools provided by Toirt-Samhlaigh that are useful for this domain, and our modifications to some of them.
6.1 Brunton Compass Tool
To measure dip and strike we created a virtual analog (Fig. 5(a)) of the Brunton Compass Tool (Fig. 2(a)). This tool provides the user with a plane that can be placed in the VR environment with the 6-DOF input device. Based on the orientation of the plane, the user is provided with the dip and strike, along with the coordinates of the plane’s center. Dip and strike are calculated in the volume’s coordinate system. First, the plane’s normal is transformed into this coordinate system. Note that this coordinate system may have non-uniform scaling, which must be accounted for.
Fig. 5. User interfaces for (a) Brunton Compass and (b) Distance Measurement Tool
Fig. 6. Brunton Compass Tool and non-uniform vector field showing its measurements: (a) close-up while taking a measurement, (b) volume data and gradients, (c) gradients only
Next, the cross product of this normal and the volume's up vector gives the strike line of the surface (Fig. 2(b)). The cross product of the strike line and the normal then gives the steepest gradient within the plane. The dip direction is this steepest gradient oriented toward steepest descent and projected onto the horizontal plane, as defined in Section 3.3. The dip angle is the smallest angle between the steepest gradient and the horizontal plane defined by the volume's up vector, which also contains the strike line; we calculate it as the inverse cosine of the absolute value of the dot product of the normalized dip direction and gradient vectors. The strike angle is the angle between the strike line and north, which we calculate by taking the inverse cosine of the dot product between the normalized strike and north vectors. For ease of use, the user can change the size of the plane, save its current values, and choose between having the plane snap to the 6-DOF input device or perform transformations relative to the input device. The plane always moves relative to the 6-DOF input device; however, with snapping enabled, the plane is transformed to the position and orientation of the device before each movement. We created a user interface that displays the following information: the current measurement (dip, strike, plane center), the last five saved measurements, and buttons to load and save, resize the plane/vector, and toggle snapping and gradients. A non-uniform vector field visualizes dip and strike measurements with a cone at each measurement pointing in the direction of the steepest gradient. Saved measurements from a previous session can also be loaded. Fig. 6(c) is a good example of the usefulness of the Brunton Compass Tool's gradients, because they provide the user with an outline of the GPR data's
Fig. 7. 1D Transfer Function editors: (a) piecewise-linear and (b) multiple Gaussians
subsurface structure. The vector field does not render the strike lines because we felt they added visual clutter. Fig. 6 shows the Brunton Compass Tool in use.
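The dip and strike calculation described above can be summarized in a short sketch. This is our own NumPy illustration of the stated geometry, not the tool's actual code; the volume scale factors and the up/north axis conventions are assumptions, and the degenerate case of a perfectly horizontal plane is not handled.

```python
import numpy as np

def dip_and_strike(plane_normal, scale=(1.0, 1.0, 1.0),
                   up=(0.0, 0.0, 1.0), north=(0.0, 1.0, 0.0)):
    """Dip and strike angles (in degrees) of a measurement plane given its normal."""
    up, north = np.asarray(up, float), np.asarray(north, float)
    # account for possible non-uniform scaling of the volume coordinate system
    n = np.asarray(plane_normal, float) / np.asarray(scale, float)
    n /= np.linalg.norm(n)

    strike = np.cross(n, up)                  # horizontal strike line, lies in the plane
    strike /= np.linalg.norm(strike)
    gradient = np.cross(strike, n)            # steepest gradient within the plane
    gradient /= np.linalg.norm(gradient)
    if np.dot(gradient, up) > 0:              # orient it toward steepest descent
        gradient = -gradient
    dip_dir = gradient - np.dot(gradient, up) * up   # projection onto the horizontal
    dip_dir /= np.linalg.norm(dip_dir)

    dip = np.degrees(np.arccos(abs(np.dot(dip_dir, gradient))))
    strike_angle = np.degrees(np.arccos(np.clip(np.dot(strike, north), -1.0, 1.0)))
    return dip, strike_angle

# A plane tilted 45 degrees and dipping toward north: east-west strike line.
print(dip_and_strike((0.0, 1.0, 1.0)))        # approximately (45.0, 90.0)
```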
6.2 Distance Measurement Tool
To measure distances between points in 3D space, we implemented our own Distance Measurement Tool (Fig. 5(b)). Vrui provides a distance measurement tool, but upon initial testing the geologist evaluating our system felt that its interface provided too many options and too much information, and he requested a version specialized to GPR-related measurements. In our tool, the user creates start and end points with the 6-DOF input device to take a measurement. A marker is drawn at each point along with a connecting line segment. The user may opt to add labels to measurements when saving them: horizontal x, horizontal y, or vertical z. The following information is displayed on the tool's interface: the start and end points, the distance, buttons for labels, and a 'Save' button to save measurements to file.
6.3 Toirt-Samhlaigh Tools
Toirt-Samhlaigh provides useful analysis tools. With the 1D Transfer Function editor, users can modify the opacity and color values that Toirt-Samhlaigh applies to the data. They can edit the opacity map using piecewise-linear or multiple Gaussian functions (Fig. 7). Color values are linearly interpolated between control points. Users can save and load transfer functions. The lighting feature allows the user to apply a directional light to the data. The user can change the color of the ambient, diffuse, and specular lighting, alter the direction of the light, and save and load lighting settings. Fig. 8 shows an example of lighting being applied to GPR data, as well as its user interface. Users can attach a clipping plane tool to the 6-DOF tracked input device, and use a slicing tool to render axis-aligned slices of the GPR data. We enhanced the slicing tool to allow the user to treat these slices as clipping planes. Fig. 9 shows the slicing tools applied to GPR data.

Fig. 8. Lighting being applied to GPR data with the lighting interface

Fig. 9. Applying the Slicing Tool to GPR data to (a) view axis-aligned slices and (b) clip the volume
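As an illustration of the 1D transfer function editing shown in Fig. 7, the following C++ sketch evaluates a piecewise-linear opacity map and a sum-of-Gaussians opacity map over a normalized scalar value. It is a hedged example rather than Toirt-Samhlaigh's implementation, and the structure names are assumptions.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Control point of a piecewise-linear opacity map: scalar value in [0,1] -> opacity in [0,1].
struct OpacityPoint { double value; double opacity; };

// Evaluate a piecewise-linear opacity map; points are assumed sorted by value.
double piecewiseLinearOpacity(const std::vector<OpacityPoint>& pts, double v) {
    if (pts.empty()) return 0.0;
    if (v <= pts.front().value) return pts.front().opacity;
    if (v >= pts.back().value) return pts.back().opacity;
    for (std::size_t i = 1; i < pts.size(); ++i) {
        if (v <= pts[i].value) {
            double t = (v - pts[i - 1].value) / (pts[i].value - pts[i - 1].value);
            return (1.0 - t) * pts[i - 1].opacity + t * pts[i].opacity;   // linear interpolation
        }
    }
    return pts.back().opacity;
}

// One Gaussian lobe of a multi-Gaussian opacity map.
struct GaussianLobe { double center; double width; double height; };

// Evaluate the sum of Gaussian lobes, clamped to [0,1].
double gaussianOpacity(const std::vector<GaussianLobe>& lobes, double v) {
    double a = 0.0;
    for (const GaussianLobe& g : lobes) {
        double d = (v - g.center) / g.width;
        a += g.height * std::exp(-0.5 * d * d);
    }
    return std::min(1.0, std::max(0.0, a));
}

Color values would be handled analogously to the piecewise-linear case, interpolating each RGB channel between control points.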
7 Results and Lessons Learned
The geologist stated that this “application improves current tools used by researchers or practitioners who are interested in these datasets.” He tested this application in a six-sided CAVE-like display (1920x1920 pixels per side, active stereo, using two 1080p projectors that overlap), on a four-sided CAVE, on a low-cost 3DTV-based immersive display (67” with 1080p at half-resolution per eye) similar to the IQ-Station [32], and on a non-immersive laptop. All were sufficient for visualization and data exploration. The immersive displays offered the advantage of 3D interaction for dip, strike, and distance measurements. The geologist listed “interactive tools; ability to see [the] dataset from different view points; extraction of quantitative information on dip and strike” as major positives. He found that interactive tools “enable extraction of quantitative information from dataset.” “Compared to field investigations, working in the CAVE is a lot easier and quicker,” he said, and “to perform actual measurements of strike and dip on the beds and surfaces imaged by these data, we would have to excavate the dune and expose these features, which would be logistically difficult and not feasible in most cases.” He often moved to a different viewpoint to quickly confirm precise placement of the Brunton Compass Tool. And we observed that he made use of the wider field of view of the CAVE-style displays to obtain more viewpoints and as workspace for user interface elements. The 3DTV-based display became cumbersome to use as the smaller screen became cluttered with user interface elements. Negative feedback from the geologist included “uncertainty in dip and strike measurements with scaling of datasets.” Measuring angles on a non-uniformly scaled volume can be disconcerting. Even if calculations are correct, results are
non-intuitive. And there can be “variation in dip measurement due to step size” if the Brunton Compass Tool is not scaled larger than the step size. Lighting played a more important role in understanding the structure of the data than we first expected. Gradients are small in this data, making it difficult to obtain depth cues from occlusion or parallax (stereo or motion). With lighting applied, the increased visual gradient provided its own depth cues and increased the visual difference between nearby viewpoints, improving effectiveness of the other depth cues. Interacting with the lighting tool also provides structural cues as shading on surfaces changes. The geologist had difficulty using a pointer to interact with small (2”) 2D widgets just out of reach, but he was adept at placing and orienting the virtual brunton compass. For example, when using immersive displays, the Gaussian transfer function editor was much easier to use than the piece-wise linear transfer function editor, because it required fewer precise selection actions by the user to obtain a desired function. Sliding the Gaussian around by its center was also a quick way to explore a new dataset for interesting features. Perhaps a 2D touch tablet would be better for these 2D widgets, but carrying it might cause fatigue. The ability to log measurements while in the environment is crucial, in lieu of a notepad. Similarly important is saving and reloading as much of the system state as possible, whether to resume later or to show a colleague.
8 Conclusions
Existing tools for visualizing GPR data are bound by the limitations of a typical desktop display and input devices. We have presented a way of overcoming the limitations of that environment by creating a system that successfully allows for the visualization and interactive analysis of GPR datasets in an IVE. In the IVE, the user can explore the data from arbitrary viewpoints by moving around and inside the data. The tracked input devices provided the user with more natural ways of interacting with the data than are possible with typical desktop displays and input devices, as seen in Section 7. We created two immersive analysis tools which a geologist found very useful. The Distance Measurement Tool allows users to take specialized distance measurements, while the Brunton Compass Tool allows users to take dip and strike angle measurements. The topographic correction and surface visualization help the user understand the shape of the terrain's surface. Additionally, the system provides many techniques for the user to view and interact with the data, such as changing its orientation and position, applying lighting, and applying transfer functions. The system is not limited only to GPR data, however. Our enhancements to Toirt-Samhlaigh can be applied to other data types. Also, saving and loading functionality increases Toirt-Samhlaigh's user friendliness.
9 Future Work
This system would benefit from more user friendliness. A menu should be created to allow the selection of a data file to load or save. A tool to aid in data analysis
could restrict rendering of data to a user-defined shape. Another tool could automatically or semi-automatically segment the data into layers which the user could then peel off. The last tool could generate isosurfaces to help visualize the structure of the subsurfaces. The ability to change the scale of the volume on any axis, quickly swap datasets while the program is running, or render multiple volumes would be useful. We would also like to support additional data file formats. Finally, we plan to investigate bridging the gap between incorporating immersive visualization into scientific workflows and generating images for publication. Acknowledgements. This work is funded by the U.S. Army's RDECOM-STTC under Contract No. N61339-04-C-0072 at the Desert Research Institute. We would also like to thank Patrick O'Leary, author of Toirt-Samhlaigh, without which this work would not have been possible, and Phil McDonald for his contributions to the SEG-Y data loader.
References 1. Bristow, C., Jol, H.: An introduction to ground penetrating radar (GPR) in sediments. Geological Society London Special Publications 211(1), 1–7 (2003) 2. Jol, H., Bristow, C.: GPR in sediments: advice on data collection, basic processing and interpretation, a good practice guide. Geological Society London Special Publications 211(1), 9–27 (2003) 3. Nuzzo, L., Leucci, G., Negri, S., Carrozzo, M., Quarta, T.: Application of 3D visualization techniques in the analysis of GPR data for archaeology. Annals of Geophysics 45(2), 321–337 (2009) 4. Sigurdsson, T., Overgaard, T.: Application of GPR for 3-D visualization of geological and structural variation in a limestone formation. J. Applied Geophysics 40(13), 29–36 (1998) 5. Goodman, D.: GPR-SLICE Software (2010), http://www.gpr-survey.com/ 6. Malå GeoScience: Windows based acquisition and visualization software (2010), http://www.idswater.com/water/us/mala_geoscience/data_acquisition_software/85_0g_supplier_5.html 7. AEGIS Instruments: Easy 3D - GPR Visualization Software (2010), http://www.aegis-instruments.com/products/brochures/easy-3d-gpr.html 8. Halliburton: GeoProbe Volume Interpretation Software (2011), http://www.halliburton.com/ps/Default.aspx?navid=220&pageid=842 9. O'Leary, P.: Toirt-Samhlaigh (2010), http://code.google.com/p/toirt-samhlaigh/ 10. Ropinski, T., Steinicke, F., Hinrichs, K.H.: Visual exploration of seismic volume datasets. J. WSCG 14, 73–80 (2006) 11. Billen, M., Kreylos, O., Hamann, B., Jadamec, M., Kellogg, L., Staadt, O., Sumner, D.: A geoscience perspective on immersive 3D gridded data visualization. Computers & Geosciences 34(9), 1056–1072 (2008) 12. Chopra, P., Meyer, J., Fernandez, A.: Immersive volume visualization of seismic simulations: A case study of techniques invented and lessons learned. In: IEEE Visualization, pp. 497–500 (2002) 13. Winkler, C., Bosquet, F., Cavin, X., Paul, J.: Design and implementation of an immersive geoscience toolkit. In: IEEE Visualization, pp. 429–556 (1999)
14. Fröhlich, B., Barrass, S., Zehner, B., Plate, J., Göbel, M.: Exploring geo-scientific data in virtual environments. In: IEEE Visualization, pp. 169–173 (1999) 15. Dorn, G., Touysinhthiphonexay, K., Bradley, J., Jamieson, A.: Immersive 3-D visualization applied to drilling planning. The Leading Edge 20(12), 1389–1392 (2001) 16. LaFayette, C., Parke, F.I., Pierce, C.J., Nakamura, T., Simpson, L.: Atta texana leafcutting ant colony: a view underground. In: ACM SIGGRAPH 2008 Talks, vol. 53(1) (2008) 17. Katz, B., Warusfel, O., Bourdot, P., Vezien, J.: CoRSAIRe–Combination of Sensori-motor Rendering for the Immersive Analysis of Results. In: Proc. Intl. Workshop on Interactive Sonification, York, UK, vol. 3 (2007) 18. Kreylos, O., Bawden, G.W., Kellogg, L.H.: Immersive visualization and analysis of LiDAR data. In: Proc. Intl. Symposium on Advances in Visual Computing, pp. 846–855 (2008) 19. Hagedorn, J., Joy, P., Dunkers, S., Peskin, A., Kelso, J., Terrill, J.: Measurement Tools for the Immersive Visualization Environment: Steps Toward the Virtual Laboratory. J. Research of the National Institute of Standards and Technology 112(5) (2007) 20. Griffin, S., Pippett, T.: Ground penetrating radar. Geophysical and Remote Sensing Methods for Regolith Exploration. CRC LEME Open File report 144, 80–89 (2002) 21. Subsurface Detection: Subsurface Detection. If it's in the ground, we'll find it (2010), http://www.subsurface.com.au/GPR.html 22. Bristow, C., Duller, G., Lancaster, N.: Age and dynamics of linear dunes in the Namib Desert. Geology 35(6), 555–558 (2007) 23. Reading, H.: Sedimentary environments: processes, facies, and stratigraphy. Wiley-Blackwell, Oxford (1996) 24. Brunton Inc.: Brunton geo pocket transit (2010), http://www.brunton.com/product.php?id=190 25. Wikipedia: Strike and dip (2010), http://en.wikipedia.org/wiki/Strike_and_dip 26. Norris, E., Faichney, A.: SEG Y rev 1 Data Exchange format. Technical Standards Committee SEG (Society of Exploration Geophysicists) (2002) 27. Cruz-Neira, C., Sandin, D.J., DeFanti, T.A., Kenyon, R.V., Hart, J.C.: The CAVE: audio visual experience automatic virtual environment. Commun. ACM 35(6), 64–72 (1992) 28. O'Leary, P., Coming, D., Sherman, W., Murray, A., Riesenfeld, C., Peng, V.: Enabling Scientific Workflows Using Immersive Microbiology. In: DVD Created for and Used in IEEE Visualization Conf.: Workshop on Scientific Workflow with Immersive Interfaces for Visualization (2008) 29. Salama, C., Kolb, A.: A vertex program for efficient box-plane intersection. In: Proc. Vision, Modeling, and Visualization, pp. 115–122 (2005) 30. Ruijters, D., Vilanova, A.: Optimizing GPU volume rendering. J. WSCG 14(1-3), 9–16 (2006) 31. Kreylos, O.: Environment-independent VR development. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 901–912. Springer, Heidelberg (2008) 32. Sherman, W.R., O'Leary, P., Whiting, E.T., Grover, S., Wernert, E.A.: IQ-Station: a low cost portable immersive environment. In: Proc. Intl. Symposium on Advances in Visual Computing, ISVC 2010, pp. 361–372 (2010)
Handymap: A Selection Interface for Cluttered VR Environments Using a Tracked Hand-Held Touch Device
Mores Prachyabrued¹, David L. Ducrest², and Christoph W. Borst¹
¹ University of Louisiana at Lafayette, Lafayette, Louisiana, USA
² Louisiana Immersive Technologies Enterprise, Lafayette, Louisiana, USA
Abstract. We present Handymap, a novel selection interface for virtual environments with dense datasets. The approach was motivated by shortcomings of standard ray-casting methods in highly cluttered views such as in our visualization application for coalbed methane well logs. Handymap uses a secondary 2D overview of a scene that allows selection of a target when it is occluded in the main view, and that reduces required pointing precision. Reduced sensitivity to pointing precision is especially useful for consumer-level VR systems due to their modest tracking precision and display sizes. The overview is presented on a tracked touch device (iPod Touch) that is also usable as a general VR wand. Objects are selected by a tap or touch-move-release action on the touch surface. Optionally, redundant visual feedback and highlighting on the main display can allow a user to keep focus on the main display and may be useful with standard wand interfaces. Initial user feedback suggests Handymap can be a useful selection interface for cluttered environments but may require some learning.
1 Introduction
We present Handymap, a novel selection interface that uses a tracked hand-held touch device to address occlusions in highly cluttered views and that does not hinge on ray pointing precision. We are developing a VR-based visualization system for geological interpreters to interpret well log data (spontaneous potential and resistivity curves) from wells situated in Northern Louisiana. Fig. 1 (left) shows a scene from the application. The database contains several hundred well logs that the application can display, creating cluttered views even when smaller subsets are displayed. This causes selection problems with ray-casting interfaces [1]. Standard ray casting uses a virtual ray extended from a hand or controller to select the first intersected object. In a cluttered view, it can be difficult to select a target due to occlusions. A standard ray interface requires navigation to resolve difficult occlusions, which may increase selection time or disturb the view context. In a less extreme case, occlusions reduce the selection target area, making ray-casting slower and less accurate [2] due to increased required pointing precision. The problem appears especially when a user selects distant targets, which occurs in our application in which the user overviews a large collection of well logs. A small hand movement becomes a large distant ray movement, reducing pointing precision. This problem is also notably increased in consumer-level VR setups with modest tracking precision and display sizes.
Fig. 1. Left: Low-cost well log visualization system (Mitsubishi 3D DLP TV and iPod Touch with markers for OptiTrack camera-based tracking) showing well logs (curves) hanging underneath a terrain generated from SRTM (Shuttle Radar Topography Mission) data. The iPod Touch presents an overview of the well log scene that resolves occlusion in the main view and supports rapid touch-based selection. A middle vertical line represents a virtual ray in the main view, which is locked during a selection step. Right: (Constructed conceptual illustration) Well log “picks” illustrated as horizontal lines with associated depth and text annotation. A highlighted pick on the left log is being associated with a pick on the right log by a drag gesture.
Various selection techniques address occlusions or pointing precision problems (see Section 2). However, they may not be adequate for our well log application, so we developed Handymap selection. It exploits the scene structure that has a well log dataset distributed on a terrain surface (well log curves hanging underneath) by using a secondary 2D overview of the scene as shown in Fig. 1 (left). It presents the overview on a tracked touch device (iPod Touch) that is also usable for conventional ray interactions. The overview represents well logs with circles and labels surrounding a virtual ray to resolve occlusion in the main view. Although a virtual ray extends from the iPod Touch similar to standard ray-casting interface, it is not used for conventional intersection-based selection. Instead, users touch the handheld display with a thumb to select a well log from the overview. Handymap visuals can also be used with a standard controller (e.g., Logitech gamepad or InterSense Wand) by presenting the overview on a main display and requiring joystick or pointing interactions. However, the iPod Touch interface allows direct touch selection that may be faster, supports intuitive gesture-based overview adjustments (zooming and panning), and reduces clutter on the main display. Additionally, the touch interface can further aid interpretation, e.g., to improve management of well log “picks”, which are depth levels selected on logs for geologic relevance. The touch surface provides direct depth selection for picks as well as efficient text annotation via its virtual keyboard that allows faster selection of characters than standard VR wand techniques. Geological interpreters have requested to manipulate multiple well logs on the iPod Touch as it may help relate picks between well logs (Fig. 1, right). The interpreters could use the related picks to generate a coarse subterranean surface representation of underground composition.
After describing the Handymap interface, we report initial user feedback. It suggests that Handymap will improve selection in our well log application, and suggests important features and considerations for the interface.
2 Related Work
2.1 Selection Techniques Addressing Occlusion or Precision Problems
We summarize relevant ray-based approaches due to ray interaction dominance and because studies [3, 4, 5] have shown it has better selection performance than techniques like virtual hand-based selection [6]. Olwal and Feiner [7] presented Flexible Pointer, which allows users to bend a virtual ray to point to fully or partially obscured objects. Wyss, Blach, and Bues [8] presented iSith, a technique that addresses occlusions by using an intersection of two rays to define a target. Grossman and Balakrishnan [5] presented Depth Ray, Lock Ray, Flower Ray, and Smart Ray techniques that include mechanisms to disambiguate a target from multiple intersected objects along the ray. All those techniques require the ray(s) to intersect with the target and suffer from limited pointing precision at long distances and with tracker jitter. Also, Flexible Pointer and iSith require additional tracked input devices. Selection techniques such as flashlight (Liang and Green [9]) or aperture (Forsberg, Herndon, and Zeleznik [10]) lower required pointing precision by replacing the virtual ray with a conic selection volume. Frees, Kessler, and Kay [11] presented the PRISM enhanced version of ray-casting that increases user pointing precision by dynamically adjusting the control/display ratio between hand and ray motions. All those techniques do not work well in highly cluttered views or do not address the case of a fully occluded target. Kopper, Bacim, and Bowman [12] recently presented Sphere-casting refined by QUAD-menu (SQUAD) that addresses occlusions and does not require high pointing precision. It uses sphere volume to define an initial set of selectable objects, and it progressively refines the set using QUAD-menu selection until the set contains only the target. However, evaluation showed that it may not work well with highly cluttered environments due to the required number of refinements. Also, its selection process does not preserve spatial information, while we want a technique that shows some spatial relations.
2.2 Handheld Device Interfaces for Virtual Environments
Aspin and Le [13] compared a tracked tablet PC to a tracked gamepad in an immersive projection display environment. They found that using the tablet PC created a greater sense of immersion. Users developed a stronger relationship with the virtual environment because of the interactive visuals and tactile sensation of the tablet. Olwal and Feiner [14] leveraged the high visual and input resolution of a touch display on a tracked mobile device for improved interaction on a large touch display (zooming and selection of small targets). Their user study showed overall higher user preference for this approach over direct touch interaction on the large display. Katzakis and Hori [15] evaluated use of accelerometers and magnetometer on a mobile phone for a 3D rotation task. Their results showed it to be faster than mouse and tablet interactions.
Kim et al. [16] presented a navigation technique called Finger Walking in Place (FWIP) using finger motions resembling leg walking motions on a multi-touch device. This was later adapted to iPhone/iPod Touch for navigation in a CAVE [17]. Song et al. [18] presented volume data interactions using a multi-touch wall display and iPod Touch. In addition to using multi-touch gestures on the iPod Touch, slicing plane position (on the wall display) could be controlled by sliding the iPod Touch on the wall display, with orientation of the slicing plane controlled by tilt sensing on the iPod Touch. Slices could then be annotated on the iPod touch.
3 Handymap Design
3.1 Map Overview
Handymap presents a 2D overview of the virtual environment. We consider different perspectives for the overview, based on projections along a world-up axis, terrain-up axis, or controller-up axis. In any case, the overview represents a 3D region in the environment, where position and orientation of the region change with controller pose and inputs. Well logs in this region are represented by labeled icons on the overview. The overview can be zoomed and panned by scaling and translating the region. To address hand instability and tracker jitter, Handymap incorporates a ray-locking behavior where the overview becomes static, i.e., independent of additional virtual ray (controller) motion during selection and overview adjustment.
3.2 Handymap Interaction
We consider two main interaction types: overview and scene. Overview gestures control prominent features within the overview: well log selection, overview zooming, and overview panning. Overview gestures were our primary focus, but we also incorporate scene-specific gestures to manipulate the scene, e.g., world-grab, view azimuth/elevation, scene panning, and terrain scaling. Overview gestures (Fig. 2) rely on the user's primary hand and especially the thumb. Well log selection uses a touch-refine-release approach. The user touches the iPod display to initiate the interaction, tentatively indicating the well log closest to the touch point (Fig. 2a). The user can change (refine) the indication by moving the thumb closer to another icon while maintaining touch. Finally, the user releases the touch to select an indicated well log. During this interaction, an indicated well log is highlighted both on the iPod and on the main display. The user can additionally pan the overview region during selection refinement by dragging the thumb to any edge of the display (Fig. 2c). This is a temporary pan and is forgotten when the touch ends. To cancel selection, the user releases the touch at any edge of the display (in a panning zone). To zoom the overview, the user touches the display with the primary thumb and forms a pinch gesture with the secondary hand (Fig. 2b). Ray-locking can be enabled as a system parameter. If enabled, the overview region is independent of ray (iPod) motion during overview gestures. The best default behavior depends on the VR system type and user characteristics, e.g., consumer-level VR setups with notable tracker jitter may require ray-locking enabled.
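To make the touch-refine-release step concrete, the following C++ sketch picks the icon nearest to the current touch point. It is an illustrative fragment; the OverviewIcon structure and the use of overview screen coordinates are assumptions, not the actual Handymap code.

#include <vector>

// A well log icon as placed on the 2D Handymap overview (hypothetical structure).
struct OverviewIcon {
    int wellLogId;   // identifier of the well log this icon represents
    double x, y;     // icon position in overview (screen) coordinates
};

// Return the id of the icon closest to the touch point, or -1 if there are no icons.
// Called on every touch-move event so the tentative indication can be refined,
// and once more on touch release to commit the selection.
int nearestIcon(const std::vector<OverviewIcon>& icons, double touchX, double touchY) {
    int bestId = -1;
    double bestDistSq = 0.0;
    for (const OverviewIcon& icon : icons) {
        double dx = icon.x - touchX;
        double dy = icon.y - touchY;
        double distSq = dx * dx + dy * dy;
        if (bestId == -1 || distSq < bestDistSq) {
            bestDistSq = distSq;
            bestId = icon.wellLogId;
        }
    }
    return bestId;   // the indicated log is then highlighted on the iPod and on the main display
}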
Fig. 2. Overview gestures for well log selection and overview adjustment: (a) Touch-refine-release to highlight and select a well log. (b) Pinch gestures for zooming the overview. (c) Drag gestures for panning the overview (entering the red area pans the overview forward).
Prototype scene gestures typically use the secondary hand. The user paws the iPod display with two fingers next to each other to pan the scene along the world floor plane. Pawing with two fingers separated adjusts view elevation, and rotating one finger about the other finger adjusts azimuth. To uniformly scale the scene, the user pinches two fingers. For grab-the-world type scene manipulation, the user taps once with the primary thumb, then taps and holds to clutch (grab), then moves the (tracked) iPod Touch in 3D space, and finally releases the touch to end the grab. We use a state machine to prevent distinct gestures from overlapping. A refine gesture (both target indication and overview panning) can transition to any other gesture except world-grab. Overview zooming can transition to refining but not to scene gestures. A fast single tap will not result in selection but is used to detect world-grab. A scene gesture must end (no finger on the surface) before another gesture is detected.
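The gesture state machine just described can be summarized by a guard function such as the following C++ sketch; the Gesture enumeration and the handling of world-grab detection are assumptions, not the actual implementation.

enum class Gesture {
    None,           // no finger on the surface
    Refine,         // target indication / overview panning with the primary thumb
    OverviewZoom,   // primary-thumb touch plus secondary-hand pinch
    ScenePan,
    SceneElevationAzimuth,
    SceneScale,
    WorldGrab       // detected via a fast single tap followed by tap-and-hold
};

// Returns true if the interface may switch from the current gesture to the candidate one.
bool canTransition(Gesture current, Gesture candidate) {
    switch (current) {
        case Gesture::None:
            return true;                              // any gesture may start from the idle state
        case Gesture::Refine:
            return candidate != Gesture::WorldGrab;   // refine may become any gesture except world-grab
        case Gesture::OverviewZoom:
            return candidate == Gesture::Refine;      // zooming may only fall back to refining
        default:
            // scene gestures must end (no finger on the surface) before another gesture is detected
            return candidate == Gesture::None;
    }
}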
3.3 Handymap Perspective and Overview Calculation
Handymap perspective determines how the scene is projected for the overview. It affects occlusions in the calculated 2D overview and consistency between object layout in the overview and in the main view. We considered three perspectives:
1. World-based: The overview is displayed like a view down from the top of the main display, i.e., parallel to the real world floor (Fig. 3a).
2. Terrain-based: The overview is displayed as though it is parallel to the terrain, i.e., view direction normal to the terrain's principal plane (Fig. 3b).
3. Controller-based: The overview is displayed as though it is parallel to the controller face, i.e., view direction defined by the “controller-up” axis (Fig. 3c).
In all cases, the overview still rotates according to orientation of the controller (iPod) with respect to an axis parallel to the projection direction.
Fig. 3. Three Handymap perspectives considered: (a) World-based perspective. (b) Terrain-based perspective. (c) Controller-based perspective. The figure represents the controller (e.g., iPod Touch) and views from the main display. {W}, {T}, {C} refer to fixed world (main display), terrain, and controller coordinate frames, respectively.
3.3.1 Overview Calculation
Given a Handymap perspective, a Handymap coordinate frame is calculated as detailed in the following subsections. The 3D region mapping to the overview is centered and aligned on this coordinate frame. Mapping well log positions for Handymap icons is done by transforming positions to this coordinate frame (reference positions near the terrain surface). The overview shows only well logs whose Handymap coordinates fall within the mapped region based on current scale.
3.3.2 World-Based Perspective
The world-based perspective provides a consistent object layout between the overview and the main view, e.g., objects to the left of the virtual ray in the main view (from user's usual perspective, independent of controller rotations around the ray axis) are represented on the left side of the virtual ray representation in the overview. The world-based perspective has occlusions in the overview when the terrain tilts significantly away from horizontal (with respect to the world). The Handymap coordinate frame origin is computed as a fixed point on the virtual ray projected onto the world floor or horizontal (XZ) plane. We chose this fixed point by considering a user's typical interaction depth (i.e., typical distance between user and dataset) so that the overview region falls on the terrain where objects of interest reside. The Handymap up (Y) axis matches world up (Y) axis. The Handymap forward (-Z) axis is computed as the virtual ray direction vector projected to the world floor plane. The Handymap left (-X) axis is found by axes cross product. We chose world-based perspective as the default perspective because it provides a consistent layout and is not limited by our well log data. Since our terrain is nearly planar, it is uncommon and unnecessary to rotate the terrain far from horizontal. With reasonable scale for the overview region, our well log application has no occlusion in the overview with world-based perspective.
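One way to implement the world-based frame and icon mapping described above is sketched below in C++. The vector helpers, the handedness convention, and the square mapped region are assumptions, and the fragment presumes the virtual ray is not vertical.

#include <cmath>

// Minimal vector helpers (assumed; any math library could be used instead).
struct Vec3 { double x, y, z; };
static Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static Vec3 normalize(Vec3 v) { double l = std::sqrt(dot(v, v)); return { v.x/l, v.y/l, v.z/l }; }

struct Frame { Vec3 origin, right, up, forward; };   // Handymap coordinate frame

// World-based perspective: project a fixed point on the virtual ray onto the world
// floor plane (y = 0), use the world up axis, and project the ray direction onto the floor.
Frame worldBasedFrame(Vec3 rayOrigin, Vec3 rayDir, double fixedDistance) {
    Vec3 p = { rayOrigin.x + fixedDistance * rayDir.x,
               rayOrigin.y + fixedDistance * rayDir.y,
               rayOrigin.z + fixedDistance * rayDir.z };
    Frame f;
    f.origin  = { p.x, 0.0, p.z };                        // fixed point projected onto the world floor
    f.up      = { 0.0, 1.0, 0.0 };                        // world up (Y) axis
    f.forward = normalize({ rayDir.x, 0.0, rayDir.z });   // ray direction projected to the floor plane
    f.right   = cross(f.forward, f.up);                   // +X for a right-handed frame with forward = -Z
    return f;
}

// Map a well log reference position into overview (2D) coordinates and test whether
// it falls inside the square mapped region of half-size `extent` centered on the frame.
bool mapToOverview(const Frame& f, Vec3 wellLogPos, double extent, double& outX, double& outY) {
    Vec3 d = sub(wellLogPos, f.origin);
    double x = dot(d, f.right);       // lateral offset in the overview
    double y = dot(d, f.forward);     // offset along the projected ray direction
    if (std::fabs(x) > extent || std::fabs(y) > extent) return false;   // outside mapped region
    outX = x; outY = y;
    return true;
}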
3.3.3 Terrain-Based Perspective
The terrain-based perspective better preserves object spacing and eliminates occlusion in the overview when objects are distributed on the terrain's surface (assuming reasonable overview scale). However, the object layout in the overview may be inconsistent with the main view, e.g., objects to the left of the virtual ray in the main view (defined as before) could be on the right side of the overview if the terrain is flipped upside down in the world. The Handymap coordinate frame for terrain-based perspective is computed similarly to world-based perspective (Sect. 3.3.2) except that the calculation uses terrain floor (XZ) plane and terrain up (Y) axis in place of world floor and world up. With normal constrained terrain rotation in our well log application, there is no layout consistency problem. Since there is also no occlusion in the overview, the terrain-based perspective works about as well as the world-based perspective.
3.3.4 Controller-Based Perspective
In the controller-based perspective, the Handymap coordinate frame is simply the controller frame translated to the fixed point (Sect. 3.3.2) on the virtual ray. The controller-based perspective suffers from both occlusion (in the overview) and layout consistency problems depending on controller orientation. However, it provides the user with the most full and direct control of the overview. The user is free to adjust the overview to avoid these problems. The controller-based perspective may be a good option for 3D data inspected from more angles, but it does not provide notable benefit in our well log application.
3.3.5 Zooming and Panning the Overview
Zooming the overview is accomplished by scaling the overview region. We chose a default region size intended for good distribution of well log icons in the overview. Panning the overview is accomplished by translating the Handymap frame origin on its local view plane axes.
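Zooming and panning then amount to scaling the mapped region and translating the frame origin on its local view-plane axes; a minimal sketch with assumed field names is:

// Overview region state (hypothetical): half-size of the mapped square and pan offsets.
struct OverviewRegion {
    double extent;          // half-size of the 3D region mapped to the overview
    double offsetRight;     // accumulated pan along the frame's right axis
    double offsetForward;   // accumulated pan along the frame's forward axis
};

// Pinch zoom: scale the mapped region by the gesture's scale factor.
void zoomOverview(OverviewRegion& r, double pinchScale) {
    r.extent *= pinchScale;
}

// Drag pan: translate the Handymap frame origin on its local view plane axes.
void panOverview(OverviewRegion& r, double dRight, double dForward) {
    r.offsetRight   += dRight;
    r.offsetForward += dForward;
}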
4 User Feedback
We solicited feedback from 5 users about their expectations and suggestions for Handymap based on 30-45 minute sessions. The users were three geosciences domain experts (one with VR experience and previous experience with the application) and two VR experts with some prior exposure to the application. We used the equipment and scene shown in Figure 4. We asked each user to compare Handymap to standard ray-casting for well log selection in a cluttered view (Fig. 4). All users believed Handymap improves selection (easier and faster) because the overview resolves occlusions, allowing selection without navigation. This is especially appreciated by the domain experts since it does not disturb their interpretation context (e.g., when they want to select two well logs from the same view for comparison). One domain expert and one VR expert stated that ray-casting may be better when a target is close and unobstructed. Two domain experts indicated that Handymap requires learning. One domain expert estimated that it took 10 minutes for proficiency but still expressed a preference for Handymap.
Fig. 4. Environment on Visbox HD13 with Intersense IS-900 tracking of controller and head
We asked for feedback during both selection of specified targets and free exploration. One domain expert and one VR expert stated that clear presentation of log labels in the main view is important to allow them to find the matching target on the iPod Touch confidently. One VR expert stated that they could locate a target in the overview easily by relating it to the virtual ray. Two domain experts and one VR expert commented that overview zooming is useful, since it allows them to use Handymap for a larger region and allows finer interaction. One domain expert and one VR expert commented that additional terrain representation on Handymap can be useful, but should be optional. Geologists usually consider topography irrelevant to these interpretations. Two domain experts commented that seeing a target in the main view when absent in the overview was confusing, demonstrating the importance of a reasonable overview scale. One VR expert commented that fingers interfere with text reading on the iPod Touch. All users commented that a focus shift between the main display and iPod Touch was a drawback but still expressed a preference for Handymap. One domain expert suggested that tilting the touch surface toward the user's eyes during interaction would reduce focus shift. Another domain expert suggested selection should not be cancelled when releasing touch in a panning zone. Two VR experts suggested that additional representations of overview region and touch point in the main view may be helpful. We also asked each user to test display alternatives for Handymap visuals. One case used main display visuals instead of iPod visuals, placing the overview at the bottom center and aligned with the main display surface. One VR expert stated that the overview cluttered the display and was confusing, further stating that the overview did not feel like a top-down view due to the alignment. They suggested that tilting the overview may help. The other users liked the reduced focus shift, with two domain experts stating that it helps mental focus. A domain expert stated a large overview is helpful. The other two domain experts stated that a single display helped them relate the overview to the main view. One VR expert stated it avoids finger interference with labels. Another approach was to omit the visual overview and mainly use the Handymap for touch input. In this case, the visual cue, on the main display, was to dynamically
highlight the log corresponding to thumb position. One domain expert stated that selecting from the overview was easier. The other users liked the reduced focus shift. One VR expert stated that it related interaction to the main view. One domain expert and one VR expert commented that overview panning is useful since it allows continuous interaction even without looking at the iPod. One VR expert suggested that visual cues in the main view for panning would help, or to limit panning range. Based on responses, we believe that the touch input aspect of the iPod was more important than its visual display, and extending visual feedback associated with Handymap on the main display is a good next step. We expect the touch display surface to further be useful for other tasks in our application, as suggested in the introduction.
5 Conclusion and Future Work
We summarized the occlusion and pointing precision problems with standard ray-casting in cluttered virtual environments. We described the Handymap selection interface to address these problems in a well log visualization application. User feedback suggests Handymap can be a useful interface for cluttered environments, but that it may require some practice. Easy association of a target in the main view with the corresponding representation in the overview, touch input surface, redundant feedback in the main view, and overview zooming and panning features are important. Future work should include formal evaluation of Handymap with comparison to other techniques and understanding of design tradeoffs. For example, we want to evaluate the iPod visual display for Handymap to see if it impacts performance over presenting visuals on the main display, considering the focus shift between the touch display and the main display. We will consider extensions to Handymap, e.g., additional 3D representations of overview region and touch point in the main view, or by investigating auto-scaling of the overview region. Finally, we will continue to extend our iPod Touch interface for well log interpretation.
References 1. Mine, M.R.: Virtual Environment Interaction Techniques. Technical Report, University of North Carolina at Chapel Hill (1995) 2. Steed, A., Parker, C.: 3D Selection Strategies for Head Tracked and Non-Head Tracked Operation of Spatially Immersive Displays. In: 8th International Immersive Projection Technology Workshop (2004) 3. Poupyrev, I., Weghorst, S., Billinghurst, M., Ichikawa, T.: Egocentric Object Manipulation in Virtual Environments: Empirical Evaluation of Interaction Techniques. Computer Graphics Forum 17(3), 41–52 (1998) 4. Bowman, D.A., Johnson, D.B., Hodges, L.F.: Testbed Evaluation of Virtual Environment Interaction Techniques. In: Proceedings of ACM Symposium on Virtual Reality Software and Technology (VRST), pp. 26–33 (1999) 5. Grossman, T., Balakrishnan, R.: The Design and Evaluation of Selection Techniques for 3D Volumetric Displays. In: Proceedings of ACM Symposium on User Interface Software and Technology (UIST), pp. 3–12 (2006)
6. Bowman, D.A., Kruijff, E., LaViola, J.J., Poupyrev, I.: 3D User Interfaces: Theory and Practice. Addison-Wesley, Reading (2004) 7. Olwal, A., Feiner, S.: The Flexible Pointer: An Interaction Technique for Augmented and Virtual Reality. In: Conference Supplement of ACM Symposium on User Interface Software and Technology (UIST), pp. 81–82 (2003) 8. Wyss, H.P., Blach, R., Bues, M.: iSith – Intersection-based Spatial Interaction for Two Hands. In: Proceedings of IEEE Symposium on 3D User Interfaces (3DUI), pp. 59–61 (2006) 9. Liang, J., Green, M.: JDCAD: A Highly Interactive 3D Modeling System. Computers and Graphics 18(4), 499–506 (1994) 10. ForsBerg, A., Herndon, K., Zeleznik, R.: Aperture Based Selection for Immersive Virtual Environments. In: Proceedings of ACM Symposium on User Interface Software and Technology, pp. 95–96 (1996) 11. Frees, S., Kessler, G.D., Kay, E.: PRISM Interaction for Enhancing Control in Immersive Virtual Environments. ACM Transactions on Computer-Human Interaction 14(1), 2 (2007) 12. Kopper, R., Bacim, F., Bowman, D.A.: Rapid and Accurate 3D Selection by Progressive Refinement. In: Proceedings of IEEE Symposium on 3D User Interfaces (3DUI), pp. 67– 74 (2011) 13. Aspin, R., Le, K.H.: Augmenting the CAVE: An Initial Study into Close Focused, Inward Looking, Exploration in IPT Systems. In: Proceedings of IEEE Symposium on Distributed Simulation and Real-Time Applications, pp. 217–224 (2007) 14. Olwal, A., Feiner, S.: Spatially Aware Handhelds for High-Precision Tangible Interaction with Large Displays. In: Proceedings of International Conference on Tangible and Embedded Interaction (TEI), pp. 181–188 (2009) 15. Katzakis, N., Hori, M.: Mobile Devices as Multi-DOF Controllers. In: Proceedings of IEEE Symposium on 3D User Interfaces (3DUI), pp. 139–140 (2010) 16. Kim, J.-S., Gračanin, D., Matković, K., Quek, F.: Finger walking in place (FWIP): A traveling technique in virtual environments. In: Butz, A., Fisher, B., Krüger, A., Olivier, P., Christie, M. (eds.) SG 2008. LNCS, vol. 5166, pp. 58–69. Springer, Heidelberg (2008) 17. Kim, J., Gračanin, D., Matković, K., Quek, F.: iPhone/iPod Touch as Input Devices for Navigation in Immersive Virtual Environments. In: Proceedings of IEEE Conference on Virtual Reality (VR), pp. 261–262 (2009) 18. Song, P., Goh, W.B., Fu, C., Meng, Q., Heng, P.: WYSIWYF: Exploring and Annotating Volume Data with a Tangible Handheld Device. In: Proceedings of ACM Annual Conference on Human Factors in Computing Systems (CHI), pp. 1333–1342 (2011)
Virtual Interrupted Suturing Exercise with the Endo Stitch Suturing Device
Sukitti Punak, Sergei Kurenov, and William Cance
Roswell Park Cancer Institute
Abstract. This paper presents a surgical suturing simulator for wound closure, which is designed for education and training purposes. Currently it is designed specifically to support a simulation of Autosuture™ Endo Stitch™ suturing, but could be extended for other surgical instruments designed for intracorporeal suturing. The simulator allows a trainee to perform a virtual wound closure by interrupted suture with real surgical instrument handles customized to fit on haptic devices. The wound simulation is based on a triangular surface mesh embedded in a linear hexahedral finite element mesh, whereas the suture simulation is based on a simplified Cosserat theory of elastic rods. Our novel heuristic combination of physically-based and control-based simulations makes the simulator run efficiently in real time on mid-level desktop PCs and notebooks.
Fig. 1. A screenshot from the simulator
1 Introduction
Laparoscopic surgeries, including robotic surgeries, often entail the closing of wounds with sutures that require tying knots. However, with robotic and laparoscopic instruments in an intracorporeal environment, suturing and tying knots can be a challenging and time-consuming process. Several suturing devices have been developed that can reduce or eliminate the difficulties and time involved with tying knots in laparoscopic surgeries. However, the rapid development and deployment of novel minimally invasive instruments presents surgical educators with a significant challenge. For example, the Auto Suture™ Endo Stitch™ device (Covidien) (Fig. 2) has been shown to reduce the time required for tying knots and to produce knots of comparable, if not greater, strength than standard laparoscopic knot tying techniques [1]. However, these instruments often require skills significantly different from those used for conventional surgical knot tying. As such, there can be a significant learning curve involved in developing the skills necessary to efficiently and effectively use these new devices. This is unacceptable in today's environment: throughput pressures in the operating room leave little room for delays or even mistakes. This paper describes the wound model, suture and knot tying simulations, implemented into the simulator, which allows a trainee to close a virtual wound with the Endo Stitch™ suturing tool by using an interrupted suturing technique. The interrupted suturing technique is also known as an interrupted stitch, because the individual stitches are not connected to each other. This technique keeps the wound closed even if one suture knot fails. The technique is simple, but placing and tying each stitch individually is time-consuming [2]. The framework is modified, improved, and extended from our previous framework for continuous suturing simulation [3].
Fig. 2. Endo Stitch suturing device
2 The Simulation Framework
The simulator is composed of four main sub-modules (Fig. 3): the Endo Stitch suturing tool attached to the haptic device, the Open wound model, the Suture model, and the Simulation control.
Fig. 3. Simulation diagram
2.1 Endo Stitch Suturing Device
The virtual instrument is created to emulate the shape of the Endo Stitch suturing tool, and this virtual instrument is controlled by the movement of a PHANTOM® Omni haptic device. For collision detection of the virtual instrument with other objects, four bounding cylinders for the shaft, top jaw, bottom jaw, and needle have been created (Fig. 4a). In order to allow a trainee to hold the real device handle during simulation, a real Endo Stitch suturing tool is modified to fit the haptic device. We have modified the surgical instrument with a method similar to that described in [4]. Such modification allows the trainee to manipulate the modified handle in a manner similar to the real suturing instrument, but in a virtual environment. A similar modification is done for a grasper instrument.
Open Wound Model
The simulated open wound model is based on the linear hexahedral finite element method (FEM). The wound is simulated by a triangular surface mesh embedded in a linear hexahedral finite element (FE) mesh similar to the traditional FEM embedded deformation technique mentioned in [5]. This method of embedding the surface mesh in the FE mesh allows us to change the triangle mesh for the wound's surface or the grid resolution of the FE mesh virtually independently of each other. The dynamic equation system of the model's FE mesh is

M\ddot{x} + C\dot{x} + K(x - x_o) = f ,   (1)

where ẍ, ẋ, and x are the accelerations a, velocities v, and positions of all FE mesh nodes, respectively. The displacements of nodes u are replaced by x − x_o, where x_o are the positions of undeformed nodes. M and C are the mass and damping matrices of the FEM model, respectively. The system is discretized with the time step Δt and solved iteratively during simulation by a modified preconditioning conjugate gradient (MPCG) solver [6]. The model's triangular surface mesh is used for collision detection. A sphere bounding volume hierarchy (BVH) for the surface mesh is created for the broad-phase collision detection. Penalty forces are generated based on the penetration depths from the narrow-phase collision detection. These forces are then converted to forces applied to the FE mesh. Therefore, the wound's surface deformation is updated according to the deformation of the FE mesh.
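To make the time discretization concrete, the following C++ sketch performs one implicit-Euler step of Eq. (1) using a plain matrix-free conjugate gradient solver as a simplified stand-in for the MPCG solver of [6]; the callback-based interface and all names are assumptions, not the simulator's actual code.

#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using ApplyOp = std::function<void(const Vec&, Vec&)>;   // writes y = Op * x into a pre-sized y

static double dotv(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Plain conjugate gradient for the SPD system A v = b (simplified stand-in for MPCG).
static void conjugateGradient(const ApplyOp& applyA, const Vec& b, Vec& v,
                              int maxIter = 200, double tol = 1e-8) {
    Vec r(b.size()), p(b.size()), Ap(b.size());
    applyA(v, Ap);
    for (std::size_t i = 0; i < b.size(); ++i) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dotv(r, r);
    for (int k = 0; k < maxIter && rr > tol * tol; ++k) {
        applyA(p, Ap);
        double alpha = rr / dotv(p, Ap);
        for (std::size_t i = 0; i < v.size(); ++i) { v[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rrNew = dotv(r, r);
        double beta = rrNew / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
    }
}

// One implicit-Euler step of  M a + C v + K (x - x0) = f :
// (M + dt C + dt^2 K) v_new = M v + dt (f - K (x - x0)),  then  x += dt v_new.
void stepWound(const ApplyOp& applyM, const ApplyOp& applyC, const ApplyOp& applyK,
               const Vec& x0, const Vec& f, double dt, Vec& x, Vec& v) {
    const std::size_t n = x.size();
    Vec tmp(n), b(n), u(n);
    for (std::size_t i = 0; i < n; ++i) u[i] = x[i] - x0[i];   // current displacements
    applyK(u, tmp);                                            // K (x - x0)
    applyM(v, b);                                              // M v
    for (std::size_t i = 0; i < n; ++i) b[i] += dt * (f[i] - tmp[i]);
    // System operator A = M + dt C + dt^2 K, applied matrix-free.
    ApplyOp applyA = [&](const Vec& in, Vec& out) {
        Vec t1(n), t2(n);
        applyM(in, out);
        applyC(in, t1);
        applyK(in, t2);
        for (std::size_t i = 0; i < n; ++i) out[i] += dt * t1[i] + dt * dt * t2[i];
    };
    conjugateGradient(applyA, b, v);   // warm-started with the previous velocities
    for (std::size_t i = 0; i < n; ++i) x[i] += dt * v[i];
}

Warm-starting the solver with the previous velocities is what makes a small, fixed iteration budget practical at interactive rates.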
2.3 Suture Model
The suture model is based on a simplified Cosserat theory of elastic rods. The model is a simplified version of the CoRdE model [7]. The Cosserat theory states that each material element is composed of centerlines (i.e., mass points) and directors (i.e., orientations). Therefore, the suture model can be discretized into a coupling of a chain of mass points and a chain of orientations. The model becomes a rigid chain of link cylinders. A suture's link is defined by two consecutive mass points. The link's orientation is controlled by the director located at the center of the link. By using calculus of variations, the Lagrangian equation of motion for an elastic rod is

\int_0^1 \left[ \frac{d}{dt}\frac{\partial T}{\partial \dot{g}_i} - \frac{\partial T}{\partial g_i} + \frac{\partial V}{\partial g_i} + \frac{\partial D}{\partial \dot{g}_i} + \lambda \cdot \frac{\partial C_p}{\partial g_i} + \mu \frac{\partial C_q}{\partial g_i} \right] ds = \int_0^1 F_e \, ds ,   (2)

where g_i is the combined coordinates of a centerline and a director, and F_e are external forces and torques, whereas T, V, D, and C are the kinetic, potential, dissipation, and constraint energies of the elastic rod, respectively. We have simplified it to

\int_0^1 \left( \frac{\partial V}{\partial g_i} + \frac{\partial E_c}{\partial g_i} \right) ds = \int_0^1 \left( \frac{\partial V_s}{\partial g_i} + \frac{\partial V_b}{\partial g_i} + \frac{\partial E_c}{\partial g_i} \right) ds = F_s + F_b + F_c = \int_0^1 F_e \, ds .   (3)

The simplification was based on converting the dynamic model (2) to a semi-dynamic model (3) [8]. The discretized version is

F_s[i] + F_b[i] + F_c[i] = F_e[i] ,   (4)
where the stretch F_s and bending F_b forces are computed from centerlines and directors respectively, and the constraint forces F_c are computed from both centerlines and directors. A semi-explicit Euler numerical time integration is used to update the position and orientation of each node i on the model. To render the suture, the combined chain of centerlines and directors is subdivided twice by Chaikin's algorithm, similar to the one described in [9]. A generalized cylinder is generated and rendered for the subdivision chain. The collision detection is implemented with a sphere BVH [10].
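A heavily simplified C++ sketch of the per-node update in Eq. (4) and the semi-explicit Euler integration is given below. Only a basic stretch (elongation penalty) force along the centerline is shown, written as an assumed spring-like expression since the exact force formulas are not reproduced here; bending and constraint forces and the director updates are omitted.

#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
static Vec3 add(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3 scale(Vec3 a, double s) { return { a.x * s, a.y * s, a.z * s }; }
static double length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

struct SutureNode {
    Vec3 pos{}, vel{};
    Vec3 force{};        // accumulated F_s + F_b + F_c + F_e for this time step
    double mass = 1.0;
};

// Accumulate a simple stretch (elongation penalty) force for each link between
// consecutive centerline nodes; bending/constraint forces would be added similarly.
void accumulateStretchForces(std::vector<SutureNode>& nodes, double restLength, double ks) {
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i) {
        Vec3 d = sub(nodes[i + 1].pos, nodes[i].pos);
        double len = length(d);
        if (len <= 0.0) continue;
        Vec3 dir = scale(d, 1.0 / len);
        Vec3 f = scale(dir, ks * (len - restLength));   // pulls the link toward its rest length
        nodes[i].force     = add(nodes[i].force, f);
        nodes[i + 1].force = sub(nodes[i + 1].force, f);
    }
}

// Semi-explicit (symplectic) Euler: update velocities from forces, then positions
// from the new velocities. Orientation (director) updates are omitted here.
void integrate(std::vector<SutureNode>& nodes, double dt) {
    for (SutureNode& n : nodes) {
        n.vel = add(n.vel, scale(n.force, dt / n.mass));
        n.pos = add(n.pos, scale(n.vel, dt));
        n.force = Vec3{ 0.0, 0.0, 0.0 };   // clear the accumulator for the next step
    }
}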
3 Simulation Control
The simulation control communicates directly with the open wound, the suture, knot recognition, and the two haptic devices (Fig. 3). It creates and enforces all constraints based on the interactions among the three sub-modules: the haptic devices, the open wound model, and the suture model. It controls the simulation and rendering of the application, and accepts commands from trainee input. Here we discuss only its three main components, namely collision detection, interaction constraint, and finite state machine (FSM) for knot tying.
3.1 Collision Detection
This component checks for any collisions among the open wound, suture, and tools manipulated by a trainee. It gathers and uses the positions of the tools' bounding volumes, open wound's BVH tree, and suture's BVH tree for collision detection at each time step (Fig. 4a). Each detected collision based on geometry will be converted to external forces sent back to the corresponding models, so that the models can use the forces to adjust the deformation and/or movement to resolve the collision.
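As a minimal illustration of converting a detected penetration into an external force, the following sketch applies a simple penalty proportional to penetration depth along the contact normal; the stiffness constant and types are assumptions.

#include <array>

using Vec3 = std::array<double, 3>;   // x, y, z

// Convert one detected contact (from the narrow-phase test) into a penalty force
// applied to the penetrated model; kPenalty is an assumed tuning constant.
Vec3 penaltyForce(const Vec3& contactNormal, double penetrationDepth, double kPenalty) {
    return { kPenalty * penetrationDepth * contactNormal[0],
             kPenalty * penetrationDepth * contactNormal[1],
             kPenalty * penetrationDepth * contactNormal[2] };
}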
3.2 Interaction Constraint
During a suturing simulation, parts of the suture model have to pass through the wound model via a pair of entry and exit puncture points created by the needle penetrating the wound. The method in [11] defined and used only entry and exit puncture points on a wound. In order to add more reality to the interaction between the suture and the wound, our simulation control's interaction constraint component interpolates inner puncture points located between the entry and exit puncture points based on the suture's link rest length (Fig. 4b). The interaction constraint component sends these puncture points, including the entry and exit puncture points, to the wound model to update the wound's simulation and rendering. This component also associates and maintains a set of the suture's points connected to the puncture points. This includes the control of the suture's movement through the entry/exit passage, when the force applied on the suture at an entry or exit puncture point is greater than a set threshold. Currently, only the forward movement is allowed, since the suturing procedure does not require a backward movement of the suture. This assumption helps reduce the complexity and computation time of the simulation.

Fig. 4. (a) Collision detection: the FE mesh and the device's bounding cylinders; (b) puncture points: the wound's interpolated points
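A plausible reading of the inner-point interpolation above is sketched below in C++: points are placed along the entry-exit segment at multiples of the suture's link rest length. The function and type names are assumptions, not the authors' code.

#include <array>
#include <cmath>
#include <vector>

using Point3 = std::array<double, 3>;

// Interpolate inner puncture points between the entry and exit puncture points,
// spaced by the suture's link rest length; the entry and exit points themselves
// are not included in the returned list.
std::vector<Point3> innerPuncturePoints(const Point3& entry, const Point3& exit,
                                        double linkRestLength) {
    std::vector<Point3> inner;
    double dx = exit[0] - entry[0], dy = exit[1] - entry[1], dz = exit[2] - entry[2];
    double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (linkRestLength <= 0.0 || dist <= linkRestLength) return inner;
    int count = static_cast<int>(dist / linkRestLength);   // number of whole links that fit
    for (int i = 1; i <= count; ++i) {
        double t = (i * linkRestLength) / dist;
        if (t >= 1.0) break;                                // never duplicate the exit point
        inner.push_back({ entry[0] + t * dx, entry[1] + t * dy, entry[2] + t * dz });
    }
    return inner;
}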
3.3 FSM for Knot Tying (by Animation)
The finite-state machine (FSM) is designed to control the state of animated knot tying (Fig. 5a).
Fig. 5. Knot animation: (a) the FSM for knot tying, (b) knot chain rendering (top) and normal rendering (bottom)
There are four ways to tie a (single or double) knot on the wound (Fig. 6). In Fig. 6, RE and LE represent the suture end sticking out on the wound's right and left side, respectively. CW and CCW represent the clockwise and counterclockwise directions of the number of wrapping loops (nwl) on the suture end, respectively. For a single knot the number of nwl is one, while for a double knot it is two. Based on the four ways to tie a knot mentioned above, the knot tying can be divided into four types (Fig. 6)¹: single-left, single-right, double-left, and double-right. The endings, -left or -right, represent the side of the wound that the suture end is sticking out of. Since we assume the open wound is a laceration, the left and right sides of the wound can be clearly identified. Currently the simulation control supports the following knots: square knot (single-left followed by single-right or single-right followed by single-left), granny knot (single-left followed by single-left or single-right followed by single-right), and surgeon's knot (double-left followed by single-right or double-right followed by single-left).
¹ To clearly show the loops, the suture radius was rendered 5 times bigger.
Fig. 6. Four different ways to tie a single knot: (a) RE-CW, (b) RE-CCW, (c) LE-CW, (d) LE-CCW
The FSM detects and marks the number of wrapping loops (nwl) (Fig. 5a). When the nwl is greater than 1 and the distance from the Endo Stitch suturing device's tool tip (ESDTipPos) to the first entry puncture point (FPPPos) is greater than a set threshold, the FSM sends a request to the suture model to create an animated knot. There must be two entry puncture points (and two exit puncture points) — one on each side of the wound — before the knot tying is allowed. The animated knot is created based on the direction of wrapping loops, the number of wrapping loops, and the wound side that the suture end is sticking out of. The animated knot shape is created by constraining a group of the suture points/links to form the defined knot (Fig. 5b). After the animated knot is created, the FSM state moves to the ‘Animate’ state. The suture model sends a message back to the FSM when the knot is tightened. After receiving the message, the FSM sends a confirmation to the suture to lock the animated knot. The animation of the knot is over and the FSM returns to the ‘Ready’ state. The interaction constraint component is also notified to connect the first entry puncture point with the second exit puncture point with a predefined force. It also connects the first exit puncture point with the second entry puncture point with another predefined force. These connections simulate the holding of the tied knot on the wound, and avoid a more complex computation for the interaction between the tied knot and the wound. To allow a knot combination, the simulation control supports the creation of another animated knot on top of a locked knot. The process is similar to when the first knot is created, except the knot is created on top of a locked knot and the knot type (i.e., name) is the combination of both knots (Fig. 5b). After a knot or a combined knot is created, the simulation control allows the trainee to cut the suture with a cutting tool. The cut creates a copy of the knot from the cutting point to the end of the suture. It also resets the main suture, so that the simulation for the next interrupted stitch can start over. To complete the procedure, the trainee has to finish five stitches along the wound (Fig. 7).² Based on the steps in the FSM, an automated virtual coach (for interactive help) was developed to guide the trainee through the procedure. The trainee can choose to use or not use the virtual coach.
² The images were retouched to highlight the suture.
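A condensed C++ sketch of the knot-tying FSM behavior described above is given below; the state names follow the paper's description, while the trigger-condition details, thresholds, and message plumbing are assumptions.

enum class FsmState { Ready, Animate };

enum class WoundSide { Left, Right };     // side where the suture end sticks out
enum class LoopDirection { CW, CCW };     // direction of the wrapping loops

struct KnotRequest {                      // parameters used to build the animated knot
    int numWrappingLoops;                 // nwl: 1 for a single knot, 2 for a double knot
    LoopDirection direction;
    WoundSide side;
};

class KnotTyingFsm {
public:
    // Called each frame with the current number of wrapping loops and the distance
    // from the Endo Stitch tool tip (ESDTipPos) to the first entry puncture point (FPPPos).
    void update(int nwl, LoopDirection dir, WoundSide side,
                double tipToFirstPunctureDistance, bool bothSidesPunctured) {
        if (state_ == FsmState::Ready &&
            nwl >= 1 && bothSidesPunctured &&                     // loop-count trigger is an assumption
            tipToFirstPunctureDistance > distanceThreshold_) {
            pending_ = { nwl, dir, side };
            // ... send a request to the suture model to create the animated knot ...
            state_ = FsmState::Animate;
        }
    }

    // Called when the suture model reports that the animated knot has been tightened.
    void onKnotTightened() {
        if (state_ == FsmState::Animate) {
            // ... confirm/lock the knot on the suture and notify the interaction
            //     constraint component to hold the knot on the wound ...
            state_ = FsmState::Ready;
        }
    }

    FsmState state() const { return state_; }

private:
    FsmState state_ = FsmState::Ready;
    KnotRequest pending_{};               // last requested knot (type is the combination for stacked knots)
    double distanceThreshold_ = 0.05;     // assumed value and units
};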
Fig. 7. The interrupted suture procedure
4 Results and Conclusion
The simulation was tested on a computer running Windows XP 32-bit OS, with an Intel® Core™ i7-940 (2.93 GHz) CPU. The suture was simulated with 100 points. The simulated wound's triangular surface mesh was composed of 2,178 vertices and 4,352 triangles. The wound's linear hexahedral finite element mesh contained 500 nodes and 324 hexahedra. The simulation utilized a combination of physically-based and control-based simulations in order to continue running at an interactive rate. With two instruments — an Endo Stitch suturing device and a grasper — the simulation ran at approximately 20 fps when there were no or minor intersections and at approximately 10 fps with complex collisions and interactions. The simulation results (Fig. 7) demonstrate that the user can perform the wound closure by interrupted suture with the instruments in the virtual world simulated by the developed simulator. In [12], Coles et al. presented an interesting point. At the time of their writing, they concluded that there is no rigorous scientific study showing that a low-cost simulator with three degrees of force feedback is better or worse than a higher cost simulator offering more degrees of force feedback. Our simulator belongs to the low-cost simulator category. We are aiming to create a low-cost and simple simulator that helps users learn the suturing procedure by practicing holding and manipulating the real device handles. This simulator creates a realistic behavior and allows users to be trained in the correct way of working before moving on to a laparoscopic wet lab. We plan to incorporate this simulator into a course for educating and training medical residents on how to use an Endo Stitch suturing device to close a wound or stitch tissues together. A variety of wound shapes and suturing methods can be added into the simulator. The code was written in C++ with object-oriented programming (OOP), so that the core code can be reused, for example in a robotic simulation environment. OpenGL and GLSL APIs were used for the graphics and rendering. wxWidgets was used for creating the graphical user
interface (GUI). Subsequently, we would like to create a more realistic wound surface by applying graphics rendering techniques, for example by adding textures and more complex rendering to it. The next major steps would be to create a surgical simulation framework by extending the developed simulator into a robotic simulation environment and to add special effects, such as blood and smoke, to the created framework.
References

1. Pattaras, J.G., Smith, G.S., Landman, J., Moore, R.G.: Comparison and analysis of laparoscopic intracorporeal suturing devices: preliminary results. Journal of Endourology 15, 187–192 (2001)
2. Sissener, T.: Suture patterns. Companion Animal 11, 14–19 (2006)
3. Punak, S., Kurenov, S.: A simulation framework for wound closure by suture for the Endo Stitch suturing instrument. In: Proceedings of Medicine Meets Virtual Reality (MMVR) 18. Studies in Health Technology and Informatics (SHTI), Long Beach, CA, vol. 163, pp. 461–465. IOS Press, Amsterdam (2011)
4. Kurenov, S., Punak, S., Kim, M., Peters, J., Cendan, J.C.: Simulation for training with the Autosuture Endo Stitch device. Surgical Innovation 13, 1–5 (2006)
5. Nesme, M., Kry, P.G., Jeřábková, L., Faure, F.: Preserving topology and elasticity for embedded deformable models. ACM Trans. Graph. 28, 1–9 (2009)
6. Baraff, D., Witkin, A.: Large steps in cloth simulation. In: SIGGRAPH 1998: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 43–54. ACM, New York (1998)
7. Spillmann, J., Teschner, M.: CoRdE: Cosserat rod elements for the dynamic simulation of one-dimensional elastic objects. In: SCA 2007: Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 63–72. Eurographics Association, Aire-la-Ville (2007)
8. Punak, S., Kurenov, S.: Simplified Cosserat rod for interactive suture modeling. In: Proceedings of Medicine Meets Virtual Reality (MMVR) 18. Studies in Health Technology and Informatics (SHTI), Long Beach, CA, vol. 163, pp. 466–472. IOS Press, Amsterdam (2011)
9. Kubiak, B., Pietroni, N., Ganovelli, F., Fratarcangeli, M.: A robust method for real-time thread simulation. In: VRST 2007: Proceedings of the 2007 ACM Symposium on Virtual Reality Software and Technology, pp. 85–88. ACM, New York (2007)
10. Brown, J., Latombe, J.C., Montgomery, K.: Real-Time Knot-Tying Simulation. The Visual Computer 20(2-3), 165–179 (2004)
11. Berkley, J., Turkiyyah, G., Berg, D., Ganter, M., Weghorst, S.: Real-time finite element modeling for surgery simulation: An application to virtual suturing. IEEE Transactions on Visualization and Computer Graphics 10, 314–325 (2004)
12. Coles, T.R., Meglan, D., John, N.W.: The role of haptics in medical training simulators: A survey of the state of the art. IEEE Transactions on Haptics 4, 51–66 (2011)
New Image Steganography via Secret-Fragment-Visible Mosaic Images by Nearly-Reversible Color Transformation

Ya-Lin Li¹ and Wen-Hsiang Tsai²,³

¹ Institute of Computer Science and Engineering, National Chiao Tung University, Taiwan
² Department of Computer Science, National Chiao Tung University, Taiwan
³ Department of Information Communication, Asia University, Taiwan
Abstract. A new image steganography method is proposed, which creates automatically from an arbitrarily-selected target image a so-called secret-fragment-visible mosaic image as a camouflage of a given secret image. The mosaic image is yielded by dividing the secret image into fragments and transforming their color characteristics to be those of the blocks of the target image. Skillful techniques are designed for use in the color transformation process so that the secret image may be recovered nearly losslessly. The method not only creates a steganographic effect useful for secure keeping of secret images, but also provides a new way to solve the difficulty of hiding secret images with huge data volumes into target images. Good experimental results show the feasibility of the proposed method.
1 Introduction
Steganography is the science of hiding secret messages in cover media so that no one can realize the existence of the secret data [1-2]. Existing steganography techniques may be classified into three categories: image, video, and text steganographies. Image steganography aims to embed a secret message into a cover image with the yielded stego-image looking like the original cover image. Many image steganography techniques have been proposed [1-4], and some of them try to hide secret images behind other images [3-4]. The main issue in these techniques is the difficulty of hiding a huge amount of image data in the cover image without causing intolerable distortions in the stego-image. Recently, Lai and Tsai [5] proposed a new type of computer art image, called the secret-fragment-visible mosaic image, which is the result of random rearrangement of the fragments of a secret image in disguise of another image called the target image, creating exactly an effect of image steganography. The above-mentioned difficulty of hiding a huge volume of image data behind a cover image is solved automatically by this type of mosaic image. In more detail, as illustrated by Fig. 1, a given secret image is first "chopped" into tiny rectangular fragments, and a target image with a similar color distribution is selected from a database. Then, the fragments are arranged in a random fashion controlled by a key to fit into the blocks of the target image, yielding a stego-image with a mosaic appearance. The stego-image preserves all the secret
image fragments in appearance, but no one can figure out what the original secret image looks like. The method is a new way for secure keeping of secret images. However, a large image database is required in order to select a color-similar target image for each input secret image, so that the generated mosaic image can be sufficiently similar to the selected target image. With their method, a user is not allowed to freely select his/her favorite image for use as the target image.
Fig. 1. Illustration of creation of secret-fragment-visible mosaic image proposed in [5]
Accordingly, we propose in this study a new method that creates secret-fragment-visible mosaic images with no need of a database; any image may be selected as the target image for a given secret image. Fig. 2 shows a result yielded by the proposed method. Specifically, after a target image is selected arbitrarily, the given secret image is first divided into rectangular fragments, which are then fit into similar blocks in the target image according to a similarity criterion based on color variations. Next, the color characteristic of each tile image is transformed to be that of the corresponding block in the target image, resulting in a mosaic image which looks like the target image. Such a type of camouflage image can be used for secure keeping of a secret image in disguise of any pre-selected target image. Relevant schemes are also proposed to conduct nearly-lossless recovery of the original secret image.
Fig. 2. A result yielded by proposed method. (a) Secret image. (b) Target image. (c) Secret-fragment-visible mosaic image created from (a) and (b).
In the remainder of this paper, the idea of the proposed method is described in Sections 2 and 3. Detailed algorithms for mosaic image creation and secret image
recovery are given in Section 4. In Section 5, experimental results are presented to show the feasibility of the proposed method, followed by conclusions in Section 6.
2 Basic Idea of Proposed Method
The proposed method includes two main phases: mosaic image creation and secret image recovery. The first phase includes four stages: (1) stage 1.1: fitting the tile images of a given secret image into the target blocks of a pre-selected target image; (2) stage 1.2: transforming the color characteristic of each tile image in the secret image to become that of the corresponding target block in the target image; (3) stage 1.3: rotating each tile image into a direction with the minimum RMSE value with respect to its corresponding target block; and (4) stage 1.4: embedding relevant information into the created mosaic image for future recovery of the secret image. The second phase includes two stages: (1) stage 2.1: extracting the embedded information for secret image recovery from the mosaic image; and (2) stage 2.2: recovering the secret image using the extracted information.
3 Problems and Proposed Solutions for Mosaic Image Creation
The problems encountered in generating mosaic images by the proposed method are discussed in this section, and the proposed solutions to them are also presented.

(A) Color Transformations between Blocks

Suppose that in the first phase of the proposed method, a tile image T in a given secret image is to be fit into a target block B in a pre-selected target image. Since the color characteristics of T and B are different from each other, how to change their color distributions to make them look alike is the main issue here. Reinhard et al. [6] addressed color transfer in this respect, converting the color characteristic of one image to that of another in the lαβ color space. This idea answers the issue and is adopted in this study. But instead of conducting color conversion in the lαβ color space, we do it in the RGB space to reduce the volume of the generated information which should be embedded in the created mosaic image for later recovery of the original secret image. More specifically, let T and B be described as two pixel sets {p1, p2, …, pn} and {p1′, p2′, …, pn′}, respectively, assuming that both blocks are of the same dimensions with size n. Let the color of pixel pi in the RGB color space be denoted by (ri, gi, bi) and that of pi′ by (ri′, gi′, bi′). First, we compute the means and standard deviations of T and B, respectively, in each of the three color channels R, G, and B by the following formulas:
$$\mu_c = \frac{1}{n}\sum_{i=1}^{n} c_i, \qquad \mu_c' = \frac{1}{n}\sum_{i=1}^{n} c_i'; \tag{1}$$

$$\sigma_c = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(c_i-\mu_c)^2}, \qquad \sigma_c' = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(c_i'-\mu_c')^2} \tag{2}$$
where ci and ci′ denote the c-channel values of pixels pi and pi′, respectively, with c denoting r, g, b. Next, we compute new color values (ri′′, gi′′, bi′′) for each pi in T by:
$$c_i'' = \frac{\sigma_c'}{\sigma_c}\,(c_i - \mu_c) + \mu_c', \quad \text{with } c = r, g, b. \tag{3}$$
This results in a new tile image T′ with a new color characteristic similar to that of target block B. Also, we use the following formula, which is the inverse of Eq. (3), to compute the original color values (ri, gi, bi) of pi from the new ones (ri′′, gi′′, bi′′):
$$c_i = \frac{\sigma_c}{\sigma_c'}\,(c_i'' - \mu_c') + \mu_c, \quad \text{with } c = r, g, b. \tag{4}$$
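To make the transformation concrete, here is a minimal NumPy sketch of Eqs. (1)–(4); the array layout and function names are illustrative, and the 8-bit/7-bit quantization of the means and quotients discussed next is omitted.

```python
import numpy as np

def color_transform(tile, target):
    """Map a tile's per-channel color statistics onto those of a target block (Eq. (3)).

    tile, target: float arrays of shape (h, w, 3) holding RGB values in [0, 255].
    Returns the transformed tile and the (mu, mu', q) parameters needed by Eq. (4).
    """
    mu = tile.reshape(-1, 3).mean(axis=0)          # Eq. (1): channel means of T
    mu_t = target.reshape(-1, 3).mean(axis=0)      # Eq. (1): channel means of B
    sigma = tile.reshape(-1, 3).std(axis=0)        # Eq. (2): channel std devs of T
    sigma_t = target.reshape(-1, 3).std(axis=0)    # Eq. (2): channel std devs of B
    q = sigma_t / sigma                            # quotient q_c; must be nonzero (Section 3(A))
    transformed = q * (tile - mu) + mu_t           # Eq. (3)
    return transformed, (mu, mu_t, q)

def inverse_transform(transformed, params):
    """Recover the original tile colors from the transformed ones (Eq. (4))."""
    mu, mu_t, q = params
    return (transformed - mu_t) / q + mu
```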
Furthermore, we have to embed into the created mosaic image sufficient information about the transformed tile image T′ for use in later recovery of the original secret image. For this, theoretically we can use Eq. (4) to compute the original pixel value of pi. But the mean and standard deviation values are all real numbers, and it is not practical to embed real numbers, each with many digits, in the generated mosaic image. Therefore, we limit the number of bits used to represent a mean or a standard deviation. Specifically, for each color channel we allow each of the means of T and B to have 8 bits with values 0–255, and the standard deviation quotient qc = σc′/σc to have 7 bits with values 0.1–12.8. We do not allow qc to be 0 because otherwise the original pixel value cannot be recovered by Eq. (4), since σc/σc′ = 1/qc is not defined when qc = 0, where c = r, g, b.

(B) Choosing Appropriate Target Blocks and Rotating Blocks to Fit Better
In transforming the color characteristic of a tile image T to be that of a corresponding target block B as described above, how to choose an appropriate B for each T (i.e., how to fit each T to a proper B) is an issue. If two blocks are more similar in color distribution originally, a better transformation effect will result. For this, we use the standard deviation of block colors as a measure to select the most similar target block B for each tile image T. First, we compute the standard deviations of every tile image and target block for each color channel. Then, we sort all the tile images to form a sequence, Stile, and all the target blocks to form another, Starget, according to the mean of the standard deviation values of the three colors. Finally, we fit the first tile image in Stile to the first target block in Starget, the second in Stile to the second in Starget, etc. Additionally, after a target block B is chosen for fitting a tile image T and after the color characteristic of T is transformed to be that of B as described above, we conduct a further improvement on the color similarity between the transformed T (denoted as T′) and B by rotating T′ into one of the four directions 0°, 90°, 180° and 270°, which yields a rotated version T′′ of T′ with the minimum RMSE value with respect to B among the four directions; this version is used to fit T into B. Fig. 3 shows an example of the result of applying this scheme to the secret image and target image shown in Figs. 3(a) and 3(b), respectively. Fig. 3(c) is the mosaic image created without applying this block rotation scheme and Fig. 3(d) is the one created with it. We can see that Fig. 3(d) has a better fitting result with a smaller RMSE value than that of Fig. 3(c).
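A small sketch of the fitting strategy just described, assuming NumPy: tiles and target blocks are paired by the rank of their average channel standard deviation, and each tile is rotated into the direction with the smallest RMSE. The helper names are illustrative.

```python
import numpy as np

def average_std(images):
    """Mean of the three per-channel standard deviations for each block."""
    return np.array([img.reshape(-1, 3).std(axis=0).mean() for img in images])

def assign_blocks(tiles, blocks):
    """Pair each tile with a target block by rank of average standard deviation."""
    tile_rank = np.argsort(average_std(tiles))     # tile indices, sorted by avg std
    block_rank = np.argsort(average_std(blocks))   # block indices, sorted by avg std
    mapping = dict(zip(tile_rank, block_rank))     # rank-r tile -> rank-r block
    return [int(mapping[i]) for i in range(len(tiles))]

def best_rotation(tile, block):
    """Rotation angle (0, 90, 180 or 270 degrees) minimizing RMSE against the block."""
    rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
    return min((rmse(np.rot90(tile, k), block), 90 * k) for k in range(4))[1]
```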
Fig. 3. Illustration of effect of rotating tile images before fitting them into target blocks. (a) Secret image. (b) Target image. (c) Mosaic image created from (a) and (b) without block rotations (with RMSE = 36.911 with respect to (b)). (d) Mosaic image created from (a) and (b) with block rotations (with RMSE = 32.382).
(C) Handling Overflows/Underflows in Color Transformation
After the color transformation process between a tile image T and a target block B is conducted as described before, some pixel values in the transformed block T′ might have overflows or underflows. To deal with this problem, we convert such values to non-overflow/non-underflow ones and record the value differences as residuals for use in later recovery of the exact pixel values. Specifically, we convert all the transformed pixel values in T′ not smaller than 255 to be 255, and all of those not larger than 0 to be 0. Next, we compute the differences between the original pixel values and the converted ones, 255 or 0, as the residuals and record them as information associated with T′. But as can be seen from Eq. (3), the bounds of possible residual values are unknown, and this causes a problem in deciding how many bits should be used to record a residual. To solve this problem, we record the residuals in the un-transformed color space rather than in the transformed one. That is, by using the following two formulas we first compute the smallest possible color value cS (with c = r, g, b) in tile image T that becomes larger than 255, as well as the largest possible value cL in T that becomes smaller than 0, after the color transformation process has been conducted:
$$c_S = \left\lceil \frac{1}{q_c}(255 - \mu_c') + \mu_c \right\rceil; \qquad c_L = \left\lfloor \frac{1}{q_c}(0 - \mu_c') + \mu_c \right\rfloor, \tag{5}$$
respectively, where qc = σc′/σc as defined before. Then, for an un-transformed value ci which becomes an overflow after the color transformation, we compute its residual as |ci − cS|; and for an un-transformed ci which becomes an underflow, we compute its residual as |cL − ci|. Now, the possible values for the residuals of ci are all in the range 0–255, so we can simply record each of them with 8 bits.
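The overflow/underflow bookkeeping can be sketched as follows (NumPy assumed); the rounding directions follow Eq. (5), while the array-based representation of the residuals is an assumption.

```python
import numpy as np

def clip_and_residuals(tile, transformed, mu, mu_t, q):
    """Clip overflow/underflow pixels and record residuals, per Section 3(C).

    tile, transformed: float arrays of shape (h, w, 3); mu, mu_t, q: per-channel
    means of T and B and the std-dev quotients sigma'/sigma (shape (3,)).
    """
    c_s = np.ceil((255.0 - mu_t) / q + mu)   # smallest original value that overflows (Eq. (5))
    c_l = np.floor((0.0 - mu_t) / q + mu)    # largest original value that underflows (Eq. (5))
    clipped = np.clip(transformed, 0, 255)
    residuals = np.zeros_like(tile, dtype=float)
    over = transformed >= 255
    under = transformed <= 0
    residuals[over] = np.abs(tile - c_s)[over]    # |c_i - c_S|, fits in 8 bits
    residuals[under] = np.abs(c_l - tile)[under]  # |c_L - c_i|, fits in 8 bits
    return clipped, residuals, over | under
```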
(D) Embedding Secret Image Recovery Information

In order to recover the secret image from the mosaic image, we have to embed relevant recovery information into the mosaic image. For this, we adopt a technique
of reversible contrast mapping proposed by Coltuc and Chassery [7], which is applied to the least significant bits of the pixels in the created mosaic image to hide data. The information required to recover a tile image T which is mapped to a target block B includes: (1) the index of B; (2) the optimal rotation angle of T; (3) the means of T and B and the related standard deviation quotients of all color channels; and (4) the overflow/underflow residuals. These data are coded as binary strings t1t2…tm, r1r2, m1m2…m48, q1q2…q21, and r1…rk, respectively, which together with the binary strings encoding the values m and k are concatenated into a bit stream M for tile image T. Then, such bit streams of all the tile images are concatenated in order into a total bit stream Mt for the entire secret image. Moreover, in order to protect Mt from being attacked, we encrypt it with a secret key to obtain an encrypted bit stream Mt′, which finally is embedded into pixel pairs in the mosaic image using the method proposed in [7]. A plot of the statistics of the numbers of bits required for embedding Mt′ into the mosaic images shown in this paper is given in Fig. 6(b). After embedding the bit stream Mt′ into the mosaic image, we can recover the secret image. But some loss will be incurred in the recovered secret image (i.e., the recovered image is not entirely identical to the original one). The loss occurs in the color transformation process using Eq. (3), where each pixel's color value ci is multiplied by the standard deviation quotient qc = σc′/σc and the resulting real value ci′′ is truncated to an integer in the range of 0 through 255. However, because each truncated part is smaller than 1 when no overflow or underflow occurs, the recovered value of ci using Eq. (4) is still precise enough. Even when overflows/underflows occur at some pixels in the color transformation process, we record their residual values as described previously, and after using Eq. (4) to recover the pixel value ci, we can add the residual values back to the computed pixel values ci to get the original exact pixel data, yielding a nearly-lossless recovered secret image. According to our experimental results, each recovered secret image has a high PSNR value in the range of 45–50 dB with respect to the original secret image, or equivalently, has a very small RMSE value of around just 1.0 with respect to the original secret image, as will be shown later in Section 5.
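The per-tile record can be illustrated with a short packing sketch in pure Python; the field widths for the block index, rotation, means, and quotients follow the description above, while the ordering of the fields and the widths used to store m and k themselves are assumptions.

```python
def to_bits(value, width):
    """Unsigned integer -> fixed-width bit string."""
    return format(value, "0{}b".format(width))

def pack_tile_record(block_index, m, rotation_code, means, quotient_codes, residuals):
    """Pack one tile's recovery data into a bit string (Section 3(D)).

    block_index: index of the target block, coded with m bits.
    rotation_code: 0..3 for 0/90/180/270 degrees (2 bits).
    means: six 8-bit channel means of T and B (48 bits total).
    quotient_codes: three 7-bit quantized std-dev quotients (21 bits total).
    residuals: k 8-bit overflow/underflow residual values.
    """
    bits = [to_bits(m, 8), to_bits(block_index, m), to_bits(rotation_code, 2)]
    bits += [to_bits(v, 8) for v in means]
    bits += [to_bits(v, 7) for v in quotient_codes]
    bits.append(to_bits(len(residuals), 16))   # k; a 16-bit width is assumed here
    bits += [to_bits(v, 8) for v in residuals]
    return "".join(bits)
```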
4 Mosaic Image Creation and Secret Image Recovery Algorithms
Based on the above discussions, detailed algorithms for mosaic image creation and secret image recovery may now be described.

Algorithm 1. Secret-fragment-visible mosaic image creation.
Input: a secret image S with n tile images of size NT; a pre-selected target image T of the same size as S; and a secret key K.
Output: a secret-fragment-visible mosaic image F.
Steps:
Stage 1.1: fitting tile images into target blocks.
1. Divide secret image S into a sequence of n tile images of size NT, denoted as Stile = {T1, T2, …, Tn}; and divide target image T into another sequence of n target blocks also of size NT, denoted as Starget = {B1, B2, …, Bn}.
2. Compute the means (μr, μg, μb) and the standard deviations (σr, σg, σb) of each Ti in Stile for the three color channels according to Eqs. (1) and (2); and compute the average standard deviation σTi = (σr + σg + σb)/3 for Ti, where i = 1 through n.
3. Do similarly to the last step to compute the means (μr′, μg′, μb′), the standard deviations (σr′, σg′, σb′), and the average standard deviation σBj = (σr′ + σg′ + σb′)/3 for each Bj in Starget, where j = 1 through n.
4. Sort the blocks in Stile and Starget according to the average standard deviation values of the blocks; map in order the blocks in the sorted Stile to those in the sorted Starget in a 1-to-1 manner; and reorder the mappings according to the indices of the tile images into a mapping sequence L of the form T1 → Bj1, T2 → Bj2, etc.
5. Create a mosaic image F by fitting the tile images of secret image S to the corresponding target blocks of target image T according to mapping sequence L.
Stage 1.2: performing color conversion between the tile images and target blocks.
6. For each pair Ti → Bji in mapping sequence L, let the means μc and μc′ of Ti and Bji, respectively, be represented by 8 bits with values 0–255 and the standard deviation quotients qc = σc′/σc by 7 bits with values 0.1–12.8, where c = r, g, b.
7. For each pixel pi in each tile image Ti of mosaic image F with color value ci, where c = r, g, b, transform ci into a new value ci′′ by Eq. (3); and if ci′′ is not smaller than 255 (i.e., if an overflow occurs) or if it is not larger than 0 (i.e., if an underflow occurs), assign ci′′ to be 255 or 0, respectively, and compute a residual value for pixel pi in the way described in Section 3(C).
Stage 1.3: rotating the tile images.
8. Compute the RMSE values of each color-transformed tile image Ti in F with respect to its corresponding target block Bji after rotating Ti into the directions 0°, 90°, 180° and 270°; and rotate Ti into the optimal direction θo with the smallest RMSE value.
Stage 1.4: embedding the secret image recovery information.
9. For each tile image Ti in F, construct a bit stream Mi for recovering Ti as described in Section 3(D), including the bit-segments which encode the data items of: (1) the index of the corresponding target block Bji; (2) the optimal rotation angle θo of Ti; (3) the means of Ti and Bji and the related standard deviation quotients of all color channels; (4) the overflow/underflow residual values in Ti; (5) the number m of bits used to encode the index of a block; and (6) the number k of residual values.
10. Concatenate the bit streams Mi of all Ti in F in a raster-scan order to form a total bit stream Mt; use the secret key K to encrypt Mt into another bit stream Mt′; and embed Mt′ into F by reversible contrast mapping [7].
Algorithm 2. Secret image recovery.
Input: a mosaic image F with n tile images and the secret key K used in Algorithm 1.
Output: the secret image S embedded in F using Algorithm 1.
Steps:
Stage 2.1: extracting the secret image recovery information.
1. Extract from mosaic image F the bit stream Mt′ for secret image recovery by a reverse version of the reversible contrast mapping scheme proposed in [7], and decrypt Mt′ using the secret key K into a non-encrypted version Mt.
2. Decompose Mt into n bit streams Mi for the n to-be-constructed tile images Ti in S, respectively, where i = 1 through n.
3. Decode the bit stream Mi of each tile image Ti to obtain the following data: (1) the index ji of the block Bji in F corresponding to Ti; (2) the optimal rotation angle θo of Ti; (3) the means of Ti and Bji and the related standard deviation quotients of all color channels; (4) the overflow/underflow residual values in Ti; (5) the number m of bits used to encode the index of a block; and (6) the number k of residual values.
Stage 2.2: recovering the secret image.
4. Recover one by one in a raster-scan order the tile images Ti, i = 1 through n, of the desired secret image S by the following steps: (1) rotate the block indexed by ji, namely Bji, in F through the optimal angle θo and fit the resulting content into Ti to form an initial tile image Ti; (2) use the extracted means and related standard deviation quotients to recover the original pixel values in Ti according to Eq. (4); (3) use the extracted means, standard deviation quotients, and Eq. (5) to compute the two parameters cS and cL; and (4) scan Ti to find pixels with values 255 or 0, which indicate that overflows or underflows occurred there, and add the values cS or cL, respectively, to the corresponding residual values of the found pixels, resulting in a final tile image Ti.
5. Compose all the final tile images to form the desired secret image S as output.
The time complexity of Algorithm 1 is O(n log n) because the running time is dominated by Step 4, sorting the blocks in Stile and Starget. The time complexity of Algorithm 2 is O(nNT) because it just extracts the embedded information and recovers the secret image with the extracted data.
5 Experimental Results
An experimental result is shown in Fig. 4, where Fig. 4(c) shows the created mosaic image using Fig. 4(a) of size 1024×768 as the secret image and Fig. 4(b) of the same size as the target image. The tile image size is 8×8. The recovered secret image using a correct key is shown in Fig. 4(d), which is quite similar to the original secret image shown in Fig. 4(a). It has PSNR = 48.597 and RMSE = 0.948 with respect to the secret image. In fact, it is difficult for a human to perceive the difference between two images when the PSNR is larger than 30 or when the RMSE is close to 1.0. It is noted, by the way, that all other experimental results shown in this paper have PSNR values larger than 47 and RMSE values close to 1.0, as seen in Figs. 6(c) and 6(d). Returning to the results shown in Fig. 4, Fig. 4(e) shows the recovered secret image using a wrong key, which is a noise image. Figs. 4(f) through 4(h) show more results using different tile image sizes. It can be seen from the figures that the created mosaic image retains more details of the target image when the tile images are smaller. Fig. 6(a) also shows this fact in a similar way: mosaic images created with smaller tile image sizes have smaller RMSE values with respect to the target image. However, even when the tile image size is large (e.g., 32×32), the created mosaic image still looks quite similar to the target image. On the other hand, the number of required bits embedded for recovering the secret image increases when the tile image becomes smaller, as can be seen from Fig. 6(b).
Fig. 4. An experimental result of secret-fragment-visible mosaic creation. (a) Secret image. (b) Target image. (c) Mosaic image created with tile image size 8×8. (d) Recovered secret image using a correct key, with PSNR = 48.597 and RMSE = 0.948 with respect to secret image (a). (e) Recovered secret image using a wrong key. (f)-(h) Mosaic images created with different tile-image sizes 16×16, 24×24, 32×32.
Fig. 5 shows a comparison of the results yielded by the proposed method and by the method proposed by Lai and Tsai [5], where Figs. 5(a) and 5(f) are the input secret images and Figs. 5(b) and 5(g) are the selected target images; Figs. 5(c) and 5(h) were created by Lai and Tsai [5]; and Figs. 5(d) and 5(i) were created by the proposed method. Also, Figs. 5(e) and 5(j) show the recovered secret images. It can be seen that the mosaic images created by the proposed method have smaller RMSE values with respect to the target images, implying that they are more similar to the target images. More importantly, the proposed method allows users to select their favorite images for use as target images. This provides great flexibility in practical applications without the need to maintain a target image database, which usually is very large if mosaic images with high similarities to target images are to be generated. It is also noted that both recovered secret images shown in Figs. 5(e) and 5(j) have RMSE values close to 1.0 with respect to the respective secret images, indicating that they are very close to the original secret images in appearance. Moreover, we conducted experiments on a large data set with 127 different secret image and target image pairs, and the result is included in Fig. 6 (as orange curves).
Fig. 5. Comparison of results of Lai and Tsai [5] and proposed method. (a) Secret image. (b) Target image. (c) Mosaic image created by method proposed by Lai and Tsai [5] with RMSE=47.651. (d) Mosaic image created by proposed method with RMSE = 33.935. (e) Recovered secret image with RMSE=0.993 with respect to secret image (a). (f) Secret image of another experiment. (g) Target image. (h) Mosaic image created by Lai and Tsai [5] with RMSE=38.036. (i) Mosaic image created by proposed method with RMSE=27.084. (j) Recovered secret image with RMSE=0.874 with respect to secret image (f).
6 Conclusions
A new image steganography method has been proposed, which not only can be used for secure keeping of secret images but also offers a new option for solving the difficulty of hiding images with huge data volumes behind cover images. By the use of proper pixel color transformation as well as skillful handling of overflows/underflows in the converted pixel colors, secret-fragment-visible mosaic images with high similarity to arbitrarily-selected target images can be created with no need of a target image database, and the original secret images can be recovered nearly losslessly from the created mosaic images. Good experimental results have shown the feasibility of the proposed method. Future studies may be directed to applying the proposed method to images of color models other than RGB.
Fig. 6. Plots of trends of various parameters versus different tile image sizes (8×8, 16×16, 24×24, 32×32) with input secret images all shown previously and a large data set with 127 different secret image and target image pairs. (a) RMSE values of created mosaic images with respect to target images. (b) Numbers of required bits embedded for recovering secret images. (c) PSNR values of recovered secret images with respect to original ones. (d) RMSE values of recovered secret images with respect to original ones.
References

1. Bender, W., Gruhl, D., Morimoto, N., Lu, A.: Techniques for Data Hiding. IBM System Journal 35, 313–336 (1996)
2. Petitcolas, F.A.P., Anderson, R.J., Kuhn, M.G.: Information Hiding - A Survey. Proceedings of the IEEE 87(7), 1062–1078 (1999)
3. Thien, C.C., Lin, J.C.: A Simple and High-Hiding Capacity Method for Hiding Digit-by-Digit Data in Images Based on Modulus Function. Pattern Recognition 36, 2875–2881 (2003)
4. Wang, R.Z., Chen, Y.S.: High-Payload Image Steganography Using Two-Way Block Matching. IEEE Signal Processing Letters 13(3), 161–164 (2006)
5. Lai, I.J., Tsai, W.H.: Secret-Fragment-Visible Mosaic Image - A New Computer Art and Its Application to Information Hiding. Accepted and to appear in IEEE Transactions on Information Forensics and Security (2011)
6. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color Transfer between Images. IEEE Computer Graphics and Applications 21(5) (2001)
7. Coltuc, D., Chassery, J.-M.: Very Fast Watermarking by Reversible Contrast Mapping. IEEE Signal Processing Letters 14(4), 255–258 (2007)
Adaptive and Nonlinear Techniques for Visibility Improvement of Hazy Images

Saibabu Arigela and Vijayan K. Asari

Computer Vision and Wide Area Surveillance Laboratory, Department of Electrical and Computer Engineering, University of Dayton, Dayton, Ohio
Abstract. In outdoor video processing systems, the image frames of a video sequence are usually subject to poor visibility and contrast in hazy or foggy weather conditions. A fast and efficient technique to improve the visibility and contrast of digital images captured in such environments is proposed in this paper. The image enhancement algorithm comprises three processes, viz. dynamic range compression, local contrast enhancement, and nonlinear color restoration. We propose a nonlinear function to modify the wavelet coefficients for dynamic range compression and use an adaptive contrast enhancement technique in the wavelet domain. A nonlinear color restoration process based on the chromatic information of the input image frame is applied to convert the enhanced intensity image back to a color image. We also propose a model-based image restoration approach which uses a new nonlinear transfer function on the luminance component to obtain the transmission map. Experimental results show better visibility compared to images enhanced with other state-of-the-art techniques.
1 Introduction

In recent years, the number of outdoor cameras used for applications such as traffic monitoring, weather observation, video surveillance, security, and law enforcement has proliferated. The images captured by these cameras in bad weather conditions suffer from poor visibility, which adversely impacts the performance of vision systems. So, in the image processing and computer vision fields, improving the visibility and features of weather-degraded images has been an area of considerable attention and research. The human eye can view scenes that possess a dynamic range much greater than that captured by conventional display devices. When we compare an eye's pupil and a camera's aperture, the latter has the limitation of being fixed when a scene is captured, whereas the former has the freedom of allocating various intensity levels to various parts of a scene. Hence, displaying a high dynamic range image on a conventional display device results in a locally poor-contrast image. There are some exceptions, such as bad weather conditions like haze, fog, snow, and rain, where the captured images and direct observation exhibit a close parity [1]. The extremely narrow dynamic range of such scenes leads to extremely low contrast in the captured images. Many image processing algorithms have been developed to deal with images captured in such poor weather conditions. The conventional techniques are histogram
equalization, local histogram equalization, and adaptive histogram equalization. Contrast Limited Adaptive Histogram Equalization (CLAHE), proposed by Pizer [2], limits the noise enhancement by establishing a maximum value. It is successful for medical imaging applications but is not effective on degraded color images. Retinex-based algorithms are efficient techniques for dynamic range compression and color constancy. Jobson et al. [3] proposed a method named MSRCR (Multi-Scale Retinex with Color Restoration), which can evidently enhance the dark regions of an input image but performs poorly on severely fogged images. Turning to physics- or optics-based models, the scattering of additive light caused by haze or fog particles is termed airlight; its effect increases exponentially with distance and degrades the visibility in the captured image with poor contrast and distorted color [4]. Narasimhan and Nayar [5] estimated the properties of the transmission medium by analyzing multiple images of the same scene taken in different weather conditions. Under the assumption that the transmission and surface shading are locally uncorrelated, Fattal [6] used a single image to estimate the albedo of the scene and then infer the medium transmission. Observing that a haze-free image must have higher contrast than the input haze image, Tan [7] removes the haze by maximizing the local contrast of the restored image. He et al. [8] observed that haze-free outdoor images contain local patches with some pixels of very low intensity in at least one color channel. This statistical observation is called the dark channel prior and is used to remove the haze in an image. These methods are found to be slow for real-time applications. The proposed image enhancement and image restoration techniques require less processing time. They provide dynamic range compression while preserving local contrast and tonal rendition, which makes them good candidates for improving the performance of outdoor video processing systems. This paper is organized as follows: Section 2 describes the proposed wavelet-based image enhancement algorithm and Section 3 describes the model-based image restoration technique. Experimental results and analysis are described in Section 4 and the conclusions in Section 5.
2 Nonlinear Technique for Image Enhancement

This algorithm for the enhancement of hazy images consists of three major constituents, namely dynamic range compression, adaptive contrast enhancement, and nonlinear color restoration. The first two processes are performed in the wavelet domain and the third one in the spatial domain. A block schematic representation of the proposed algorithm is shown in Fig. 1. The original color image is converted to an intensity image using the NTSC standard method, defined as
$$I(x, y) = \frac{76.245 \times R + 149.6851 \times G + 29.07 \times B}{255} \tag{1}$$
where R, G, and B are the red, green, and blue components, respectively.

2.1 Wavelet Based Dynamic Range Compression

We choose the discrete wavelet transform for dimensionality reduction, such that dynamic range compression with local contrast enhancement is performed only on the
approximation coefficients. These are obtained by low-pass filtering and downsampling the original intensity image. First, the intensity image is decomposed using an orthonormal wavelet transform as in Eq. (2):
$$I(x,y) = \sum_{k,l \in \mathbb{Z}} a_{J,k,l}\,\phi_{J,k,l}(x,y) + \sum_{j \ge J}\sum_{k,l \in \mathbb{Z}} d^{h}_{j,k,l}\,\varphi^{h}_{j,k,l}(x,y) + \sum_{j \ge J}\sum_{k,l \in \mathbb{Z}} d^{v}_{j,k,l}\,\varphi^{v}_{j,k,l}(x,y) + \sum_{j \ge J}\sum_{k,l \in \mathbb{Z}} d^{d}_{j,k,l}\,\varphi^{d}_{j,k,l}(x,y) \tag{2}$$
where $a_{J,k,l}$ are the approximation coefficients at scale J with corresponding scaling functions $\phi_{J,k,l}$, and $d_{j,k,l}$ are the detail coefficients at each scale with corresponding wavelet functions $\varphi_{j,k,l}(x,y)$.
Fig. 1. Block diagram of the proposed algorithm
Multi-windowed inverse sigmoid. The approximation coefficients $a_{J,k,l}$ at scale J are normalized to the range [0, 10] and then mapped to the range [0, 1] using a specifically designed nonlinear function with parameters α and β, as given in Eq. (3):
$$\bar{a}_{J,k,l} = \frac{1}{1 + e^{-\alpha\, a'_{J,k,l}}} + \frac{1}{1 + e^{-\beta\,(a'_{J,k,l} - 10)}} \tag{3}$$
where $a'_{J,k,l}$ are normalized coefficients obtained as
$$a'_{J,k,l} = \frac{1}{25.5}\, a_{J,k,l}. \tag{4}$$
α and β are curvature parameters which tune the shape of the two-sided multi-windowed inverse sigmoid. The nonlinearity for various values of α and β is depicted in Fig. 2(a). The value of α improves the brightness of low-lighting regions, and β pulls down the lightness caused by haze or fog. We proposed this nonlinear function in the spatial domain in [9] for enhancing images captured under non-uniform lighting conditions. A wavelet-coefficient modification for contrast enhancement was proposed for medical imaging applications in [11].
Fig. 2. Proposed nonlinear functions: (a) MWIS function; (b) sine nonlinear function.
Sine nonlinear function. The approximation coefficients are normalized to the range [0, 1] and mapped to the same range using the sine nonlinear function given in Eq. (5):
$$\bar{a}_{J,k,l} = \sin^{q}\!\left(\frac{\pi}{2}\, a'_{J,k,l}\right) \tag{5}$$
where $a'_{J,k,l}$ are normalized coefficients obtained as
$$a'_{J,k,l} = \frac{1}{255}\, a_{J,k,l}. \tag{6}$$
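For reference, the two mapping operators of Eqs. (3)–(6), as reconstructed above, can be written in a few lines of NumPy; α, β, and q default to the values reported in Section 4, and the normalization constants assume approximation coefficients in roughly the 8-bit range.

```python
import numpy as np

def mwis_map(a, alpha=0.6, beta=0.9):
    """Multi-windowed inverse sigmoid mapping (Eqs. (3)-(4))."""
    a_norm = a / 25.5                                  # normalize to about [0, 10]
    return 1.0 / (1.0 + np.exp(-alpha * a_norm)) + \
           1.0 / (1.0 + np.exp(-beta * (a_norm - 10.0)))

def sine_map(a, q=1.6):
    """Sine nonlinear mapping (Eqs. (5)-(6))."""
    a_norm = a / 255.0                                 # normalize to about [0, 1]
    return np.sin(a_norm * np.pi / 2.0) ** q
```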
The value of q acts well in pulling down the high intensity values caused by haze or fog. Applying one of the mapping operators (nonlinear functions) to the approximation coefficients and taking the inverse wavelet transform alone would result in a compressed dynamic range with a significant loss of contrast.

2.2 Adaptive Local Contrast Enhancement
The local contrast enhancement is based on a multi-scale Gaussian neighborhood of the original intensity image, which is obtained in the wavelet domain by local averaging
of the original intensity image. Let us denote $a_{J,k,l}$ as A, $a'_{J,k,l}$ as A′, $\bar{a}_{J,k,l}$ as $\bar{A}$, and the corresponding normalized local mean of the approximation coefficients as $A^{m}$. The contrast-enhanced coefficients $A_{cntr}$, which will replace the original coefficients A, are obtained as
$$A_{cntr} = 255\,\bar{A}^{\,p/2^{J}} \tag{7}$$
The parameter p is adaptively estimated based on the neighborhood mean coefficients $A^{m}$ as given in Eq. (8):
$$p = \frac{A^{m}}{4\,(1 - A^{m} + \sigma)} + \varepsilon \tag{8}$$
where $A^{m} = A' * G$, G is the multi-scale Gaussian function, and ε and σ are empirical parameters.
Detail coefficients modification. The detail coefficients are modified using the ratio between the enhanced and original approximation coefficients. This ratio is applied as an adaptive gain mask:
$$D^{h}_{cntr} = \frac{A_{cntr}}{A}\, D^{h}; \qquad D^{v}_{cntr} = \frac{A_{cntr}}{A}\, D^{v}; \qquad D^{d}_{cntr} = \frac{A_{cntr}}{A}\, D^{d} \tag{9}$$
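The contrast-enhancement step can be sketched as follows, under the reconstruction of Eqs. (7)–(9) given above; a single-scale Gaussian surround stands in for the multi-scale Gaussian mean, and the band names and epsilon guard are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_level(a_mapped, a_norm, details, level=1, sigma_g=5.0,
                  eps=0.1389, sigma_p=0.1):
    """Contrast-enhance the mapped approximation band and rescale the detail bands.

    a_mapped: coefficients after the nonlinear mapping (A-bar), roughly in [0, 1].
    a_norm:   normalized original approximation coefficients (A').
    details:  dict with the horizontal/vertical/diagonal detail bands D^h, D^v, D^d.
    """
    a_mean = gaussian_filter(a_norm, sigma_g)             # A^m: local Gaussian mean (one scale here)
    p = a_mean / (4.0 * (1.0 - a_mean + sigma_p)) + eps   # Eq. (8)
    a_cntr = 255.0 * a_mapped ** (p / 2.0 ** level)       # Eq. (7), exponent as reconstructed above
    gain = a_cntr / (255.0 * a_norm + 1e-6)               # A_cntr / A, with A ~ 255 * A' from Eq. (6)
    return a_cntr, {k: gain * d for k, d in details.items()}   # Eq. (9): adaptive gain on details
```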
The inverse discrete wavelet transform is used to reconstruct the image with the modified approximation and detail coefficients at level 1. If the wavelet decomposition is carried out for more than one level, this procedure is repeated at each level.

2.3 Nonlinear Color Restoration
A nonlinear color restoration approach given in Eq. (10) is employed to obtain the final color image:
$$I_{enh,i} = \gamma_i\, I_{enh}; \qquad \gamma_i = \left(\frac{I_i(x,y)}{\max\!\left(I_i(x,y)\right)}\right)^{\delta} \tag{10}$$
where $I_{enh}$ is the reconstructed image with modified approximation and detail coefficients, $I_{enh,i}$ are the R, G, B values of the enhanced color image, and $I_i(x, y)$ are the R, G, B values of the input color image. δ is a canonical gain factor which increases the color saturation, resulting in a more appealing color rendition. Since the coefficients are normalized during the enhancement process, the enhanced intensity image obtained by the inverse transform of the enhanced coefficients, and hence the enhanced color image, spans only the lower half of the full range of the histogram. So, clipping the upper half of the histogram and stretching the entire range in each channel gives the best results in converting the output to the display domain.
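For completeness, a minimal sketch of Eq. (10) in NumPy; the per-channel maximum used for normalization and the value of δ are assumptions, and the histogram clipping/stretching step described above is omitted.

```python
import numpy as np

def restore_color(i_enh, rgb_in, delta=1.2):
    """Nonlinear color restoration of Eq. (10).

    i_enh:  enhanced intensity image, shape (h, w).
    rgb_in: input color image, shape (h, w, 3), float values in [0, 255].
    delta:  canonical gain factor controlling saturation (value assumed here).
    """
    # gamma_i = (I_i / max(I_i))^delta, using the per-channel maximum (an assumption)
    gamma = (rgb_in / rgb_in.max(axis=(0, 1))) ** delta
    return gamma * i_enh[..., None]   # broadcast the enhanced intensity over the channels
```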
80
S. Arigela and V.K. Asari
3 Model Based Approach

In computer vision, the optics- or physics-based model of image formation in bad weather described in [10] is
$$I(x) = J(x)\,t(x) + A\,(1 - t(x)) \tag{11}$$
where I is the input haze image, J is the restored haze-free image, A is the global atmospheric light, and t is the medium transmission. The objective is to recover J, A, and t from I. The term J(x)t(x) in Eq. (11) is called the direct attenuation and the second term A(1 − t(x)) is called the airlight. The transmission t in a homogeneous atmosphere is
$$t(x) = e^{-kd} \tag{12}$$
where k is the atmospheric attenuation coefficient and d is the distance between an object in the image and the observer. He's approach [8] uses the dark channel prior to obtain the transmission map, which is an alpha map with clear edge outlines and the depth layering of the scene objects. The proposed method uses the nonlinear transfer function shown in Eq. (3) on the intensity component of the image to obtain an equivalent form of the transmission map as in [8]. The value of α is varied based on the luminance value of the pixel, as given in Eq. (13), and the value of β is a constant (0.5). The luminance component is obtained by a multi-scale Gaussian mean, which preserves features by adjusting different scales.
$$\alpha = \begin{cases} 0.5, & L \le 50 \\ \dfrac{L-50}{100} + 0.5, & 50 < L \le 150 \\ 1.5, & L > 150 \end{cases} \tag{13}$$
where L is the luminance level corresponding to a cumulative distribution function (CDF) value of 1. The global atmospheric light constant A can be obtained from the pixels which have the highest intensity in the transmission image and the corresponding R, G, B channels. In order to restore the details near the outlines of scene objects, a median filter is applied to the modified transmission image. The advantage of this method over other model-based methods is that it requires less processing time.
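To make the pipeline concrete, here is a rough NumPy sketch of Eqs. (11)–(13); the use of the MWIS-style transfer function on the smoothed luminance as the transmission estimate, the clipping range, and the brightest-0.1% rule for estimating A are all simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def alpha_from_luminance(L):
    """Piecewise alpha of Eq. (13) from the luminance level L (0-255)."""
    return np.where(L <= 50, 0.5,
           np.where(L <= 150, (L - 50.0) / 100.0 + 0.5, 1.5))

def dehaze(img):
    """Restore J from I = J*t + A*(1 - t) (Eq. (11)); img is float RGB in [0, 255]."""
    lum = gaussian_filter(img.mean(axis=2), 5.0)       # smoothed luminance component
    alpha = alpha_from_luminance(lum)
    beta = 0.5                                         # constant, per the text
    x = lum / 25.5                                     # normalize to about [0, 10]
    t = 1.0 / (1.0 + np.exp(-alpha * x)) + 1.0 / (1.0 + np.exp(-beta * (x - 10.0)))
    t = np.clip(median_filter(t, size=5), 0.1, 1.0)    # median filter near object outlines
    bright = lum >= np.percentile(lum, 99.9)           # brightest 0.1% of the smoothed luminance
    A = img[bright].mean(axis=0)                       # global atmospheric light estimate
    return (img - A) / t[..., None] + A                # invert Eq. (11)
```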
4 Results and Analysis

The proposed algorithms were tested with several images which have hazy and moderately foggy regions. Based on several experiments, the MWIS function parameters α = 0.6 and β = 0.9 and the single parameter of the sine nonlinear function, q = 1.6, provide better results. Both functions provide good results for aerial hazy images. The algorithm works well for images captured in different outdoor hazy/foggy weather conditions. All the results shown in this paper are obtained with J = 1, ε = 0.1389, σ = 0.1.
Fig. 3. Performance comparison: (a) Original image (b) AHE (c) MSRCR (d) MWIS (e) Sine nonlinear method
Fig. 4. Performance comparison: (a) Original image (b) AHE (c) MSRCR (d) MWIS (e) Sine nonlinear (f) Proposed model based
Figs. 3 and 4 show comparisons of the proposed algorithms with the image enhancement techniques adaptive histogram equalization (AHE) and MSRCR. The MSRCR-enhanced images are obtained using the auto-levels with high-contrast mode settings of PhotoFlair, a commercial software package. Fig. 3 shows a hazy region at the centre and non-hazy regions at the sides. AHE has many color artifacts around the edges. MSRCR with auto-levels enhances only the side regions, while the hazy region remains the same. MWIS, the sine nonlinear function, and the proposed model-based method perform well in improving the visibility in both regions. The second example is a scene with regions at different depths and non-hazy regions closer to the camera. AHE and MSRCR with auto-levels are good for the non-hazy regions. The proposed techniques with the sine nonlinear function and the model-based approach work well in this case. Fig. 4 shows that the proposed sine nonlinear function and model-based approach provide better features and visibility than the other techniques. Fig. 5 shows a comparison with Fattal's method, which restores good color but leaves artifacts, whereas the proposed approach performs well. Fig. 6 shows the comparison with Tan's method, which works well for both the
Fig. 5. Performance comparison with Fattal’s method [6]: (a) Original image (b) Fattal’s method (c) Proposed model based
Fig. 6. Performance comparison with Tan’s method [7]: (a) Original image (b) Tan’s method (c) Proposed model based
Fig. 7. Performance comparison with He’s method [8]: (a) Original image (b) He’s method (c) Proposed model based
Fig. 8. Performance comparison: (a) Original image (b) Fattal’s (c) Tan’s (d) He’s (e) proposed model based
regions except for some color artifacts. Fig. 7 shows the comparison with He's approach, which has good features and color restoration; the proposed algorithm also works well in this case. Fig. 8 shows the comparison with all three methods using two examples; the proposed model-based method provides better performance than Fattal's and Tan's and equal performance to He's approach. So, the images enhanced with the MWIS and sine nonlinear functions perform well compared to the existing image enhancement methods, and the model-based image restoration technique provides better or equal visibility and rendition compared to the other techniques.

Quantitative evaluation. To quantitatively assess the performance of these methods, we use the visible edge segmentation method proposed in [12]. This method compares the input and restored gray-level images using the indicators e (newly visible edges after restoration), r̄ (average visibility enhancement after restoration), and Σ (percentage of pixels completely black after restoration). The selection of visible edges in the image before and after enhancement is estimated with 5% contrast thresholding. The aim is to increase the contrast without losing visual information. High values of e and r̄ and low values of Σ describe good results. The comparisons are shown in Table 1. The proposed sine nonlinear method has high values of e and r̄ among the traditional techniques, and the proposed model-based method's e, r̄, and Σ values are almost equal to those of the other existing model-based methods.

Table 1. Quantitative evaluation: visible edges, ratio of average gradient, and percentage of pixels completely black after restoration
[Table 1: values of e, r̄, and Σ for the original image and for MSRCR, Fattal's, Tan's, He's, the MWIS, the sine nonlinear, and the proposed model-based methods, evaluated on the images of Figs. 5(a), 6(a), and 7(a).]
5 Conclusion

A new wavelet-based image enhancement technique that provides dynamic range compression with two nonlinear functions while preserving local contrast and tonal rendition, together with a model-based image restoration algorithm for haze/fog removal, has been developed to improve the visual quality of digital images captured in hazy/foggy weather conditions. The parameters provide flexibility in tuning the nonlinear curves for enhancing the different image frames of a video. These algorithms can be applied to improve the performance of video surveillance and object recognition in hazy or foggy environments. The results obtained from a large variety of hazy/foggy
images show strong robustness, high image quality, and improved visibility, indicating promise for aerial imagery and video surveillance during poor weather conditions.
References

[1] Jobson, D.J., Rahman, Z., Woodell, G.A., Hines, G.D.: A Comparison of Visual Statistics for the Image Enhancement of FORESITE Aerial Images with Those of Major Image Classes. In: Visual Information Processing XV, Proceedings of SPIE, vol. 6246, pp. 1–8 (2006)
[2] Pizer, S.M.: Adaptive Histogram Equalization and Its Variations. Computer Vision, Graphics, and Image Processing, pp. 335–368 (1987)
[3] Jobson, D.J., Rahman, Z., Woodell, G.A.: A multi-scale retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image Processing, 965–976 (1997)
[4] Oakley, J.P., Satherley, B.L.: Improving image quality in poor visibility conditions using a physical model for contrast degradation. IEEE Transactions on Image Processing, 165–169 (1998)
[5] Narasimhan, S.G., Nayar, S.K.: Contrast restoration of weather degraded images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(6), 713–724 (2003)
[6] Fattal, R.: Single image dehazing. ACM Transactions on Graphics (SIGGRAPH) 27, 1–9 (2008)
[7] Tan, R.: Visibility in bad weather from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
[8] He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1956–1963 (2009)
[9] Asari, K.V.K., Oguslu, E., Arigela, S.: Nonlinear enhancement of extremely high contrast images for visibility improvement. In: Kalra, P.K., Peleg, S. (eds.) ICVGIP 2006. LNCS, vol. 4338, pp. 240–251. Springer, Heidelberg (2006)
[10] McCartney, E.J.: Optics of the Atmosphere: Scattering by Molecules and Particles, pp. 23–32. John Wiley and Sons, New York (1976)
[11] Laine, A.F., Schuler, S., Jian, F., Huda, W.: Mammographic feature enhancement by multiscale analysis. IEEE Transactions on Medical Imaging 13(4) (1994)
[12] Hautiere, N., Tarel, J.P., Aubert, D., Dumont, E.: Blind contrast enhancement assessment by gradient ratioing at visible edges. Image Analysis & Stereology Journal 27(2), 87–95 (2008)
Linear Clutter Removal from Urban Panoramas

Mahsa Kamali¹, Eyal Ofek², Forrest Iandola¹, Ido Omer², and John C. Hart¹

¹ University of Illinois at Urbana-Champaign, USA
² Microsoft Research, USA
Abstract. Panoramic images capture cityscapes of dense urban structures by mapping multiple images from different viewpoints into a single composite image. One challenge to their construction is that objects that lie at different depths are often not stitched correctly in the panorama. The problem is especially troublesome for objects occupying large horizontal spans, such as telephone wires, crossing multiple photos in the stitching process. Thin lines, such as power lines, are common in urban scenes but are usually not selected for registration due to their small image footprint. Hence stitched panoramas of urban environments often include "dented" or "broken" wires. This paper presents an automatic scheme for detecting and removing such thin linear structures from panoramic images. Our results show significant visual clutter reduction from municipal imagery while keeping the original structure of the scene and visual perception of the imagery intact.
1 Introduction

Multi-perspective panoramic imaging produces visual summaries of scenes that are difficult to capture in a camera's limited field of view. As a result, multi-perspective panoramas have seen increasing popularity in navigation and sightseeing consumer applications. For example, Microsoft Street Slide renders multi-perspective panoramas in real time, thus enabling an interactive urban sightseeing experience [12]. We show an example Street Slide urban panorama in Figure 1.
Fig. 1. Panorama of a Long Street [12]
Until automatic multi-perspective panorama production methods were developed, panorama production typically relied on single-perspective, orthographic projections. In single-perspective panoramas, each point in the world is mapped to the closest point in the panorama's plane [Szeliski 2006]. As a result, single-perspective panoramas suffer from the unnatural effect that far-away objects and close-up objects appear at the same scale [Pulli 2010]. This effect is particularly apparent in long panoramas of city streets. Multi-perspective panoramas avoid this unnatural effect by stitching images from disparate viewpoints in a panorama [Rav-Acha]. Each portion of a multi-perspective panorama looks like a natural-perspective view of the scene, though the panorama as a whole does not adhere to a single linear perspective [Agarwala 2006, Vallance 2001].

In the last few years, the computer vision community has made significant strides in automating the production of multi-perspective panoramas. In 2004, Roman et al. developed a system that relied on some human interaction to produce multi-perspective panoramas [Roman et al. 2004]. By 2006, Roman and Lensch succeeded in automating this process [Roman and Lensch 2006]. Automatic multi-perspective panorama production involves stitching images together along seams that best merge overlapping features [Szeliski 2006]. Toward this goal, stitching techniques prioritize large objects with low depth variances (such as building facades), isolated objects, and objects with small horizontal spans (such as poles and people). However, smaller objects that lie at a different depth can confound stitching, and appear broken or multiple times in the panorama. In Fig. 2 (top), the panoramic image shows a smooth stitching of the facades, but power lines, which are at different depths, are distorted. Fig. 2 (bottom) demonstrates how removing linear clutter such as power lines enhances the quality of panoramas.

We present a novel method for the automatic removal of linear clutter from multi-perspective panoramas. Our method focuses on the removal of linear features that are situated in front of high-contrast backgrounds, such as power lines in front of the sky. Our method uses a modified Hough transform to detect problematic thin horizontal features. We remove unwanted horizontal features with a short linear filter. These steps form a method that improves the appearance of automatically constructed panoramas. Our method also reduces the amount of user intervention needed for the construction of high-quality multi-perspective imagery.
Fig. 2. (Top) Panorama stitched from a group of images taken along a street, including horizontal line multi-perspective stitching artifacts caused by power lines. (Bottom) The same scene where power line artifacts are removed. (Note: we did not intend to remove close-to-vertical lines.)
2 Background Methods for automatically detecting and removing wires from images have been developed for outdoor power line inspection and for the cinema special effects industry. In this section, we place our work in the context of past wire detection and removal methods. We also discuss limitations of past work, and we explain how our method overcomes these limitations. In collaboration with power line maintenance companies, two computer vision studies present methods for detecting power lines in aerial images. These studies enable the use of small airplanes for inspecting outdoor power lines. Yan et al. apply a Radon transform to extract line segments from power lines in aerial images [24]. Next, Yan et al. use a grouping method to link the line segments and a Kalman filter to connect the detected segments into an entire line. Mu et al. extract power lines from aerial images with a Gabor filter and a Hough transform [13]. The studies by Yan et al. and Mu et al. make the simplifying assumption that power lines are perfectly straight [13, 24]. These studies also assume that power lines are made of a special metal, which has uniform width and brightness. In contrast, our method breaks image regions into small linear parts that allow power lines to curve, and relies on contrast rather than constant color along the line. Therefore, our method succeeds in detecting linear clutter artifacts with varying width and brightness. Also, unlike these power line detection methods, our method both detects and removes the linear clutter from images. Hirani and Totsuka developed a method for removing linear clutter from video frames [8, 9]. Their method is targeted especially toward the cinema special effects community. The Hirani-Totsuka method succeeds in applications such as removing the wires that actors hang from while doing stunts and removing scratches in old film. Hirani and Totsuka achieve linear clutter removal by applying projection onto convex sets (POCS). The method is effective especially for complex backgrounds, but it is not fully automated from the user's perspective, since it requires the user to manually choose the linear clutter regions. In contrast, our linear clutter removal method is fully automated: it removes this need for user intervention by automatically extracting sky regions.
3 Linear Clutter Detection Existing methods for extracting lines from images, such as the methods discussed in Section 2, rely on either the Hough or Radon transform [3,7]. These line detection techniques alone are insufficient for removing telephone and power wires. First, these wires are usually not straight lines and form catenary curves. Second, current line detection techniques rely on edge detection output, which for a thin line appears as a pair of gradient responses, one on each side of the wire, instead of the wire itself. (We illustrate this in our Experimental Results section and in Fig. 7.) We customize these edge detection approaches to handle thin, horizontal features. We also consider that the colors of the top and bottom neighboring pixels of a linear wire are similar. This criterion further enhances our line detection by ensuring that the regions on either side of the line have similar color, in contrast to generic edge detection filters. Moreover, since wires can have different diameters, we must capture them at any width.
Fig. 3. Finding the sky region of an image. (Left) Original image. (Center) Sky confidence map. (Right) Our refined sky confidence map.
Due to the visual complexity of building structures, we are less interested in removing lines from the front of building facades, and focus primarily on thin horizontal occlusions of the sky region. Building facade structures have complex textures that themselves often contain many horizontal lines (such as window separators and bricks). We seek to avoid blurring the fine details of these building textures. Therefore, we focus on the more distracting sky-related candidate regions for line removal. Our wire removal algorithm first identifies the region of the image corresponding to the sky, and then tracks and removes thin linear features within this region. We first characterize the sky. Using the input images for the panorama, we find sky-related pixels using a depth map if available [2], SkyFinder [20], or scene interpretation [10]. We then create a 2-D (HxS) histogram of the hue and saturation values of the pixels detected as "sky," and select the most popular hue/saturation combinations as the sky color. We illustrate an example sky mask in Fig. 3. We then construct a sky mask, where each pixel in the mask takes its value from the (normalized) sky histogram for that pixel's hue and saturation. The resulting mask is noisy and contains many small non-sky regions, so we filter it using a Gaussian (or edge-preserving bilateral) low-pass smoothing filter, followed by a morphological "opening" operation, consisting of erosions followed by dilations, to remove features such as windows reflecting the sky. To extract the wire confidence map, we convolve the image with a set of vertical filters of different widths in order to find the pixels that most likely belong to horizontal lines. We define a family of filters Filter1 = [1 ... 0 ... -1]^T and a second family of filters Filter2 = [1 ... -2 ... 1]^T. Filter1 searches for pixels whose top and bottom neighbors are similarly colored. Filter2 searches for pixels that are significantly darker than their vertical neighbors. For 512x512 pixel input images, we observed that the number of pixels in both filters ranges from 3 through 11 (this range is the parameter which users need to provide before running our algorithm). We compute the quotient Filter_l(p_i) = |Filter2_l(p_i) / Filter1_l(p_i)| for each filter width l = 3, 5, ..., 11, and for each pixel p_i in the sky region. We show an example application of these filters in Fig. 4. For each pixel, we pick the largest absolute value returned over all filter sizes and scale the result by the sky region confidence map, i.e., conf(p_i) = sky(p_i) * max_{l=3,5,...,11} Filter_l(p_i). Two variables called min_line_width and max_line_width (in our example 3 and 11) need to be provided by the user.
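To make the filter family concrete, the following Python sketch (our illustration, not the authors' MATLAB implementation) builds Filter1 and Filter2 for each width, convolves a grayscale image with them, and combines the per-width quotients into a wire-confidence map weighted by the sky confidence. The function names, the epsilon guard against division by zero, and the boundary mode are our assumptions.

```python
import numpy as np
from scipy.ndimage import convolve1d

def make_filters(width):
    # Filter1 = [1, 0, ..., 0, -1]^T: near-zero response when the pixels
    # above and below a candidate line pixel are similarly colored.
    f1 = np.zeros(width)
    f1[0], f1[-1] = 1.0, -1.0
    # Filter2 = [1, 0, ..., -2, ..., 0, 1]^T: strong response when the center
    # pixel is significantly darker than its vertical neighbors.
    f2 = np.zeros(width)
    f2[0], f2[-1] = 1.0, 1.0
    f2[width // 2] = -2.0
    return f1, f2

def line_confidence(gray, sky_conf, min_w=3, max_w=11, eps=1e-3):
    """gray: grayscale image; sky_conf: per-pixel sky confidence in [0, 1]."""
    gray = np.asarray(gray, dtype=float)
    best = np.zeros_like(gray)
    for width in range(min_w, max_w + 1, 2):                 # widths 3, 5, ..., 11
        f1, f2 = make_filters(width)
        r1 = convolve1d(gray, f1, axis=0, mode='nearest')    # vertical filters
        r2 = convolve1d(gray, f2, axis=0, mode='nearest')
        quotient = np.abs(r2 / (np.abs(r1) + eps))           # |Filter2_l / Filter1_l|
        best = np.maximum(best, quotient)                    # keep the max over widths
    return best * sky_conf                                   # weight by sky confidence
```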
Fig. 4. (Top left) Original image. (Top center, top right, bottom row) Line confidence map for filter widths of 3, 5, 7, 9 and 11 pixels.
Using a generic Hough transform, some pixels will be detected that do not belong to horizontal lines. We modify the Hough transform to find candidate partial horizontal lines in the image. We remove these pixels by considering the gradient entropy at each pixel. Pixels which belong to a gently curving line should have low gradient direction entropy, so we remove pixels with high gradient direction entropy. This can easily be done by passing a smoothing filter over an image of the gradient directions of the input image of potential lines. We create four bins for angles of (0-45), (45-90), (90-135) and (135-180) degrees that are incremented when a pixel's gradient falls within that range of directions. If the entropy of a region is above 80% of the maximum entropy value for a line (since we have 4 bins, the maximum line entropy is about 1.39 [26]), the region belongs to a non-consistent gradient (clutter), so we remove it. Line segments near boundaries of sky regions can be missed by this classifier. Hence, our Hough transform's bins are restricted to horizontal angles from -45 to 45 degrees, and from the peaks of its histogram of line parameters, we find the corresponding pixels in the line image. When these detected lines end near the boundary of the sky region, we extend the line to the boundary. We also break up long line segments into smaller chunks to more accurately represent curved lines. As illustrated in Fig. 5 (right), since we want to eliminate false points on extracted lines, for each pixel in the lines detected by our modified Hough transform, we create a vertical neighborhood (in our case six pixels above and below the line pixel). We then search for the peak-contrast pixel in the vertical neighborhood to find the best corresponding point on the line. For each detected line segment, we find the highest-contrast pixel in each neighborhood and fit a regression line to these pixels. If the variance of the difference between the high-contrast pixels and the regression line exceeds a predefined threshold, then we reject the line segment.
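The gradient-entropy rejection described above can be sketched as follows; the local window size, the use of a box filter to estimate per-bin frequencies, and the function name are our assumptions, while the four angle bins and the 0.8 x ln(4) threshold follow the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def consistent_gradient_mask(gray, window=15, frac=0.8):
    """True where local gradient directions are consistent (kept);
    False where their entropy is high (rejected as clutter)."""
    gy, gx = np.gradient(np.asarray(gray, dtype=float))
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0          # direction in [0, 180)
    probs = []
    for lo in (0.0, 45.0, 90.0, 135.0):                     # the four angle bins
        member = ((angle >= lo) & (angle < lo + 45.0)).astype(float)
        probs.append(uniform_filter(member, size=window))   # local bin frequency
    probs = np.stack(probs)                                 # 4 x H x W, sums to 1 per pixel
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=0)
    return entropy < frac * np.log(4.0)                     # ln(4) ~ 1.39 for 4 bins
```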
Fig. 5. (Left) Choosing top and bottom pixels of a partial line. (Right) Finding peak pixels along a line segment.
4 Linear Clutter Removal In this step we pass a bilateral median filter over the image using a neighborhood of size (max_line_width*3, max_line_width*3), where max_line_width was defined in Section 3. Having found the peak pixels in the previous step, we create a new map consisting of the peak pixels and their vertical neighbors within filter_width distance, where filter_width refers to the filter size which had the highest return for line detection. We replace each pixel in this new removal map with its median-filtered image value, which was computed at the beginning of the removal step (Fig. 6).
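A minimal sketch of this removal step follows; we use OpenCV's plain median filter as a simpler stand-in for the bilateral median filter mentioned above, and the mask construction (dilating the peak pixels vertically by the winning filter width) reflects our reading of the description. Names such as `remove_linear_clutter` and `peak_mask` are hypothetical.

```python
import numpy as np
import cv2

def remove_linear_clutter(img, peak_mask, filter_width, max_line_width=11):
    """img: uint8 BGR image; peak_mask: boolean map of detected peak (line) pixels;
    filter_width: width of the filter with the highest line-detection response."""
    k = 3 * max_line_width
    k += (k + 1) % 2                                   # median kernel size must be odd
    blurred = cv2.medianBlur(img, k)                   # computed once, up front
    # Expand each peak pixel vertically by the winning filter width.
    kernel = np.ones((2 * filter_width + 1, 1), np.uint8)
    removal = cv2.dilate(peak_mask.astype(np.uint8), kernel).astype(bool)
    out = img.copy()
    out[removal] = blurred[removal]                    # replace with median values
    return out
```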
Fig. 6. Blurring. (Left) Original image. (Right) Blurred horizontal wires.
5 Experimental Results We implemented the linear clutter detection and removal algorithm described in Sections 3 and 4 in MATLAB. We tested the performance of each component on a 64-bit, 2.2 GHz computer. In our tests, we found that calculating the sky mask takes about 0.7 seconds in MATLAB. The subsequent wire detection steps require roughly 12 seconds of runtime per image (512x512 pixels). Blurring the image to remove linear clutter takes less than one tenth of one second. We predict that, if we implement our method in C++ instead of MATLAB, a further performance improvement would
be easily attainable. As mentioned earlier, the main and most important parameters needed by this algorithm are the minimum and maximum line widths. One of the most important aspects of our method is our unique filter, which focuses only on extracting lines that belong to wires on high-contrast backgrounds. Our method avoids extracting edges and linear features on building facades. Fig. 7 demonstrates the advantages of our method over two general edge detection techniques.
Fig. 7. Doubled lines in edge detection vs. single lines in our method. Top-left: original panorama; top-right: our method; bottom-left: Sobel; bottom-right: Canny.
A challenge to our algorithm was that building facades containing large sky-colored regions (such as reflections of the sky in windows) caused the rejection to fail on those regions, which were therefore blurred (Fig. 8).
Fig. 8. (Left) Original image. (Right) Linear clutter removal result. The problem is visible as blurred pixels on windows that match the sky color and were not excluded.
Fig. 9 shows samples of real urban scene panoramas whose linear clutter has been removed using our technique. As is visible in the images, the clutter in these panoramas has been significantly reduced. Another consideration for our method is the trade-off between blurring trees and removing all visible clutter. This affected how we chose the rejection threshold for gradient entropy. Fig. 10 shows an example of choosing different entropy thresholds. In particular, in the left image, the top of the evergreen tree is blurred due to the low entropy rejection threshold.
Fig. 9. Experimental Results on Different Urban Panoramas
Fig. 10. Effect of different rejection entropy thresholds on blurring the trees. (Left) Low threshold (note the large evergreen on the left). (Right) High threshold.
6 Conclusion We demonstrated a technique for identifying and removing linear clutter from images. This method applies to thin, quasi-horizontal, quasi-linear features that cross the sky. Our technique enhances panoramic scenes that contain power lines or other linear clutter. In future work, we could replace the removed lines with clean Bézier curves and synthetic telephone lines in order to create an exact match to the original scene. Our technique is already being integrated into a well-known urban navigation application.
References 1. Agarwala, A., Agrawala, M., Cohen, M., Salesin, D., Szeliski, R.: Photographing long scenes with multi-viewpoint panoramas. ACM Trans. Graph. 25, 853–861 (2006) 2. Battiato, S., et al.: 3D stereoscopic image pairs by depth-map generation. In: Symposium on 3D Data Processing, Visualization, and Transmission (2004) 3. Beylkin, G.: Discrete Radon transform. IEEE Trans. Acoustics, Speech, and Signal Processing 35, 162–172 (1987) 4. Blazquez, C.H.: Detection of problems in high power voltage transmission and distribution lines with an infrared scanner/video system. In: SPIE, pp. 27–32 (1994) 5. ColorPilot: Retouch Unwanted Objects on Your Photos (2011), http://www.colorpilot.com/wire.html 6. Fu, S.Y., et al.: Image-based visual servoing for power transmission line inspection robot. International J. of Modelling, Identification and Control 6, 239–254 (2009)
7. Ginkel, M.V., Hendriks, C.L., Vliet, L.J.: A short introduction to the Radon and Hough transforms and how they relate to each other. Delft University of Technology Technical Report (2004) 8. Hirani, A., Totsuka, T.: Projection Based Method for Scratch and Wire Removal from Digital Images. United States Patent US 5974194 (1996) 9. Hirani, A.N., Totsuka, T.: Combining frequency and spatial domain information for fast interactive image noise removal. In: SIGGRAPH, pp. 269–276 (1996) 10. Hoiem, D., Efros, A., Herbert, M.: Closing the loop in scene interpretation. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008) 11. Kent, B.: Automatic Identification and Removal of Objects in Image Such as Wires in a Frame of Video. United States Patent Application US 208, 053 (2008) 12. Kopf, J., Chen, B., Szeliski, R., Cohen, M.: Street slide: browsing street level imagery. ACM Trans. Graph 29 (2010) 13. Mu, C., Yu, J., Feng, Y., Cai, J.: Power lines extraction from aerial images based on Gabor filter. In: SPIE (2009) 14. Pulli, K., Tico, M., Xiong, Y.: Mobile panoramic imaging system. In: CVPRW, pp. 108– 115 (2010) 15. Rav-Acha, A., Engel, G., Peleg, S.: Minimal Aspect Distortion (MAD) Mosaicing of Long Scenes. International J. of Computer Vision 78, 187–206 (2007) 16. Roman, A., Garg, G., Levoy, M.: Interactive design of multi-perspective images for visualizing urban landscapes. IEEE Visualization, 537–544 (2004) 17. Roman, A., Lensch, H.P.: Automatic Multiperspective Images. In: Eurographics Symposium on Rendering Techniques, pp. 83–92 (2006) 18. Seymour, M.: The Art of Wire Removal (2007), http://www.fxguide.com/article453.html 19. Szeliski, R.: Image Alignment and Stitching: A Tutorial. Foundations and Trends in Computer Graphics and Vision 2, 1–104 (2006) 20. Tao, L., Yuan, L., Sun, J.: SkyFinder: Attribute-based Sky Image Search. ACM Trans. Graph. 28 (2009) 21. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: IEEE International Conf. on Computer Vision, ICCV (1998) 22. Vallance, S.: Multi-perspective images for visualisation. In: Pan-Sydney Area Symposium on Visual Information Processing, VIP (2001) 23. Xiao, Z.: Study on methods to extract transmission line information from high-resolution imagery. In: SPIE (2009) 24. Yan, G., et al.: Automatic Extraction of power lines from aerial images. IEEE Geoscience and Remote Sensing Letters 4, 387–391 (2007) 25. Zuta, M.: Wire Detection System and Method. United States Patent US 6278409 (2001) 26. Rheingold, H.: Tools for Thought: The History and Future of Mind-Expanding Technology, ch.6. The MIT Press, Redmond (2000)
Efficient Starting Point Decision for Enhanced Hexagonal Search Do-Kyung Lee and Je-Chang Jeong Department of Electronics and Computer Engineering, Hanyang University
[email protected],
[email protected].
Abstract. In order to adapt to the center-biased characteristic of motion information in real-world video sequences, an improved starting-point decision method is proposed in this paper. For precise prediction of the motion information of the current block, we refer to the motion vectors of blocks in the reference frame and the current frame. We also modify the first-step search pattern of the enhanced hexagonal search. Experimental results show that the proposed algorithm reduces computational complexity in terms of both time and search points, and improves the peak signal-to-noise ratio of the video sequences.
1
Introduction
Motion estimation (ME) is an indispensable part of many video coding standards such as MPEG-1/2/4 and H.261/263/264. It plays an important role in reducing the temporal redundancy between adjacent frames by using a Block-Matching Algorithm (BMA). Frames are divided into square-shaped blocks, so-called macroblocks (MBs). The BMA searches the reference frames (past or future frames) for the block with minimal distortion in terms of the Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), Mean Squared Error (MSE), etc. The search is commonly started from the position of the block in the current frame, and the displacement between the current block and the best-matched block is expressed as the motion vector, which has x-axis and y-axis components. The full search (FS) algorithm, which exhaustively checks all candidate blocks within the search window, is computationally intensive. Over the last two decades, many fast motion estimation algorithms have been proposed to reduce computational complexity without noticeable Peak Signal-to-Noise Ratio (PSNR) loss. More than 80% of the blocks in a video sequence can be considered stationary or quasi-stationary, which results in a center-biased global motion vector distribution instead of a uniform distribution. This implies that the chance of finding the global minimum is much higher within the center 4x4 region of the search window. Algorithms that take a coarse search in a predetermined window include the three-step search (3SS) [1], the new three-step search (N3SS) [2], the four-step search (4SS) [3], the diamond search (DS) [4], the new diamond search (NDS) [5], the hexagon-based search (HEXBS) [6], the enhanced hexagonal search (EHS) [7], and others. Compared with HEXBS, EHS improves the performance in terms of search speed and PSNR by adopting a 6-side-based fast inner search and a starting-point prediction named predictive HEXBS.
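For reference, the following Python sketch implements plain SAD-based full-search block matching for a single macroblock; it only illustrates the baseline FS/BMA idea described above and is not the algorithm proposed in this paper. Function names and the boundary handling are our assumptions.

```python
import numpy as np

def sad(a, b):
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def full_search(cur, ref, x, y, block=16, search=16):
    """Exhaustive BMA for the macroblock whose top-left corner is (x, y) in `cur`.
    Returns the best motion vector (dx, dy) and its SAD."""
    target = cur[y:y + block, x:x + block]
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > ref.shape[0] or xx + block > ref.shape[1]:
                continue                               # candidate falls outside the frame
            cost = sad(target, ref[yy:yy + block, xx:xx + block])
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```

The fast algorithms listed above visit only a small fraction of the (2*16+1)^2 = 1089 candidates that this exhaustive loop evaluates.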
As the correlation between neighboring pixels is very high, the 6-side-based fast inner search finds an appropriate motion vector while saving search points in the inner area of the hexagon. Also, since the motion vector of the current block is similar to those of its neighboring blocks, predictive HEXBS can conjecture the motion vector of the current block. In this paper, we improve predictive HEXBS. A small vector is a motion vector that is found in the inner search area when the first step is performed. As our method predicts the motion vector more precisely, the probability of finding a small vector increases dramatically. Exploiting this phenomenon, we modify the search pattern of the first step only. Section 2 briefly provides the details of EHS, and Section 3 explains the proposed algorithm for starting-point decision and adjustment of the search pattern. In Section 4, experimental results of our method are compared with FS, TSS, NTSS, NDS, HEXBS and EHS. Finally, we conclude the paper in Section 5.
2
Enhanced Hexagonal Search (EHS)
The search points of conventional HEXBS are described in Fig. 1(a), and the inner search, i.e. the square-shaped area around point 0, is accomplished by checking b, h, d, and f. One-more-step (OMS) HEXBS in [6] has four additional search points (a, c, e, g): if point b is selected in the previous step, then points a and c will be compared with point b for the detailed motion estimation. Similarly, points c and e are additionally checked if point d is selected in the previous step, points e and g are checked if point f is selected, and points a and g are checked if point h is selected.
Fig. 1. (a) The basic search points of HEXBS and EHS. Points 0, 1, 2, 3, 4, 5, and 6 are checked for the coarse search, and the square-shaped points a, b, c, d, e, f, and g are checked when the inner search proceeds. (b) The case where Group 2 has the smallest distortion among the groups. (c) The case where Group 6 has the smallest distortion among the groups.
The computational complexity of the OMS inner search is a heavy burden for ME, since we should calculate the 6 points of the inner search area other than the origin point. Therefore, EHS reduces the weight of the OMS inner search using the method indicated in the 6-side-based fast inner search. It is a group-oriented method that first divides the calculated
coarse search points into 6 groups. We evaluate a group distortion, which is the result of summing the distortions of the group members. The colored blocks in Fig. 1(b) and Fig. 1(c) are significantly correlated with their neighbors and with each other. Thus, we can reduce the number of search points in the inner search area, as the point with the smallest distortion is more likely to be located near the region with the smallest group distortion. As shown in Fig. 1(b), only the three additional points placed near Group 2 or Group 5 are included when we determine the motion vector. Also, two inner points are added if Group 1, Group 3, Group 4, or Group 6 has the minimum group distortion, as shown in Fig. 1(c). Basically, it is obvious that the correlation of motion vectors between the current block and its neighboring blocks is very high. By exploiting this idea, predictive HEXBS utilizes the motion information of neighboring blocks to predict the motion vector of the current block. EHS uses the upper and left neighboring blocks, calculating the distortion for the two motion vectors of the neighboring blocks and the zero motion vector. Since EHS does not check all search points in the search window, predictive HEXBS helps EHS find better motion vectors and save search points.
3
Proposed Algorithm
Since the operations for evaluating the distortion between two blocks are costly, reducing the number of search points during motion estimation is an effective way to speed it up. Many fast motion estimation algorithms adopt various pattern shapes, like the diamond and hexagonal searches. However, because these are coarse search patterns, the algorithms have weak points. In Section 3.1, we introduce how to obtain a better starting point before motion estimation, which can reduce search points and improve PSNR. In Section 3.2, as the starting point becomes more reliable, we modify the search pattern of EHS to reduce search points further.
3.1
An Efficient Starting Point Decision
Predictive HEXBS in EHS [7] refers to the left and upper blocks (Block A and Block B in Fig. 2) of the current block to relocate the starting point. The zero motion vector, i.e. the motion vector of the current block, is compared with the motion vectors of the reference blocks in
Fig. 2. The location of reference blocks in the current frame. We also use the co-located block in the reference frame.
terms of distortion. However, as the predicted motion vector of predictive HEXBS is not always the best answer, we can supplement the candidate motion vectors by adopting additional reference blocks not only in the current frame but also in the reference frame, as shown in Fig. 2. The co-located block of the reference frame is placed at the same spatial position as the current block; its motion vector is a more reliable candidate since it is highly correlated with the motion vector of the current block. Also, we can predict the motion vector well using Blocks A, B, C, D, E, F, and G in the current frame and the co-located block in the reference frame.
Table 1. The number of times each block is selected as the best candidate reference in terms of distortion. The specific locations of the current block, co-located block, and Blocks A–G are indicated in Fig. 2.
Sequence      # of frames  Current Block  Co-located Block  Block A  Block B  Block C  Block D  Block E  Block F  Block G
akiyo         300          114,918        78,275            368      804      139      9        81       53       10
bus           150          4,918          48,921            3,402    1,755    236      451      73       53       26
football      150          26,889         20,492            5,523    5,705    2,272    868      1,328    862      625
foreman       300          41,601         40,469            16,812   12,427   3,636    1,969    6,549    1,166    897
hall_monitor  300          105,920        24,106            2,072    4,993    655      327      379      1,861    675
mobile        300          27,229         90,435            2,484    1,844    291      193      510      87       32
stefan        300          36,301         53,334            21,872   8,215    2,051    2,153    1,068    456      352
table         300          80,612         48,689            3,745    2,594    1,020    567      644      509      306
tempete       260          68,095         39,904            3,811    2,641    868      761      323      208      181
Sum                        506,483        444,625           60,089   40,978   11,168   7,298    10,955   5,255    3,104
possibility                46.47%         40.79%            5.51%    3.76%    1.02%    0.67%    1.01%    0.48%    0.28%
As shown in Table 1, the motion vector of the current block is selected most often, since many blocks in a frame are classified as static. The proportion for the zero vector, i.e. the motion vector of the current block, is about 46%. Usually, a video sequence has high correlation not only spatially but also temporally, so the motion information of temporally adjacent frames overlaps. Thus, the motion vector distribution of the current frame is almost identical to that of the reference frame, and the motion vector of the co-located block scores highly as the optimal motion vector. The probability for the co-located block is 40.79%, which is a notable result for predicting the motion vector of the current block, so we should utilize the motion vector of the co-located block for prediction. Although the proportions for Blocks A, B, C, D, E, F, and G are quite low, they should not be ignored because they influence the experimental results in terms of PSNR. Consequently, we utilize the reference motion information located at Blocks A, B and C, and also refer to the zero vector and the motion vector of the co-located block. The Adjustable Partial Distortion Search (APDS) algorithm [11], one of the partial distortion search algorithms, is used only for evaluating the relocated starting point. It is an improved version of the Normalized Partial Distortion Search (NPDS) algorithm [10] and achieves remarkable performance in terms of speed-up and PSNR. APDS is a suitable method to reduce block-matching time without noticeable PSNR loss.
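A minimal sketch of the starting-point decision is given below, assuming the candidate set contains the zero vector, the co-located block's motion vector, and the motion vectors of Blocks A, B and C; for simplicity we evaluate candidates with plain SAD rather than the APDS partial-distortion evaluation mentioned above, and the function names are ours.

```python
import numpy as np

def sad(a, b):
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def choose_starting_point(cur, ref, x, y, candidates, block=16):
    """candidates: candidate motion vectors (dx, dy), e.g. the zero vector, the
    co-located block's MV, and the MVs of Blocks A, B and C."""
    target = cur[y:y + block, x:x + block]
    best_cost, best_mv = None, (0, 0)
    for dx, dy in candidates:
        yy, xx = y + dy, x + dx
        if yy < 0 or xx < 0 or yy + block > ref.shape[0] or xx + block > ref.shape[1]:
            continue
        cost = sad(target, ref[yy:yy + block, xx:xx + block])
        if best_cost is None or cost < best_cost:
            best_cost, best_mv = cost, (dx, dy)
    return best_mv                      # the hexagonal search then starts from this offset
```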
3.2
Modification of Search Pattern
By evaluating an improved starting point for the current block, we can precisely predict the motion vector of the current block. Since EHS does not check all points in the search window, precise prediction of the motion vector is a useful tool for improving EHS. When we utilize the starting point decision algorithm for EHS, the motion vectors become concentrated in the inner search area, which is the region of colored blocks in Fig. 1(a). A change of the search pattern is needed to reduce search points following this more precise prediction.
Fig. 3. Frame-by-frame comparison of the number of small vectors for the Hall Monitor CIF sequence
As shown in Fig. 3, the number of small vectors increases in almost every frame when we apply the starting point decision method of Section 3.1 to EHS. The coarse search points are first located in a region where the motion vector is expected to exist; in the conventional hexagonal search, this region consists of the vertex points of the large hexagon shown in Fig. 1(a). Thus, points 0, 1, 2, 3, 4, 5 and 6 are checked first; if point 0 has the smallest distortion, the inner search is performed to get the final motion vector. If one of the other points has the minimum distortion, that point becomes the origin of a new large hexagonal pattern. Since the probability of a small vector is increased by using the new strategy for relocating the starting point, we need to modify the search pattern of the first step of EHS. The modified algorithm can be summarized in the following steps.
STEP 1. The new inner search pattern, consisting of points 0, b, d, f and h, is checked first. If the center point has the minimum distortion, the starting point defined in the previous section is the final solution of the motion vector; otherwise, proceed to STEP 2.
STEP 2. Using the distortion of the center point (point 0) from the previous step, a large hexagonal search is performed for points 1, 2, 3, 4, 5 and 6. If the center point still has the minimum distortion, go to STEP 3; otherwise, the conventional EHS is performed starting from the point with the minimum distortion.
STEP 3. Since the center point still has the minimum distortion among the large hexagon points, one of the points b, d, f and h will be the final solution of the motion vector. Thus, the point with the smallest distortion among these four points gives the final motion vector.
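The three steps can be sketched as follows in Python; SAD (or any block distortion) is abstracted as a `cost` callback, the inner points b, d, f, h are assumed to be the 4-connected neighbours of the center, and the hand-over to the conventional EHS large-hexagon recursion is only signalled, not implemented.

```python
INNER = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]                   # points 0, b, d, f, h
LARGE_HEX = [(-2, 0), (-1, -2), (1, -2), (2, 0), (1, 2), (-1, 2)]    # points 1..6

def modified_first_step(cost):
    """cost(dx, dy) -> block distortion at the candidate displaced by (dx, dy) from
    the relocated starting point.  Returns (offset, final): final=False means the
    conventional EHS large-hexagon recursion should continue from `offset`."""
    inner = {p: cost(*p) for p in INNER}
    if min(inner, key=inner.get) == (0, 0):
        return (0, 0), True            # STEP 1: center wins; the starting point is final
    hexagon = {p: cost(*p) for p in LARGE_HEX}
    best_hex = min(hexagon, key=hexagon.get)
    if hexagon[best_hex] < inner[(0, 0)]:
        return best_hex, False         # STEP 2: hand over to conventional EHS from best_hex
    side = {p: v for p, v in inner.items() if p != (0, 0)}
    return min(side, key=side.get), True   # STEP 3: best of b, d, f, h is final
```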
Fig. 4. Frame-by-frame comparison of the number of search points for the Mobile CIF sequence
As shown in Fig. 4, the proposed algorithm, which combines the starting point decision with the modified search pattern, is always better than the combination of EHS and the starting point decision alone. This means that the modified search pattern yields a remarkable reduction in search points. In the first frame, the number of search points is higher than in other frames, since the first frame cannot refer to the motion vector of the co-located block in a reference frame. From the following experimental results, we observe that the proposed algorithm achieves nearly a 31.4% speed improvement over EHS in terms of search points.
4
Experimental Results
To verify the performance of the proposed algorithm, the FS, TSS, NTSS, NDS, HEXBS and EHS algorithms are compared with the proposed algorithm. The experimental setup is as follows: the distortion measure is the sum of absolute differences (SAD), the search window is ±16 pixels in both the horizontal and vertical directions, and the block size is 16x16. Nine representative CIF video sequences, "Akiyo" (300 frames), "Bus" (150 frames), "Football" (150 frames), "Foreman" (300 frames), "Hall monitor" (300 frames), "Mobile" (300 frames), "Stefan" (300 frames), "Table" (300 frames) and "Tempete" (260 frames), were used for demonstration.
Table 2. Experimental results of the proposed algorithm in terms of PSNR (dB) and seconds per frame

Sequence (frames)   Metric     FS      TSS     NTSS    NDS     HEXBS   EHS     Proposed
akiyo (300)         PSNR       42.34   42.19   42.33   42.01   41.28   41.92   42.27
                    sec/frame  0.4354  0.0366  0.0281  0.0082  0.0078  0.0078  0.0041
bus (150)           PSNR       25.59   24.27   24.41   22.24   22.00   23.57   25.12
                    sec/frame  0.4770  0.0388  0.0414  0.0142  0.0112  0.0089  0.0067
football (150)      PSNR       24.08   23.46   23.29   22.94   22.78   23.27   23.53
                    sec/frame  0.4838  0.0400  0.0382  0.0110  0.0094  0.0087  0.0071
foreman (300)       PSNR       31.81   30.72   29.63   29.43   29.19   30.99   31.48
                    sec/frame  0.4423  0.0380  0.0363  0.0123  0.0093  0.0085  0.0075
hall monitor (300)  PSNR       34.63   34.56   34.57   34.50   34.39   34.46   34.53
                    sec/frame  0.4345  0.0372  0.0293  0.0090  0.0073  0.0076  0.0055
mobile (300)        PSNR       25.04   24.57   24.99   24.24   24.36   24.20   24.93
                    sec/frame  0.5103  0.0424  0.0435  0.0105  0.0080  0.0081  0.0057
stefan (300)        PSNR       23.90   22.43   23.39   20.96   20.91   23.40   23.93
                    sec/frame  0.4546  0.0381  0.0386  0.0126  0.0100  0.0087  0.0072
table (300)         PSNR       31.46   30.11   30.25   30.01   29.56   30.41   30.93
                    sec/frame  0.5065  0.0422  0.0359  0.0108  0.0084  0.0084  0.0056
tempete (260)       PSNR       27.79   27.62   27.68   27.22   27.30   26.75   26.82
                    sec/frame  0.5273  0.0445  0.0395  0.0111  0.0088  0.0089  0.0056
average             PSNR       29.63   28.88   28.95   28.17   27.98   28.77   29.28
                    sec/frame  0.4746  0.0398  0.0368  0.0111  0.0089  0.0084  0.0061
As shown in Table 2, we report the motion estimation time in seconds per frame, because the sequences have different numbers of frames. NTSS, which focuses the motion field around the zero vector, performs well on the Akiyo sequence, since most of the motion information in Akiyo is a zero vector or a vector within ±1 pixel. Since the motion activity of Football, Stefan, Bus and Mobile is high and includes global motion such as zoom-in or zoom-out, it is not easy to predict the motion information or find the optimal motion vector. Thus, for these sequences, the average PSNR is lower than for the others. The proposed algorithm performs well in terms of both PSNR and speed-up. It is clearly seen that the proposed algorithm achieves the fastest motion estimation among the compared algorithms. The proposed algorithm also improves quality in terms of PSNR, but there is still a noticeable PSNR loss compared with FS. Because TSS, NTSS, NDS, HEXBS, EHS and the proposed algorithm are coarse search methods, some PSNR loss cannot be avoided. Compared with EHS, the proposed algorithm has about a 0.5 dB PSNR gain and is 27.4% faster in terms of seconds per frame. As shown in Table 3, we also compare with FS, HEXBS and EHS in terms of search points per block. The number of search points per block of the FS algorithm is always 1089 (33x33), because FS checks every point in the search window (±16 pixels). The average value of the proposed algorithm is less than that of EHS. It may seem that only a few points are saved by the proposed algorithm, but if a sequence has 300 frames, there are 118,404 blocks to be evaluated for motion vectors, and we can save almost 134,980 search points compared with EHS. It is obvious that the proposed algorithm has a small implementation cost.
Table 3. Experimental results of the proposed algorithm compared with FS, HEXBS and EHS in terms of search points per block

Method     akiyo   bus    football  foreman  hall monitor  mobile  stefan  table  tempete  average
FS         1089    1089   1089      1089     1089          1089    1089    1089   1089     1089
HEXBS      11.10   8.84   7.31      15.80    11.57         13.16   16.34   12.88  10.76    11.97
EHS        11.32   5.99   6.22      12.16    11.34         11.06   12.33   11.63  9.92     10.22
Proposed   8.86    5.13   6.04      12.25    9.77          9.25    12.03   9.76   8.65     9.08
5
Conclusion
In this paper we proposed a new, efficient starting point decision method for EHS. EHS utilizes only the motion information of the left and upper blocks and the zero motion vector, so it cannot predict the starting point precisely. We compensate for this defect by referring not only to motion information in the current frame but also in the reference frame. To reduce search points further, we additionally proposed a modified search pattern for EHS. Simulation results show that the proposed algorithm is the fastest among the BMAs compared in our experiments. Also, the video quality in terms of PSNR is significantly improved. Thus, the proposed algorithm is an appropriate motion estimation method for a wide range of video applications such as low-bitrate video conferencing. Acknowledgement. This work was supported by the Brain Korea 21 Project in 2011.
References 1. Koga, T., Iinuma, K., Hirano, A., Iijima, Y., Ishiguro, T.: Motion compensated interframe coding for video conferencing. In: Proc. Nat. Telecommun. Conf., New Orleans, LA, pp. G5.3.1–G5.3.5 (November–December) 2. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 4, 438–443 (1994) 3. Po, L.M., Ma, W.C.: A novel four-step search algorithm for fast block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 6, 313–317 (1996) 4. Tham, J.Y., Ranganath, S., Ranganath, M., Kassim, A.A.: A novel unrestricted center-biased diamond search algorithm for block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 8(4), 369–377 (1998) 5. Zhu, S., Ma, K.K.: A new diamond search algorithm for fast block-matching motion estimation. IEEE Transactions on Image Processing 9(2), 287–290 (2000) 6. Zhu, C., Lin, X., Chau, L.P.: Hexagon-based search pattern for fast block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 12, 349–355 (2002) 7. Zhu, C., Lin, X., Chau, L.P.: Enhanced hexagonal search for fast block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 14, 1210 (2004) 8. Hosur, P.I., Ma, K.K.: Motion vector field adaptive fast motion estimation. In: 2nd International Conference on Information, Communications and Signal Processing (ICICS 1999), Singapore (December 1999)
9. Tourapis, A.M., Au, O.C., Liou, M.L.: Predictive motion vector field adaptive search technique (PMVFAST) enhancing block based motion estimation. In: SPIE Conf. on Visual Communication and Image Processing, pp. 883–892 (January 2001) 10. Cheung, C.K., Po, L.M.: Normalized partial distortion algorithm for block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 10(3), 417–422 (2000) 11. Cheung, C.K., Po, L.M.: Adjustable partial distortion search algorithm for fast block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 13(1), 100–110 (2003)
Multiview 3D Pose Estimation of a Wand for Human-Computer Interaction X. Zabulis, P. Koutlemanis, H. Baltzakis, and D. Grammenos Institute of Computer Science - FORTH, Herakleion, Crete, Greece
Abstract. A method is proposed that visually estimates the 3D pose and endpoints of a thin cylindrical physical object, such as a wand, a baton, or a stylus, that is manipulated by a user. The method utilizes multiple synchronous images of the object to cover wide spatial ranges, increase accuracy and deal with occlusions. Experiments demonstrate that the method can be applied in real-time using modest and conventional hardware and that the outcome suits the purposes of employing the approach for human computer interaction.
1
Introduction
Ubiquitous computing and ambient intelligence have introduced more natural ways of interacting with computers than the conventional keyboard and mouse. Recent trends in human computer interaction indicate the plausibility of tangible and natural interaction through modest hardware. The Nintendo Wii was the first popular system to provide this functionality, based on acceleration measurements and visual tracking of LEDs. The Sony PlayStation Move visually tracks the 3D location of luminous spheres, while the Microsoft Kinect sensor employs depth maps to infer user limb locations. Though used in everyday life, pointing objects such as a wand or a baton have not pervaded such interfaces, despite the fact that, aside from location, they also convey orientation information. This work aims to provide a means of explicit interaction that is based on visually tracking a thin cylindrical rod manipulated by a user in 3D space, by estimating its location and orientation (pose) in real time; henceforth, we call this object a wand. To estimate its pose, without any assumptions on its size, at least two synchronous views are required. More views can be utilized to increase accuracy, treat occlusions, and cover wider areas. To deal with various environments, two wand detection approaches are proposed, one based on color and another on luminous intensity; in the latter case the wand is a light source. Intensity-based detection is simpler, but requires instrumentation of the wand. The remainder of this paper is organized as follows. In Sec. 2 related work is reviewed. In Sec. 3 an overview of the proposed method is provided, which is analytically formulated in Sec. 4. In Sec. 5, experiments which evaluate the accuracy, performance and usability of the approach are presented. In Sec. 6, conclusions are provided.
2
Related Work
In the domains of ubiquitous computing and ambient intelligence, physical objects are blended into interaction, comprising "tangible interfaces" [1]. In this work, the user-interface item is decoupled from the services that it may provide. This study concerns the applicability of a wand as an item of explicit, real-time interaction. Given such a means, dedicated platforms can then be used to incorporate such items into system interaction [2,3]. The need for a pointing device as an interaction item is underscored in [4], where a PDA is visually tracked to emulate a wand in an augmented reality environment. Multiview tracking of markers on the PDA provides pose information for augmenting the virtual wand, whereas this work employs a physical wand. This need is also found in efforts to capture human pointing gestures, i.e. [5], a task that, to date, has not been fully achieved. To the best of our knowledge, a visually tracked wand in 3D has not been proposed as a means of interaction. The most relevant work is [6], where a pipette is tracked from a single view to indicate points on a plane. Markers are utilized to track the wand, at a relatively slow rate (4 Hz). The geometrical basis of the proposed approach is the reconstruction of a straight 3D line segment from multiple images. We follow conventional approaches [7] in multiview geometry to combine its multiple observations. An early approach to the problem is [8], formulated for 3 views. In [9] lines are only affinely reconstructed. More pertinent is the approach in [10], but it is iterative, as its goal is to match multiple segments. The approach in [11] is also relevant, but it assumes the existence of multiple interconnected line segments to detect endpoints, information which is not available in our case. In contrast to [12], we cannot assume a short baseline, as the user may be performing rapid motions. We do not employ stereo [13], as it yields inaccuracies on thin objects.
3
Method Overview
A wand of unknown size is imaged from multiple views and may be occluded totally or partially in some. In each view, it is segmented and modeled as a line segment that approximates the projection of its major axis on the image plane (due to perspective distortion, this projection does not coincide with the medial axis of the 2D contour, but for thin objects we assume it is a reasonable approximation). When segmentation is successful in 2 views, the object's size and pose can be estimated. If more views are available, they are combined to increase accuracy. A synchronized and calibrated multicamera system is assumed. Each camera i is located at κi and has a projection matrix Pi. The image from each camera is compensated for lens distortion directly after its acquisition, forming image Ii. The output is the 3D line segment, represented by its endpoints e1,2. The main steps of the proposed method are the following (see Fig. 1):
1. Segmentation. Each image Ii is binarized into image Mi to extract the wand. Segmentation may contain errors, such as spurious blobs, while the wand may not be fully visible due to occlusions.
2. 2D modeling. The wand is sought in each Mi, using the Hough Transform (HT) [14], which yields 2D line li. Input to the HT is provided by a thinning of Mi. A line segment grouping process upon li determines the endpoints of the wand in Ii.
3. 3D pose estimation. The line L where the segment lies is estimated as the one minimizing reprojection error to the observed line segments. Endpoint estimates are obtained from their 2D observations. In this process, outlier elimination is crucial as, due to occlusions and segmentation errors, the object may not be fully visible in all views.
4. Motion estimation. Improves accuracy and robustness.
Fig. 1. Method overview (see text)
As the method aims to support real-time interaction, computational performance is of significance. Due to the large amount of data provided by multiple views, we strive for massive parallelization. Thus, techniques are formulated to be executed in parallel on finely partitioned data. For the same reason, single-pass techniques are preferred over iterative ones. The CPU is pipelined at the end of operations to perform sequential processing, which is applied to very little data, in the tasks of outlier elimination and motion estimation.
4
Method Implementation
4.1
Image Acquisition and Segmentation
Acquired images are uploaded from RAM to the GPU. For color images an additional byte per pixel is added to achieve coalesced memory access and efficient use of the texture cache. This byte also facilitates optimal execution of image interpolation on GPU hardware, which is employed to compensate for lens distortion and is applied immediately after acquisition to provide image Ii .
Next, Ii is segmented into binary image Mi where, ideally, pixels have the value of 1 if they image the wand and 0 otherwise. Segmentation errors which may occur are treated in subsequent steps. Depending on the setup, segmentation can be based on color or intensity; Fig. 2 demonstrates two such results. Both segmentation versions are parallelized per pixel. Using a wand of characteristic color, selected to be scarcely encountered in the scene, Mi is obtained as follows. A color similarity metric [15], robust to variations of illumination conditions, yields a similarity score per pixel. Each result is thresholded to generate binary image Mi. Using a bi-colored wand (see Table 1), the direction of the wand (besides orientation) can be disambiguated. In this case, the color similarity metric is applied twice, once for each color, and the results are merged in Mi. A luminous wand is segmented by intensity thresholding Ii to obtain Mi. This approach is more robust to illumination artifacts and accidental color similarity. Practically, a brief shutter time, e.g. Fig. 2 (right), suffices for accurate segmentation of moderately luminous objects.
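The two segmentation alternatives can be illustrated with the following OpenCV sketch; the HSV range test is a simplified stand-in for the colour-similarity metric of [15], and the colour range and intensity threshold are placeholder values, not the ones used by the authors.

```python
import numpy as np
import cv2

def segment_by_color(img_bgr, lo_hsv=(100, 120, 60), hi_hsv=(130, 255, 255)):
    """Binary mask M_i for a wand of characteristic color (here a blue-ish HSV range)."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, np.array(lo_hsv), np.array(hi_hsv))

def segment_by_intensity(img_bgr, thresh=240):
    """Binary mask M_i for a luminous wand imaged with a brief shutter time."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return mask
```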
Fig. 2. Image segmentation. Examples of images Ii and segmentations Mi , for color (left) and intensity (right).
4.2
2D Wand Modeling
The output of this step is line segment si , approximating the projection in Ii of the wand’s major axis. This is achieved, first, by thinning Mi to obtain image Ti . Then, the HT on Ti estimates the line li in Ii , that this projection lies. Finally, a grouping process upon li determines si . All tasks are performed in the GPU. Thinning. This process performs a thinning of Mi so that an elongated foreground blob, such as the wand’s segmentation, is reduced to (an approximation of) its major axis image projection. Due to perspective projection, the wand does not exhibit constant thickness in Ii and, thus, a different amount of thinning is required at each image locus. To parallelize computation, a single-pass operation is employed, which estimates wand thickness at each image point and applies the proportional amount of thinning (see Fig. 3). First, Mi is convolved with a disk kernel D of a diameter large enough to be “thicker” than the wand in Mi . This results in image Qi . Pixels of Mi that were 0 are set to be 0 also in Qi . A priori, the convolution response of D for a range of line widths is computed and stored in a lookup table. Using this table,
each pixel in Qi is converted to an estimate of wand thickness at that locus. In essence, this is an optimization of scale selection [16], outputting the distance of each pixel to its closest boundary. Next, convolution of Mi with a wide Gaussian kernel provides image Si. In Si, the wand appears as a smooth intensity ridge. The gradient ∇Si is then calculated. The thinned image is obtained through a non-maximum suppression process along the direction of ∇Si, applied on image |∇Si|. For each pixel p with Mi(p) ≠ 0, the value at p is suppressed unless Si(p) is a local intensity maximum along the direction of ∇Si(p). The spatial extent of this suppression is equal to the local width of the wand, as provided by Qi(p). That is, pixel Ti(p) is 1 if Si(p) > Si(p + α · v) holds for all α, where v is the unit vector along ∇Si(p) and α ∈ [−Qi(p), ..., Qi(p)] − {0}, and 0 otherwise.
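A simplified Python sketch of this width-adaptive non-maximum suppression follows; it substitutes a distance transform for the disk-convolution lookup table used to estimate local width, and it loops over foreground pixels for clarity rather than running as a parallel single-pass GPU kernel.

```python
import numpy as np
import cv2

def thin_wand_mask(mask, sigma=5.0):
    """mask: uint8 binary segmentation M_i (0/255).  Returns the thinned image T_i."""
    # Distance to the closest boundary stands in for the disk-convolution lookup (Q_i).
    half_width = cv2.distanceTransform((mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    S = cv2.GaussianBlur(mask.astype(np.float32), (0, 0), sigma)   # smooth ridge image S_i
    gy, gx = np.gradient(S)
    norm = np.hypot(gx, gy) + 1e-9
    vx, vy = gx / norm, gy / norm                                  # unit gradient direction
    T = np.zeros(mask.shape, np.uint8)
    h, w = mask.shape
    for y, x in zip(*np.nonzero(mask)):
        q = max(int(round(half_width[y, x])), 1)
        is_max = True
        for a in range(1, q + 1):                                  # width-adaptive extent
            for s in (-1, 1):
                yy = int(round(y + s * a * vy[y, x]))
                xx = int(round(x + s * a * vx[y, x]))
                if (yy, xx) != (y, x) and 0 <= yy < h and 0 <= xx < w and S[yy, xx] >= S[y, x]:
                    is_max = False
                    break
            if not is_max:
                break
        if is_max:
            T[y, x] = 1                                            # keep only ridge maxima
    return T
```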
Fig. 3. Image thinning on a detail of the image in Fig. 2(right). Left to right: Mi , Qi (warmer colors indicate greater sizes), and |∇Si | with gradient direction vectors and the thinning result superimposed. The length of the plotted (green) vectors matches the corresponding size-estimate in Qi and indicates the spatial extent of nonmax suppression. The resulting “1” pixels of Ti are superimposed as red dots.
Line estimation. Pixels marked as 1 in Ti are passed to the HT, to estimate li . For each pixel p in Ti that T (p) = 1, corresponding locations in Hough-space are incremented by 1. This is performed in parallel for each pixel, but since concurrent threads may access the same pixel in Hough space, operations are serialized through atomic operations. The global maximum pm in Hough-space determines li and is passed to the next processing step. 2D line segment detection. Due to occlusions and segmentation errors, the wand does not appear as a single segment along li , while spurious blobs may also be encountered. A traversal of Mi is performed along li and connected components are labeled. Very small segments are attributed to noise and are disregarded. Size-dominant segments along li are grouped if they are separated by a distance smaller than τd ; the value of τd is set by adapting the line grouping metric in [17] for the 1D domain of line li . The longest detected segment is selected and its endpoints identified. If a bi-colored wand is employed, the matched color of each point is stored. If the length of the resulting segment is very small, the wand is considered not to be detected. The process is demonstrated in Fig. 4.
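For illustration, the Hough-based line estimate and the selection of foreground pixels near that line could look as follows with OpenCV; the vote threshold and the distance tolerance are our assumptions, and the subsequent 1-D grouping of connected components along the line is omitted.

```python
import numpy as np
import cv2

def estimate_wand_line(thinned, votes=40):
    """thinned: uint8 binary image T_i (0/255).  Returns (rho, theta) of the strongest
    line l_i found by the Hough transform, or None if no line is detected."""
    lines = cv2.HoughLines(thinned, rho=1, theta=np.pi / 180, threshold=votes)
    if lines is None:
        return None
    return tuple(lines[0][0])          # strongest accumulator peak

def pixels_near_line(mask, rho, theta, max_dist=2.0):
    """Foreground pixels of M_i within max_dist of l_i; these feed the grouping of
    connected components along the line."""
    ys, xs = np.nonzero(mask)
    dist = np.abs(xs * np.cos(theta) + ys * np.sin(theta) - rho)
    keep = dist <= max_dist
    return xs[keep], ys[keep]
```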
Fig. 4. Line segment detection. Original image Ii (left), Mi (middle), and Mi with li and si superimposed (right); the red line is li and the green segment is si .
4.3
3D Line Segment Estimation
First, the line L where the 3D segment lies is estimated, and then its endpoints e1, e2 are estimated. When there are more than 2 views, the problem is overdetermined and an optimization approach is adopted to increase accuracy. Line estimation. For each view j in which a line segment sj is detected, we define a plane Wj (see Fig. 1). This plane is defined by the camera center κj and two image points on sj. The 2D endpoints of the segment can be used for this purpose, as it is of no concern whether the wand is fully imaged. Their 3D world coordinates on the image plane are found as the intersection of the rays through these points with that plane (see [7], Eq. 6.2.2). When the wand is detected in 2 views, j = 1, 2, L is the intersection of W1 and W2. If the wand is detected in n > 2 views, the plane Wj of each view is considered. Ideally, planes Wj should intersect in the same 3D line; however, due to noise and calibration errors this is hardly the case. Hence, the following optimization process is followed. Planes Wj are represented in an n × 4 matrix A. Each row of A represents a plane Wj, containing its 4 equation parameters. Let A = U D V^T be the Singular Value Decomposition of A. The two columns of V corresponding to the 2 largest singular values span the best rank-2 approximation to A (see [7], p. 323) and are used to define L. The Maximum Likelihood estimate of L is found by minimizing a geometric image distance between its image projection in image j and the measured line segment sj, in all Ij. A geometric distance metric for line segments is adapted from [18] and, in our case, formulated as d = (d1² + d2²)^(1/2), where d1,2 are the 2D point-to-line distances between the endpoints of sj and the 2D projection of the considered line (candidate L). This provides L, the line minimizing the sum of distance errors between its projections and the line segments sj. Endpoint estimation. A pair of views, say (k, j), is required to obtain an estimate of the wand's endpoints, e1 and e2. We consider the 2D endpoints in view k and the rays from κk through these endpoints. The corresponding two intersections of these rays with Wj provide an estimate for e1 and e2 each. The task is performed for all pairs where j ≠ k, providing multiple point estimates.
The 3D estimates are then clustered by a connected component labeling process: two points are assigned the same label if they are closer than τa. We assume that the two clusters with the greatest cardinality correspond to the endpoints and that the remaining points, if any, are outliers; besides noise, an outlier may be due to the fact that the wand is not fully imaged in some view. The images of the inliers from each cluster are triangulated, using Maximum Likelihood Triangulation, to obtain the reconstruction of each endpoint. The endpoint estimates e1,2 are the projections of these points on L. For a bi-colored wand, each point is associated with a color (see Sec. 4.2) and, thus, 3D estimates are associated with physical endpoints.
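The SVD-based line estimate described at the beginning of this subsection can be sketched in a few lines of NumPy; the input is assumed to be the stacked plane vectors Wj (each normalized, e.g., to a unit normal), and the returned point/direction pair is only the algebraic initialization, not the Maximum Likelihood refinement.

```python
import numpy as np

def line_from_planes(planes):
    """planes: n x 4 array; each row [a, b, c, d] is a back-projection plane W_j
    (ax + by + cz + d = 0).  Returns (point, direction) of the 3D line L."""
    A = np.asarray(planes, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    p1, p2 = Vt[0], Vt[1]              # two dominant right singular vectors = two planes
    n1, n2 = p1[:3], p2[:3]
    direction = np.cross(n1, n2)
    direction /= np.linalg.norm(direction)
    # A point on the line: least-squares solution of the two plane equations.
    M = np.vstack([n1, n2])
    b = -np.array([p1[3], p2[3]])
    point, *_ = np.linalg.lstsq(M, b, rcond=None)
    return point, direction
```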
4.4
Motion Estimation
Tracking improves the accuracy of pose estimation and corrects errors, e.g. when the wand is transiently lost or pose estimation is extremely inaccurate. The trajectory of the wand is tracked over time in 6D space using a Kalman filter. To implement the filter we assume a 12D state vector x(t) given as:

x(t) = [p(t); a(t); p'(t); a'(t)]^T    (1)

where p(t) = [px(t), py(t), pz(t)] is the wand's first endpoint, a(t) = [ax(t), ay(t), az(t)] is the normalized direction vector pointing to the second endpoint, and p'(t) and a'(t) are the corresponding derivatives with respect to time (the speed components). The state vector x(t) is not directly observable. Instead, at each time instant t, we observe the vector y(t) = [p1(t); p2(t)]^T, our 6D measurement vector, which consists of the Cartesian coordinates of the two endpoints of the wand, p1(t) and p2(t). The resulting state-space model is described by the following equations:

x(t) = F x(t − 1) + w(t)    (2)
w(t) ∼ N(0, U(t))    (3)
y(t) = H x(t) + v(t)    (4)
v(t) ∼ N(0, Cy(t))    (5)

where w(t), v(t) are independent, zero-mean Gaussian processes with covariances U(t) and Cy(t), representing the transition and the observation noise at time instant t, respectively. F is the state transition matrix, which propagates the current state to the next frame and is selected to satisfy:

x(t) = F x(t − 1) = [p(t−1) + p'(t−1); a(t−1) + a'(t−1); p'(t−1); a'(t−1)]^T    (6)

H is the observation matrix, which implements the relation of the hidden state with the measurement vector:

y(t) = H x(t) = [p(t); d a(t)]^T    (7)

where d is the length of the wand, estimated from previous frames. The state vector x(t) and its 12×12 covariance matrix Cx(t) are estimated recursively using the Kalman filter equations [19].
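A compact NumPy version of this constant-velocity Kalman filter is sketched below; the noise covariances are placeholder scalars, the time step is assumed to be one frame, and H is formed so that the predicted measurement is [p; p + d·a], i.e. the two endpoints, which is one consistent reading of Eqs. (4) and (7).

```python
import numpy as np

def make_kalman(dt=1.0, q=1e-3, r=1e-2, d=0.58):
    """Constant-velocity model for the 12D state [p; a; p'; a'] with a 6D
    measurement of the two endpoints; d is the wand length (here 0.58 m)."""
    F = np.eye(12)
    F[0:6, 6:12] = dt * np.eye(6)      # p <- p + dt*p',  a <- a + dt*a'
    H = np.zeros((6, 12))
    H[0:3, 0:3] = np.eye(3)            # first endpoint:  p1 = p
    H[3:6, 0:3] = np.eye(3)            # second endpoint: p2 = p + d*a
    H[3:6, 3:6] = d * np.eye(3)
    U = q * np.eye(12)                 # process (transition) noise covariance
    Cy = r * np.eye(6)                 # measurement noise covariance
    return F, H, U, Cy

def kalman_step(x, Cx, y, F, H, U, Cy):
    """One predict/update cycle; x: 12-vector, Cx: 12x12 covariance, y: 6-vector."""
    x_pred = F @ x
    P_pred = F @ Cx @ F.T + U
    S = H @ P_pred @ H.T + Cy
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - H @ x_pred)
    Cx_new = (np.eye(12) - K @ H) @ P_pred
    return x_new, Cx_new
```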
5
Experiments
To evaluate accuracy at different spatial scales, as well as scalability with respect to the number of views, the method has been tested in the following setups:
1. A 7-camera cluster in a 5 × 5 m2 room. Cameras are mounted at the ceiling, viewing it peripherally from a height of ≈ 2.5 m, through a 66° × 51° FOV.
2. A trinocular system installed 40 cm above an enhanced school desk. The maximum baseline is 71.5 cm and the cameras verge at the center of the table, configured at a FOV of 43° × 33°.
Image resolution was 960 × 1280 pixels, except when modulated to measure its effect on accuracy and computational performance. The computer hosting these cameras employed an nVidia GeForce GTX 260 1.2 GHz GPU.
5.1
Accuracy and Computational Performance
To the best of our knowledge, there is currently no publicly available multiview dataset for pose estimation of a wand annotated with high-precision ground truth data. Thus, such a dataset was created [20]. The dataset was collected using a 58 cm wand, mounted on a tripod with 2 degrees of freedom (pitch, yaw) and marked rotation gratings. The dataset sampled a wide range of poses, consisting of 360° yaw rotations in steps of 10° (36 yaw positions per rotation). The pitch angles of these rotations ranged from −70° to 80°, in steps of 10°. Occlusions were present, as in some views the wand was occluded by the tripod. To study the effects of resolution and number of views on the accuracy of estimates, they were modulated as shown in Table 1. We conclude that the method is sufficiently accurate for the purpose of indicating points in space and that accuracy degrades gracefully with the reduction of input data. We observe that the results for 7 views are marginally more accurate than those for 4 views. Thus, in this setup, utilization of more than 4 views provides an advantage only in the presence of further occlusions.
Table 1. Left: Indicative image from the dataset used for accuracy estimation. Right: Mean error and standard deviation results.
         480 × 640                     960 × 1280
Views    Yaw          Pitch           Yaw          Pitch
2        2.0° (3.1°)  1.0° (1.4°)     1.4° (4.1°)  0.8° (1.9°)
3        1.2° (1.1°)  0.7° (0.7°)     1.2° (1.1°)  0.7° (0.7°)
4        1.4° (1.6°)  0.9° (1.2°)     0.9° (1.0°)  0.6° (0.7°)
7        1.2° (1.2°)  0.8° (0.9°)     0.9° (1.0°)  0.6° (0.6°)
We performed two experiments to measure the performance of the method. First, for each step of the method, GPU execution time was measured, averaged over 1000 frames, and compared to a reference CPU implementation, for 4 views at 960 × 1280 resolution; see Table 2 (left). Second, we measured performance while modulating the number of views and the image resolution; see Table 2 (right). We observe that the method is efficiently parallelized on the GPU and that it scales linearly with the amount of input.

Table 2. Performance measurements. Left: Execution time for each computational step. Right: Total execution time for different numbers of views and image resolutions.

Computational Step              CPU        GPU      Speedup
Lens distortion compensation    17.3 ms    1.3 ms   13.3
Image segmentation              230.8 ms   1.2 ms   192.3
Smoothing                       20.6 ms    2.5 ms   8.2
Thickness estimation            44.5 ms    4.0 ms   11.1
Non-max suppression             8.1 ms     1.1 ms   7.4
Line detection                  38.8 ms    7.0 ms   5.5

Views    480 × 640    960 × 1280
2        30 Hz        15 Hz
3        30 Hz        10 Hz
4        22 Hz        7 Hz

5.2 Usability Studies
In order to test the usability, accuracy and response time of the method as perceived by end-users, 3 pilot applications were implemented. Characteristic images from these experiments are shown in Fig. 5.
Fig. 5. Images from usability experiments. Top: snapshots from the “room” experiment; right image shows the piano example. Bottom: (i, left) A user draws shapes by bringing the stylus in contact with a desk and dragging it; drawn shapes are virtual and superimposed on an original image from a system’s camera, as the projections of the points of contact with the surface. (ii, middle) A user brings a stylus in contact with predefined page regions of a book, to receive content-sensitive information. (iii, right) Image from the “game” experiment, where a player controls a hypothetical saber in the virtual world rendered on the screen.
A room control application was created using the first setup. Each wall in the room provides visual output through a projector or a TV screen. On the ceiling, there are 9 computer-controlled RGB spotlights. A computer-controlled door (open/close) exists on one wall. Initially, the 4 available screens show a paused video, the spotlights are off, and the door is closed. A test participant and an observer enter the room through another door. The observer picks up a 2-color wand (≈ 0.5 m long) from the floor and showcases its use. When the wand is pointed at a screen, the respective video starts playing. When the wand stops pointing at it, the video pauses. If the wand points at any of the spotlights, that spotlight turns green and, if another was previously lit, it turns off. When the wand points at the door, a knocking sound is heard. If the wand remains pointing at the door for 1 s, the door opens if closed, and vice versa. After the demonstration, the wand is given to the test participant, who freely experiments with it. To test more fine-grained actions, a piano keyboard is projected on a wall and can be played by pointing at its keys. A colored dot is projected on the wall at the position where the wand is (estimated to be) pointing, to provide feedback.

Game. Using the first setup, a game was developed. The goal was to determine whether the system's accuracy was sufficient and its latency small enough to support dexterous and rapid interaction. The user stands in front of a TV screen, using a 58 cm wand as a saber. The system captures the pose of the wand and reconstructs it within a 3D gaming environment, which is rendered on the screen. Using the wand, the user controls the virtual saber to “hit” incoming targets in the form of spheres coming towards him/her.

Desk. Using the second setup, we employed the method to track a 14 cm stylus in order to provide interaction of a user with (i) a planar surface and (ii) a physical book. The corresponding goals of the experiment were to determine whether the system is sensitive enough to detect the contact of the stylus with the surface and whether the system could be used to indicate regions of interest within pages of the book. In the second case (ii), an additional system [21] recognizes book pages and provides the 3D structure of the book page. The 3D endpoints of the stylus are monitored, and when one is approximately in contact with (i) the planar surface or (ii) the book, a pertinent event is triggered.

Discussion. After running several sessions with more than 20 users, a number of positive and negative aspects of the system started emerging, which will be tested more formally in subsequent evaluation sessions. Positive aspects of the system included the following. First, accuracy and response were considered adequate for the type of tasks that the participants experimented with. The “desk” experiment yielded an error of less than 3 mm in the detection of contact with a surface. Also, employing a non-technological object for interacting with the environment made a very positive impression. The ease of use was deemed high, as it was intuitive and obvious. Finally, participants liked that a single (yet simple) object could be used to control diverse technologies. On the negative side, it was realized that the wand suffers from the “Midas Touch” problem [22]. The user may accidentally issue commands in the
environment while moving it towards the intended interaction target. Typically, this problem is overcome through the use of “dwell” time, additional explicit commands (e.g., buttons, switches, speech), or gestures. Also, since the wand must be visible in at least two views, there were room regions (e.g., corners) that were not covered by the system.
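As an illustration of the dwell-time mitigation mentioned above, the following minimal sketch only issues a command after the wand has pointed at the same target for a sustained period; the 1 s threshold mirrors the door example, while the class and function names are hypothetical.

```python
import time

DWELL_SECONDS = 1.0  # command fires only after continuous pointing for this long

class DwellFilter:
    """Suppress accidental commands: emit a target only after sustained pointing."""
    def __init__(self, dwell=DWELL_SECONDS):
        self.dwell = dwell
        self.current = None   # target currently being pointed at (or None)
        self.since = None     # time when pointing at it began

    def update(self, target, now=None):
        now = time.monotonic() if now is None else now
        if target != self.current:          # pointing moved to a new target (or none)
            self.current, self.since = target, now
            return None
        if target is not None and now - self.since >= self.dwell:
            self.since = now                # re-arm so the command does not repeat every frame
            return target                   # sustained pointing: issue the command
        return None
```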
6 Conclusion
A method that estimates the 3D pose of a wand, despite occlusions and partial wand appearances, has been described and evaluated. Evaluation of the proposed method demonstrates that it is accurate, robust and intuitive to use, and that it can be employed in a variety of user applications. Additionally, a multicamera dataset annotated with ground truth was compiled and made publicly available, to facilitate the evaluation of similar methods. Future work includes multiview line-segment matching, to support multiuser interaction.

Acknowledgements. This work was supported by the FORTH-ICS internal RTD Programme “Ambient Intelligence and Smart Environments” as well as the European Commission under contract numbers FP7-248258 (First-MM project) and FP7-270435 (JAMES project). The authors thank Manolis I. A. Lourakis and Antonis A. Argyros for fruitful conversations in the formulation of the proposed approach.
References

1. Ishii, H., Ullmer, B.: Tangible bits: towards seamless interfaces between people, bits and atoms. In: CHI, pp. 234–241 (1997)
2. Greenberg, S., Fitchett, C.: Phidgets: easy development of physical interfaces through physical widgets. In: UI Software and Technology, pp. 209–218 (2001)
3. Ballagas, R., Ringel, M., Stone, M., Borchers, J.: iStuff: A physical user interface toolkit for ubiquitous computing environments. In: CHI, pp. 537–544 (2003)
4. Simon, A., Dressler, A., Kruger, H., Scholz, S., Wind, J.: Interaction and co-located collaboration in large projection-based virtual environments. In: IFIP Conference on Human-Computer Interaction, pp. 364–376 (2005)
5. Nickel, K., Stiefelhagen, R.: Visual recognition of pointing gestures for human-robot interaction. Image and Vision Computing 25, 1875–1884 (2007)
6. Hile, H., Kim, J., Borriello, G.: Microbiology tray and pipette tracking as a proactive tangible user interface. In: Pervasive Computing, pp. 323–339 (2004)
7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision (2004)
8. Ayache, N., Lustman, F.: Fast and reliable passive trinocular stereovision. In: ICCV, pp. 422–427 (1987)
9. Quan, L., Kanade, T.: Affine structure from line correspondences with uncalibrated affine cameras. PAMI 19, 834–845 (1997)
10. Baillard, C., Schmid, C., Zisserman, A., Fitzgibbon, A.: Automatic line matching and 3D reconstruction of buildings from multiple views. In: ISPRS Conference on Automatic Extraction of GIS Objects from Digital Imagery (1999)
11. Martinec, D., Pajdla, T.: Line reconstruction from many perspective images by factorization. In: CVPR, pp. 497–502 (2003)
12. Moons, T., Frère, D., Vandekerckhove, J., Van Gool, L.: Automatic modelling and 3D reconstruction of urban house roofs from high resolution aerial imagery. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 410–425. Springer, Heidelberg (1998)
13. Woo, D., Park, D., Han, S.: Extraction of 3D line segment using disparity map. In: Digital Image Processing, pp. 127–131 (2009)
14. Duda, R., Hart, P.: Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15, 11–15 (1972)
15. Smith, R., Chang, S.: VisualSEEk: a fully automated content-based image query system. In: ACM Multimedia, pp. 87–89 (1996)
16. Lindeberg, T.: Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention. IJCV 11, 283–318 (1993)
17. Lowe, D.: 3D object recognition from single 2D images. Artificial Intelligence 3, 355–397 (1987)
18. Kang, W., Eiho, S.: 3D tracking using 2D-3D line segment correspondence and 2D point motion. In: Computer Vision and Computer Graphics Theory and Applications, pp. 367–380 (2006)
19. Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82, 35–42 (1960)
20. Koutlemanis, P., Zabulis, X.: (2011), http://www.ics.forth.gr/cvrl/wand/
21. Margetis, G., Koutlemanis, P., Zabulis, X., Antona, M., Stephanidis, C.: A smart environment for augmented learning through physical books (2011)
22. Jacob, R.: The use of eye movements in human-computer interaction techniques: what you look at is what you get. ACM Trans. Inf. Syst. 9, 152–169 (1991)
Material Information Acquisition Using a ToF Range Sensor for Interactive Object Recognition

Md. Abdul Mannan, Hisato Fukuda, Yoshinori Kobayashi, and Yoshinori Kuno

Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570, Japan
{mannan,fukuda,yosinori,kuno}@cv.ics.saitama-u.ac.jp
Abstract. This paper proposes a noncontact active vision technique that analyzes the reflection pattern of infrared light to estimate the object material according to the degree of surface smoothness (or roughness). To obtain the surface microstructural details and the surface orientation information of a free-form 3D object, the system employs only a time-of-flight range camera. It measures reflection intensity patterns with respect to surface orientation for objects of various materials. It then classifies these patterns with a Random Forest (RF) classifier to identify the candidate material of the reflecting surface. We demonstrate the efficiency of the method through experiments using several household objects under normal illumination conditions. Our main objective is to introduce material information, in addition to color, shape and other attributes, to recognize target objects more robustly in the interactive object recognition framework.
1 Introduction
There is a growing interest in developing service robots that can work in our daily environments such as hospitals, offices and homes. Such service robots need a robust vision system to recognize various objects to carry out their tasks. However, even state-of-the-art vision methods are still not robust enough to perform object recognition without fail. Interactive object recognition is a promising framework to solve this problem. In this framework, robots ask users to provide information about the objects that they cannot recognize. Kuno et al. [1] have proposed an interactive object recognition system that can recognize objects through verbal interaction with the user on color and spatial relationships among objects. Besides these attributes we may use material information to indicate target objects, as in “Bring me that wooden toy,” or “Give me the paper cup.” This paper proposes a material information acquisition method for interactive object recognition. Since surface optical reflection properties are related to object material, we examine the surface reflection property with a time-of-flight range camera. The visual representation of an object's surface depends on several factors: the illumination condition, the geometric structure of the surface and the surface reflectance properties, often characterized by the bidirectional reflectance distribution function (BRDF) [2-5]. We consider this BRDF to recognize object material. Our material recognition method for 3D free-form objects involves two key tasks: measurement of object surface orientation and reflection pattern analysis. The surface
orientation measurement determines the local surface normal direction with respect to the viewing direction, measured as an angle. The reflection pattern determines how the local surface reflection intensity is distributed as the orientation changes.

In 3D object recognition, the key problems are how to represent free-form surfaces effectively and how to determine the surface orientation. In [6], Besl et al. used mean and Gaussian curvatures to represent surface patch types; these form the basis of 3D shape analysis using differential geometry. Later, several researchers [7-9] used this technique to extract geometric information about local surfaces. Recently, the design of 3D range sensors has received significant attention. 3D data collected by range sensors can provide geometric information about objects.

Several researchers have already worked on identifying object material in a noncontact manner by analyzing the surface reflectance properties of the object. In [10] the authors use several low-level and middle-level features to characterize various aspects of material appearance. Although they use a challenging material database for their experiments, their accuracy is still very low. Orun et al. [11] have introduced a method that integrates the bundle adjustment technique, to estimate local surface geometry, with laser-surface interaction, to examine the microstructure of the material surface. In that experimental setup, they use two laser light sources and a pair of CCD cameras. Due to this instrumental complexity and the fine adjustment it requires, the method may be inappropriate for recognizing household objects by service robots in a home environment. Moreover, the method needs a 2 W YAG laser source, which is invisible and harmful to human eyes. In addition, the paper does not clarify how material color or interference from visible light affects the results. Another active vision technique has been proposed by Tian et al. [12], where a structured-light-based vision system is introduced to investigate surface roughness, defects, and waviness. This method also needs a complex instrumental setup and demanding illumination conditions. Furthermore, in [13] researchers have proposed an optical measurement mechanism that enables non-contact assessment of the Poisson ratio and effective stiffness of object material. This method uses a laser-generated ultrasound probe, and surface damage is very common. Very recently, another method has been proposed in [14] to classify the material of real objects by investigating the degree of surface roughness. In that research the authors introduce a noncontact active vision technique using a time-of-flight range sensor. Although the method yields promising results, it has a major limitation: it works well only for some regularly shaped objects and cannot deal with complex shapes.

In this paper, we propose a method that overcomes the limitations mentioned above. To investigate the surface characteristics we exploit the geometric and microstructural information of the surface together with its infrared light scattering pattern. In order to estimate the geometric properties of each point on the surface, we fit a quadratic surface to the local window centered at each point and then use differential geometry to calculate the orientation and curvature of each local surface patch. Our method is able to investigate the surface of any free-form real object of any color.
We also propose a light reflection model modified from the Torrance-Sparrow model. After analyzing the reflectance properties of the surface, the system classifies objects into several classes according to their surface roughness. The method is applicable to service robots in home environments as well as to industrial purposes. This active vision technique uses infrared light as its source, and only infrared light within a certain band of
frequencies can reach the sensor. Thus it is not affected by visible light. Another major advantage that makes this method suitable for robot applications is its simplicity: our proposed scheme only needs a 3D range camera and nothing else. Such a time-of-flight range camera has already been used for localization, mapping and object shape recognition in robotics. Hence, in robot vision applications the method does not need any extra equipment.
2 Surface Reflection Model
The proportion of electromagnetic energy reflected from a surface depends upon the nature of the surface, or its micro-particle size, and the wavelength of the striking energy. There are two main types of surface reflection: specular reflection and diffuse reflection [15-17]. To analyze the surface reflection pattern we need an appropriate mathematical model that describes the various reflection parameters. Several researchers have already worked in this field to investigate the pattern of light reflected from various surfaces [18-19]. The cornerstone of geometric reflectance modeling of rough surfaces in computer vision and in computer graphics over the past two decades has been the Torrance-Sparrow model [20]. This is the most popular model among those that aim to incorporate the effect of roughness into the specular reflectance component. The calculation of reflectance is based on geometrical optics.

Fig. 1. Reflection geometry

In our study we modify the Torrance-Sparrow model to represent the surface reflectance components. In this modified model we neglect the geometrical attenuation and Fresnel terms; instead, we add an ambient term, since there is a possibility of multiple reflections from other objects. Our model is represented by
I = Iin [ Ka + Kd cos θd + (Ks / cos θv) exp(−Ψ² / γ²) ]    (1)
where Iin is the strength of the incident light; Ka, Kd, Ks and γ are the ambient reflectance, the diffuse reflectance, the specular reflectance, and the surface roughness parameter, respectively; θd is the angle between the light source vector L and the surface normal vector N; θv is the angle between the viewing vector V and the surface normal vector; and Ψ is the angle between the half vector H and the surface normal vector, as shown in Fig. 1. A small value of γ indicates a smooth surface, while reflection from a rough surface has a larger γ value.
3 Use of a Time-of-Flight Range Sensor
The proportions of the two types of reflected light and their directions are highly dependent upon the surface material type, or the surface's microscopic characteristics. If the size of
micro-particles or irregularities on the surface is smaller than the wavelength of the incident light, then the surface is considered a completely smooth surface. However, in the case of real-world objects, the micro-particles on a surface are not all of the same size. The proportion of specular and diffuse reflection for a particular surface depends on the wavelength of the incident light. Hence, for a particular light, if we estimate the amounts of the reflected diffuse part and the specular part, we can estimate the degree of surface roughness. To measure the degree of surface roughness, we have to select a light of suitable wavelength so that it yields the most discriminating feature among various surfaces. We choose infrared light because its wavelength lies midway between visible light and microwaves, and CCD arrays respond well to it. If we used visible light, we would get indistinguishable amounts of diffuse or specular reflection from surfaces with significant roughness variation. The same holds for light of larger wavelengths. Furthermore, visible light does not give color-independent reflection. A fuller explanation is given in [14]. Therefore, we use a 3D range imaging device, the SwissRanger SR4000 [21], which has its own infrared light source to project onto the scene. The device determines the 3D position of each pixel. The image of the scene is projected on the CCD array inside the camera. The device also has an optical filter in front of its CCD sensor panel to allow only near-infrared light to reach the sensor, so visible light from other, unwanted sources does not affect the CCD array output.

However, the SR4000 has a consistent measurement error (±1 cm) for distances up to 5 meters. Similar to other sensors using modulated light, the ToF camera suffers from ray scattering due to its inability to distinguish depths that differ by a multiple of the wavelength of the modulated signal; its image frames are susceptible to additional noise, which produces falsified depth maps. The falsified depth maps (noise) perceived in the corrupted frames may discourage the use of ToF cameras for identifying object material. This noisy behavior affects the building of realistic maps of the object surface, and may hinder the ability of the system to estimate the material accurately. To overcome this problem we introduce a technique for the refinement of falsified or noisy depth maps of the object surface. We divide the surface of the object into small segments and consider each segment as a quadratic surface. We fit a quadratic surface, represented by equation (2), to a small local window (of size 5 × 5 pixels) on the object surface and recalculate the depth value of the center pixel. We shift the local window from left to right and top to bottom to recalculate the depth value of each pixel on the surface. Fig. 2 (a) and (b) show a distorted and a reconstructed surface.

The received infrared light reflected by a target object contains three components: the specular component, the diffuse component, and the ambient component, which is a gross approximation of multiple reflections from the wall, table and other objects in the scene. In this 3D imaging device, both the image sensor and the light source are placed at the same position (θd = θv = Ψ in equation (1)). Thus the sensor receives the maximum reflection from a surface if its orientation is directed toward the sensor.
If the surface orientation deviates from this setting, the amount of total received reflection decreases. This decreasing pattern is unique for surfaces of equal roughness and is determined by the parameter γ in equation (1).
Fig. 2. Depth map of (a) a noisy surface and (b) a noise-free surface
4 Material Classification Methods
To obtain the reflection pattern, we evaluate the reflected intensity values I and orientation angles Ψ of surface patches. In our approach, the total surface of an object is divided into small segments or patches. We define a “surface patch” as a small region on the surface: each pixel on the surface, together with its surrounding pixels, constitutes a patch. In order to estimate the geometric information of each patch, we first fit a quadratic surface to each patch and use the least squares method to estimate the parameters of the quadratic surface. Equation (2) represents a quadratic surface, where a, b, c, d, e, and f denote the surface parameters. By using differential geometry, we calculate the surface normal n, the Gaussian and mean curvatures K, H, and the principal curvatures k1,2 [8-9][22].

z(x, y) = a x² + b y² + c x y + d x + e y + f    (2)

In our study, we consider as feature patches only those surface patches that do not have large shape variation. The shape variation can be determined by the shape index. The shape index (SI) is a quantitative measurement of surface shape at a point. At any pixel position (i, j) the shape index is defined by equation (3), where k1 and k2 are the maximum and minimum principal curvatures, respectively. With this definition all shapes are mapped into the interval from 0 to 1 [23].

SI(i, j) = 1/2 − (1/π) tan⁻¹ [ (k1(i, j) + k2(i, j)) / (k1(i, j) − k2(i, j)) ]    (3)
Comparatively, convex surfaces have larger shape index values while concave surfaces have smaller ones; planar surfaces have intermediate shape index values. Therefore the shape index value represents the shape of a patch properly. From these values we select feature patches that are comparatively planar. A result of feature patch selection is shown in Fig. 3 (magnified images), where the feature patches are marked by small squares.
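The per-patch computation described in this section can be sketched as follows: a least-squares fit of the quadratic surface of Eq. (2) to a small depth window, followed by standard differential-geometry formulas for the normal, the principal curvatures and the shape index of Eq. (3). The coefficient ordering and window handling are assumptions made for illustration, not the authors' exact implementation.

```python
import numpy as np

def fit_quadratic_patch(depth_window, spacing=1.0):
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f (Eq. 2)
    to a small depth window; x, y are centered on the middle pixel."""
    h, w = depth_window.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = (xs - w // 2).ravel() * spacing
    y = (ys - h // 2).ravel() * spacing
    z = depth_window.ravel()
    A = np.column_stack([x**2, y**2, x*y, x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs  # a, b, c, d, e, f  (f is the refined center depth)

def patch_geometry(coeffs):
    """Normal, principal curvatures and shape index (Eq. 3) at the patch center."""
    a, b, c, d, e, f = coeffs
    zx, zy, zxx, zyy, zxy = d, e, 2 * a, 2 * b, c
    n = np.array([-zx, -zy, 1.0])
    n /= np.linalg.norm(n)
    g = 1.0 + zx**2 + zy**2
    K = (zxx * zyy - zxy**2) / g**2                                   # Gaussian curvature
    H = ((1 + zy**2) * zxx - 2 * zx * zy * zxy + (1 + zx**2) * zyy) / (2 * g**1.5)  # mean curvature
    disc = np.sqrt(max(H**2 - K, 0.0))
    k1, k2 = H + disc, H - disc                                       # principal curvatures, k1 >= k2
    if np.isclose(k1, k2):
        SI = 0.5                                                      # planar / umbilic point
    else:
        SI = 0.5 - np.arctan((k1 + k2) / (k1 - k2)) / np.pi
    return n, k1, k2, SI
```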
Fig. 3. Range image showing feature patches by white squares

Fig. 4. Reflectance pattern and the fitted curve for a paper roll (measured intensity and fitted curve vs. orientation angle in degrees)
In order to determine the patch orientation with respect to the viewing or illuminating direction, we calculate the angle δ between the patch normal and the viewing direction by equation (4):

δ = (180°/π) · cos⁻¹ ( (n · pc) / (‖n‖ ‖pc‖) )    (4)

The viewing direction vector can be represented by the patch center vector pc. We compute the intensity value for each patch by averaging the intensity values of the pixels on the patch. We can then obtain the reflection pattern showing the relationship between the patch orientation and the patch intensity, as shown in Fig. 4.

Fig. 5. The normalized reflection curves for wood, fabric, paper and plastic (normalized intensity vs. orientation angle in degrees)

We have considered two methods to recognize object material from such reflection patterns. The first fits a reflection pattern with our newly introduced modified Torrance-Sparrow model, represented by equation (1); using the least squares method we calculate the surface roughness parameter γ. We call this the parameter estimation method. The second is called the pattern classification method. Fig. 5 shows normalized reflection curves for four material classes obtained in our preliminary experiment. We obtain the curves by fitting the Torrance-Sparrow model to the measured data. In the parameter estimation method, we calculate the surface roughness parameter γ from these curves. However, we have found that the computed parameter values sometimes
vary considerably even for reflection curves that appear similar. Therefore we have devised a method to classify reflection curve patterns directly into object material categories. We prepare 90-dimensional feature vectors by arranging the intensity values taken at orientation angles from 0 to 89 degrees in the fitted curves. We construct a Random Forest classifier from these features.
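A minimal sketch of the two methods is given below. It assumes the reconstructed form of Eq. (1) with the sensor and IR source co-located (so the three angles coincide and the incident strength is absorbed into the reflectance coefficients), and uses SciPy's curve_fit and scikit-learn's RandomForestClassifier as stand-ins for the authors' implementations; the normalization of the 90-dimensional feature by its value at 0° is also an assumption.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.ensemble import RandomForestClassifier

def reflection_model(angle_deg, Ka, Kd, Ks, gamma):
    """Modified Torrance-Sparrow-style model (cf. Eq. 1) with the sensor and
    IR source co-located, so theta_d = theta_v = psi = angle."""
    t = np.radians(angle_deg)
    return Ka + Kd * np.cos(t) + Ks * np.exp(-t**2 / gamma**2) / np.cos(t)

def estimate_gamma(angles_deg, intensities):
    """Parameter estimation method: least-squares fit, return roughness gamma."""
    p0 = [0.1, 0.5, 0.5, 0.5]  # illustrative initial guess
    popt, _ = curve_fit(reflection_model, angles_deg, intensities, p0=p0, maxfev=10000)
    return popt[3]

def reflection_feature(angles_deg, intensities):
    """Pattern classification method: 90-dim vector of fitted-curve values
    at 0..89 degrees, normalized by the value at 0 degrees."""
    popt, _ = curve_fit(reflection_model, angles_deg, intensities,
                        p0=[0.1, 0.5, 0.5, 0.5], maxfev=10000)
    curve = reflection_model(np.arange(90), *popt)
    return curve / curve[0]

# Training and prediction (X: list of 90-dim features, y: material labels such as
# "paper", "wood", "fabric", "plastic"):
# clf = RandomForestClassifier(n_estimators=100).fit(np.vstack(X), y)
# material = clf.predict(reflection_feature(angles, intensities)[None, :])
```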
5 Experimental Results
To perform the experiments we arranged 14 household objects, shown in Fig. 6. All differ in size, shape and color. The objects are divided into 4 material groups: plastic, paper, wood, and fabric. Besides these, we also consider another class, although it is not directly involved in our main experiments. This class consists of objects that have very smooth and glossy surfaces, such as ceramics and steel. Due to the highly smooth and glossy surface, the infrared light reflected from the surface becomes large and the CCD array of the SR4000 becomes saturated. The device cannot measure the surface depth map accurately for such objects. Hence we do not include those objects in our main experiment; instead we categorize them into an extra class called the glossy class. If the system encounters any object that causes the CCD array of the sensor to saturate, the system considers it a glossy object.
Fig. 6. Intensity images of various household objects (plastic, paper, wood and fabric training and test objects) taken by the SwissRanger SR4000
5.1 Parameter Estimation Method
We performed experiments 9 times for each object to compute the surface roughness parameter γ by the parameter estimation method. Fig. 7 shows the error bars of the estimated γ with the maximum and minimum values. Although the estimated parameters generally indicate the surface roughness, the parameter estimation results are somewhat unstable, showing large variances. Since the model equation takes a quite complex form, the estimation results change considerably with small changes in the reflection patterns.

5.2 Pattern Classification Method
In our reflection pattern classification experiment, among our 14 experimental objects we took 2 objects from each class and measured 10 reflection patterns for each object to train the system. We then performed recognition experiments 5 times for each of the remaining objects to test the method. The recognition rate of the method is 86.7 %. The confusion matrix is shown in Table 1. We consider this recognition rate quite reasonable, because the surface roughness of objects actually varies considerably even among objects of the same material.

Table 1. Confusion matrix for the pattern classification method (5 cases for each test object)
            Plastic   Wood   Paper   Fabric
Plastic     10        0      0       0
Wood        0         9      2       0
Paper       0         1      3       1
Fabric      0         0      0       4

Fig. 7. Surface roughness (value of γ) of the 4 classes of objects (1: paper, 2: wood, 3: fabric, 4: plastic)
6 Interactive Object Recognition Using Material Information
Our method alone may not give a high material recognition rate. However, the method is useful in the interactive object recognition framework because we can usually reduce the number of possible candidate objects by combining it with selection by other attributes. Fig. 8 (a) and (b) show a simple example case. Here, Object A is made of plastic and its color is gray, Object B is a white paper cup, and Object C is made of white ceramic. The user may first say, “White one,” if she/he wants Object B. The robot can choose Objects B and C as candidates. Then, if the user says, “Paper cup,” the robot can understand that Object B is the user's target object by using our material recognition method.
Fig. 8. (a) Range image of the scene used to identify each object's material (A: plastic, B: paper, C: ceramic) and (b) color image used to identify each object's color (A: gray plastic, B: white paper, C: white ceramic) in the interactive object recognition framework
7 Conclusion
We have proposed a method for identifying object material by considering the degree of surface roughness (or smoothness) using a ToF range sensor. Surface roughness depends on the size of the micro-particles composing the material. We use a modified version of the Torrance-Sparrow model for modeling light reflection. We have demonstrated the feasibility of the method by performing several experiments using fourteen free-form household objects made of four materials. The range sensor gives surface orientation data and reflectance values. Since the original function of the sensor is to obtain the 3D shapes of objects, we can develop an object recognition system with this sensor that considers object material as well as shape. Human users may ask a robot, “Get that metal box,” or “Give me a plastic box.” Our material recognition method can be useful in such interactive object recognition. We are now developing such a robot vision system.

Acknowledgement. This work was supported in part by JSPS KAKENHI (19300055, 23300065).
References

1. Kuno, Y., Sakata, K., Kobayashi, Y.: Object Recognition in Service Robot: Conducting Verbal Interaction on Color and Spatial Relationship. In: Proc. IEEE 12th ICCV Workshop (HIC), pp. 2025–2031 (2009)
2. Nicodemus, F.: Directional Reflectance and Emissivity of an Opaque Surface. Applied Optics 4(7), 767–773 (1986)
3. Dana, K.J., Van-Ginneken, S.K., Koenderink, J.J.: Reflectance and Texture of Real World Surfaces. ACM Transactions on Graphics 18(1), 1–34 (1999)
4. Jensen, H.W., Marschner, S., Levoy, M., Hanrahan, P.: A Practical Model for Subsurface Light Transport. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (2001)
5. Pont, S.C., Koenderink, J.J.: Bidirectional Texture Contrast Function. International Journal of Computer Vision 62(1-2), 17–34 (2005)
6. Besl, P.J., Jain, R.C.: Three-dimensional Object Recognition. ACM Computing Surveys 17(1), 75–145 (1985)
7. Lo, T.-W.R., Paul Siebert, J.: Local Feature Extraction and Matching on Range Image: 2.5D SIFT. Computer Vision and Image Understanding 113(12), 1235–1250 (2009)
8. Bhanu, B., Chen, H.: Human Ear Recognition in 3D. In: Workshop on Multimodal User Authentication, pp. 91–98 (2003)
9. Bayramoglu, N., Aydin Alatan, A.: Shape Index SIFT: Range Image Recognition Using Local Feature. In: International Conference on Pattern Recognition, pp. 352–355 (2010)
10. Liu, C., Lavanya, S., Adelson, E.H., Rosenholtz, R.: Exploring Features in a Bayesian Framework for Material Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 239–246 (2010)
11. Orun, A.B., Alkis, A.: Material Identification by Surface Reflection Analysis in Combination with Bundle Adjustment Technique. Pattern Recognition Letters 24(9-10), 1589–1598 (2003)
12. Tian, G.Y., Lu, R.S., Gledhill, D.: Surface Measurement Using Active Vision and Light Scattering. Optics and Lasers in Engineering 45(1), 131–139 (2007)
13. Culshaw, B., Pierce, G., Jun, P.: Non-contact Measurement of the Mechanical Properties of Materials Using an All-optical Technique. IEEE Sensors Journal 3(1), 62–70 (2003)
14. Mannan, M.A., Das, D., Kobayashi, Y., Kuno, Y.: Object Material Classification by Surface Reflection Analysis with a Time-of-Flight Range Sensor. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Chung, R., Hammound, R., Hussain, M., Kar-Han, T., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6454, pp. 439–448. Springer, Heidelberg (2010)
15. Wyszecki, G., Stiles, W.S.: Color Science, 2nd edn. Wiley, New York (1982)
16. Shafer, S.A.: Using Color to Separate Reflection Components. Color Research & Application 10(4), 210–218 (1985)
17. Tominaga, S., Wandell, A.B.: The Standard Surface Reflectance Model and Illuminant Estimation. Journal of the Optical Society of America A 6(4), 576–584 (1989)
18. Angel, E.: Interactive Computer Graphics: A Top-Down Approach Using OpenGL, 3rd edn. Addison-Wesley, Reading (2003)
19. Phong, B.T.: Illumination for Computer Generated Pictures. Communications of the ACM 18(6), 311–317 (1975)
20. Torrance, K.E., Sparrow, E.M.: Theory for Off-Specular Reflection from Roughened Surfaces. Journal of the Optical Society of America 57(9), 1105–1112 (1967)
21. http://www.swissranger.com
22. Suk, M., Bhandarker, M.S.: Three-Dimensional Object Recognition from Range Image. Springer-Verlag New York, Inc., Secaucus (1992)
23. Dorai, C., Jain, A.K.: COSMOS—A Representation Scheme for 3D Free-Form Objects. IEEE Trans. Pattern Analysis and Machine Intelligence 19(10), 1115–1130 (1997)
A Neuromorphic Approach to Object Detection and Recognition in Airborne Videos with Stabilization*

Yang Chen, Deepak Khosla, David Huber, Kyungnam Kim, and Shinko Y. Cheng

HRL Laboratories, LLC, Malibu, CA 90265
Abstract. Research has shown that the application of an attention algorithm to the front-end of an object recognition system can provide a boost in performance over extracting regions from an image in an unguided manner. However, when video imagery is taken from a moving platform, attention algorithms such as saliency can lose their potency. In this paper, we show that this loss is due to the motion channels in the saliency algorithm not being able to distinguish object motion from motion caused by platform movement in the videos, and that an object recognition system for such videos can be improved through the application of image stabilization and saliency. We apply this algorithm to airborne video samples from the DARPA VIVID dataset and demonstrate that the combination of stabilization and saliency significantly improves object recognition system performance for both stationary and moving objects.
1 Introduction

Object or target recognition in aerial videos has been a topic in machine vision research for many years. The traditional approach to this problem involves a two-step process: (1) detecting moving objects and tracking them over a certain number of video frames to select one or more regions of interest (ROI) in the frames, and (2) applying an object recognition algorithm on these ROIs, which may be bounding boxes or tight-fitting polygons. Unfortunately, this approach is limited in that it can only detect and recognize moving objects. Most applications with aerial videos involve both static and moving objects; thus, the use of both form and motion features is required to adequately detect all objects. The brute-force solution to the recognition problem from a moving platform involves performing raster-scan recognition over the entire frame so as to cover both static and moving objects, which suffers from a high processing load. Also, depending on the recognition method selected, it may be necessary to process the images at several scales (e.g., HMAX [1,2]), further increasing the processing load. There is a need for fast and robust algorithms that detect potential ROIs with static and moving objects in aerial videos with high accuracy, which can then be processed by the recognition algorithm. An ideal algorithm is one that detects only ROIs corresponding to true objects (i.e., no false alarms), providing the downstream recognition algorithm with the maximum chance of success.

* This work was partially supported by the Defense Advanced Research Projects Agency NeoVision2 program (contract No. HR0011-10-C-0033). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressly or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

Neuromorphic attention algorithms, such as feature- or object-based saliency [3-7], can be used to find and extract regions of interest from video imagery. These algorithms process a scene and detect anomalies in its structure, such as sharp contrasts in color or intensity, strange geometries (such as a vertical element in a horizontally-dominated scene), or parts of the scene that appear to change with time (moving objects or things that appear to flicker), and return a result in the form of a “saliency map”, which indicates how interesting or distinct a given region of the scene is. Feature-based saliency algorithms process the scene pixel-by-pixel and find specific small regions that “stand out” against the rest of the scene. Examples of this type of attention model are the NVT algorithm [3] and algorithms based on the Quaternion Fourier Transform [4] or the spectral residual of the image [5]. This model of attention has often been described as a spotlight that focuses attention on a specific part of the scene without any concept of what it is actually highlighting. Typically, the spotlight is set to some predetermined size that is larger than the expected object size, and the entire region is excised for further analysis. An alternative to the feature-based saliency algorithm is the object-based approach, which attempts to extract entire objects from the scene based on continuous expanses of salient features. Like the feature-based approach, these algorithms process an image and extract regions that stand out from the rest of the scene. However, instead of acting like a spotlight, these algorithms employ the feature maps as a means to determine the object boundary. Consequently, this approach is able to segment complete objects from the scene. Examples of object-based saliency algorithms are the work of Orabona et al. [6] and Huber and Khosla [7].

It has been previously shown that employing an attention algorithm as a front-end to a recognition system can dramatically improve object recognition results, both through increased correct detections and lower false alarms [8-10], when the camera is stationary. In this instance, an attention algorithm is applied to the frames in a video sequence and regions of interest (ROI) are extracted based on their saliency, which are used as cues and fed into the object recognition algorithm. By combining a biologically-inspired attention algorithm, which can detect both moving and stationary objects, with a biologically-inspired recognition algorithm, one can form a powerful visual recognition engine without going through the traditional detect-and-track paradigm. This permits the detection and recognition of both moving and stationary objects at higher speed than with traditional approaches. However, current attention algorithms are only effective in stationary scenes; saliency maps obtained from a moving platform, as is the case with aerial videos, often contain a great deal of noise and produce a large number of “false alarms” corresponding to background features that do not correspond to objects in the scene. These errors are likely due to the egomotion of the camera conflicting with the motion detection of the saliency algorithm.
Our analysis shows that these algorithms cannot differentiate between camera motion and object motion in the scene. This is a severe limitation in the application of saliency as a front-end for object recognition systems, since much surveillance video is obtained from moving aerial platforms. In light of
the improvement in the results in [8], it is critical that a method of computing saliency on moving platforms be developed. In this paper we describe an architecture that performs object recognition in videos from a moving platform, and can detect both moving and stationary objects, by using bio-inspired attention and recognition algorithms. We preprocess the aerial videos with video stabilization, which allows the images of the ground objects to be easily detected as salient points by the attention algorithm without suffering from motion-induced clutter. We extract an image chip (i.e., ROI), which can be a fixed-size bounding box or a tight-fitting object shape computed using the same features [10], and apply a bio-inspired object recognition algorithm. We demonstrate that this architecture significantly improves performance in terms of the recognition rate/false alarm metric, as validated on the VIVID aerial video dataset.
2 Method

For this work, we employ a three-stage approach to object recognition, which is discussed in detail in this section. First, we apply a video stabilization function, which finds the spatial transformation that can be used to warp video images in neighboring frames into a common coordinate system and eliminate the apparent motion due to sensor platform movement. Next, we apply a neuromorphic attention algorithm to the stabilized video images and produce a set of locations in the images that are highly likely to contain objects of interest. The bio-inspired feature extraction function takes a small image chip (i.e., the ROI) around each salient point and extracts high-dimensional feature vectors based on models of the human visual cortex. These features are used by the classification engine, which employs an algorithm such as a Support Vector Machine (SVM) to either classify the features into an object class or reject the image chip.

2.1 Video Stabilization

The purpose of video stabilization is to compensate for the motion in the video images caused by the motion of the camera and/or its platform. Our method of image stabilization consists of four steps: feature detection, matching, image transformation estimation, and image warping. We use the Scale Invariant Feature Transform (SIFT) as the feature descriptor, which is invariant to scale, orientation, and affine distortions, to extract key points from the image. Key points are defined as maxima and minima of the result of a difference-of-Gaussians function applied in scale-space to a series of smoothed and re-sampled images. Dominant orientations are assigned to localized key points. SIFT feature descriptors are 128-dimensional vectors representing gradient orientation histograms and can be used to compare whether two image key points are similar (i.e., whether they come from the same point in the scene). Feature matching compares the two sets of SIFT features and matches key points from one image to the other that have similar SIFT descriptors. This results in a candidate set of matching points from the two images, to be filtered in the next step. A match for a key point in one image is defined as the key point in the other image with the minimum Euclidean distance based on the descriptor vectors of the key points.
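A minimal OpenCV sketch of the four stabilization steps (SIFT detection, descriptor matching, RANSAC homography estimation, and warping in blocked mode) is shown below; the ratio-test threshold and the RANSAC reprojection threshold are illustrative choices, not the values used by the authors.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def homography_to_ref(frame, ref):
    """Estimate the homography warping `frame` into the coordinate system of `ref`."""
    g1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    g0 = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY)
    kp1, des1 = sift.detectAndCompute(g1, None)
    kp0, des0 = sift.detectAndCompute(g0, None)
    # Nearest-neighbour matching with Lowe's ratio test to prune ambiguous matches.
    matches = [m for m, n in matcher.knnMatch(des1, des0, k=2)
               if m.distance < 0.75 * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp0[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC rejects outlier correspondences while estimating the homography.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def stabilize_block(frames):
    """'Blocked' mode: warp every frame of a block onto the block's first frame."""
    ref = frames[0]
    h, w = ref.shape[:2]
    out = [ref]
    for f in frames[1:]:
        H = homography_to_ref(f, ref)
        out.append(cv2.warpPerspective(f, H, (w, h)))
    return out
```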
The list of matching points obtained this way is not very reliable, in that incorrect matches can happen due to noise and the limited ability of the SIFT descriptor to distinguish certain types of key points. To achieve more reliable matching, we apply RANSAC (Random Sample Consensus), an iterative method for estimating the parameters of a mathematical model from a set of observed data that contains outliers. We use RANSAC to find a homography transform (model) that fits the candidate set of matches. As a result we get a set of correct matches as well as an accurate transformation (homography) between the two images.

The final step in video stabilization is to warp the image frames into a global coordinate frame so that the warped images show no platform-induced image motion. In a “blocked” mode of operation, we choose a block size of N frames in which each frame is warped to the first frame in the block using the homography transformation found as described above (e.g., frames 1, …, N are warped into the coordinate system of frame 1; frames N+1, …, 2N are warped into frame N+1, and so forth). This way, the images within each block are stabilized with respect to the first frame of the block, while the images between blocks are not stabilized. Alternatively, in a “non-blocked” mode of operation, we warp the previous image frame for every new input frame (the current frame) so that the pair of current and previous images are always registered for the attention algorithm. Both approaches allow camera motion without having to maintain a large global image frame buffer. In our experiments, we produced the stabilized versions of our input aerial videos in blocked mode with a block size of 10. The block size should be determined by the video frame rate and the platform speed and altitude. Our videos were taken at 30 fps (altitude = 800–1200 meters; speed = 40–70 meters/sec). If the scene does not change much, one can use larger block sizes. Otherwise, the block size should be smaller to ensure proper overlap among the images in the same block.

2.2 Neuromorphic Attention for Object Detection

Following video stabilization, we apply a bio-inspired visual attention algorithm similar to [7] to detect locations in the video images that are likely to contain objects of interest. While the literature is rich with different methods (e.g., [3-7]), most saliency algorithms work in the same basic manner: accepting two consecutive frames as input at any given time and outputting a saliency map, which indicates how “interesting” a given location in the frame is relative to its surroundings. This is done by assigning a score to each spatial location in the image that measures its variance from the rest of the image. Saliency algorithms generally contain one module for static object detection and another for finding moving objects.

For static detection, the image data for the current frame is decomposed into channels that correspond to color and intensity; red, blue, green, yellow, and luminance are commonly used, which are processed as opposing pairs with a positive “center” receptive field of one color and a negative “surround” receptive field of its opponent color. This center-surround color opponency mimics the processing of the mammalian visual system and allows the system to find strong color opposition in the scene.
Color opponency maps are computed for the red/green and blue/yellow pairings by performing the convolution of each color channel with a narrow-band “center” Gaussian kernel and a wide-band “surround” Gaussian kernel. Each surround result is subtracted from its appropriate center result for each color pairing, providing
four color opponency maps: redC/greenS, greenC/redS, blueC/yellowS, and yellowC/blueS. Similarly, center-surround maps for orientation and intensity are computed by convolving the luminance channel with narrow- and wide-band Gabor and Gaussian filters, respectively. The orientation channel detects geometrical anomalies, such as a single horizontal element in a primarily vertical field, while the intensity channel picks up spots of dark or light against an opposing backdrop. Because these features employ a single frame, the motion of the platform is likely to have little effect on their results.

Motion processing in a saliency algorithm is carried out by computing the difference between the intensity channels of two consecutive frames offset in five directions (up, down, left, right, and in-place, or zero-offset). These channels detect change between the two frames and pick up on motion or, in the case of the zero-offset channel, what appears to be flickering of light in the scene. Because these channels use a pair of frames for processing, scenes from a moving platform can cause them to provide spurious or false results, due to the algorithm confusing stationary features that appear to move with actual moving objects.

A saliency map is constructed from the weighted contributions of the four color opponency maps, the intensity map, the four orientation maps, and the five motion maps by a sequence of addition and normalization of maps that correspond to common features. For object recognition, we extract the peaks from the saliency map that the algorithm returns, obtaining a list of locations in the image. In theory, these are the regions that the human eye is likely to attend to and that correspond to objects of interest. The peak threshold is set sufficiently low that all possible targets are detected (i.e., no false negatives). We seed the visual recognition engine with the image chips or ROIs (128x128 regions extracted from the image) that are centered at these peaks.

2.3 Biologically-Inspired Visual Recognition

HMAX (or CBCL) is a feed-forward model of the mammalian visual cortex [1, 2] that has been validated to perform similarly to humans in fast object recognition tasks. At the heart of this model is a hierarchy of alternating layers of filters simulating simple and complex cells in the mammalian visual cortex. The simple cells perform template matching, while the complex cells perform max-pooling and subsampling, which achieves local invariance to shift. As the algorithm moves to the higher layers, the features become more invariant, with a wider receptive field. At the top layer, this model outputs a vector of high-dimensional features, typically ranging in size from hundreds to a few thousand elements, that can be used to classify the input image presented to the bottom layer. In our experiments, we used a model similar to that described in Mutch and Lowe [11], but with a base image size of 128x128 and an 8-layer image pyramid. 200 random C1 patches were used, sampled from a set of training images of scenes similar to our aerial video images. This results in a feature vector of 200 dimensions for each 128x128 input image. To complete the HMAX/CBCL-based visual recognition engine, a set of labeled training images that includes both objects of interest and background clutter is presented to the HMAX/CBCL model and the resulting feature vectors are used to
train a Support Vector Machine (SVM) classifier. Once trained, the SVM classifier can be operated on-line in the system to provide image classification (such as vehicle, bike, pedestrian or background) with a confidence value. We employ the SVM classifier as a convenience; it has also been proven to perform well for a variety of classification tasks. However, any multi-class classification method that can handle high-dimensional features would be sufficient.
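A minimal scikit-learn sketch of this final stage is shown below; hmax_features is a hypothetical stand-in for the HMAX/CBCL front end that maps a 128x128 chip to a 200-dimensional vector, and the RBF kernel and rejection threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_chip_classifier(train_chips, train_labels, hmax_features):
    """Train a multi-class SVM on HMAX-style features of labeled image chips.
    Labels include object classes (e.g., 'vehicle', 'bike', 'pedestrian')
    and a 'background' class for clutter chips."""
    X = np.vstack([hmax_features(chip) for chip in train_chips])
    clf = SVC(kernel="rbf", probability=True)   # probability=True -> confidence values
    clf.fit(X, train_labels)
    return clf

def classify_roi(clf, chip, hmax_features, reject_threshold=0.5):
    """Classify a salient ROI; reject it as background if the confidence is low."""
    probs = clf.predict_proba(hmax_features(chip).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    label = clf.classes_[best]
    if probs[best] < reject_threshold or label == "background":
        return None, probs[best]
    return label, probs[best]
```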
3 Results and Discussion

We validated the stabilization-saliency methodology that we present here using a combination of CPU/GPU implementations of the three modules discussed in Section 2. The algorithms were applied to the DARPA VIVID dataset, which consists of a series of color videos taken from a moving aerial platform (Figure 1). There are a number of object types present in these videos, including vehicles (cars and trucks), motorcycles, and pedestrians. In each video, potential objects can be in plain view or partially occluded; in most cases the objects are moving. For our experiments, we first ran the baseline system, which applies saliency without video stabilization. We trained the HMAX/CBCL model and the SVM classifier using sample object images from a set of 6 training videos, each containing between 1800 and 1900 frames, and tested on a different set of 6 videos than those used in training. We then retested the system with the same test data after stabilizing the videos in blocked mode with block size N=10.
Fig. 1. Sample images from DARPA VIVID Dataset. This dataset contains color videos of various objects such as vehicles, motorcycles, and pedestrians at 640x480 resolution.
Our first objective was to determine the specific reasons that the saliency algorithm performs poorly on videos from a moving platform. We ran the saliency algorithm on the unstabilized VIVID videos and saw a significant drop in object detection performance compared with what we would have expected if the video had been shot from a stationary camera. Figure 2, curve (a), shows the receiver operating characteristic (ROC) curve for this trial, and illustrates the probability of object detection (Pd) as a function of false positives per image (FPPI). Here Pd is defined as the ratio of the number of salient chips (Section 2.2) having non-zero intersections with the target
bounding boxes to the number of ground truth targets (regardless of class) in each image, averaged over all images in the test sequences. False positives are those salient chips that do not intersect any target bounding box. FPPI, an average over all image frames, is used instead of the traditional false positive rate (FAR) because FPPI directly translates to the number of false positives per unit time given the video frame rate, which is a preferred measure of false alarms for image analysts.
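Under one literal reading of these definitions, the two quantities can be computed as in the following sketch; the box format and the treatment of frames without ground-truth targets are assumptions made for illustration.

```python
import numpy as np

def boxes_intersect(a, b):
    """Axis-aligned boxes as (x1, y1, x2, y2); True if they overlap at all."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def pd_fppi(per_frame_chips, per_frame_truth):
    """Pd: salient chips intersecting at least one ground-truth box, divided by the
    number of ground-truth targets in that frame, averaged over frames.
    FPPI: salient chips intersecting no ground-truth box, averaged per frame."""
    pds, fps = [], []
    for chips, truths in zip(per_frame_chips, per_frame_truth):
        hits = sum(any(boxes_intersect(c, t) for t in truths) for c in chips)
        fps.append(len(chips) - hits)
        if truths:  # frames without targets contribute only to FPPI
            pds.append(hits / len(truths))
    return float(np.mean(pds)) if pds else 0.0, float(np.mean(fps))
```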
Fig. 2. Object detection performance based on saliency with and without motion and flicker channels for sample videos from VIVID data set. (a) The saliency algorithm performs poorly on unstabilized videos when motion and flicker channels are used by the saliency algorithm. (b) When flicker and motion channels are not used, the performance of saliency is restored to certain extent. (c) When the video is stabilized, the full saliency algorithm achieves the best performance. (d) When motion channels are not used, saliency performance on stabilized videos is similar to that on unstabilized videos. The horizontal axis indicates the false positive per image (FPPI) (see text for explanation).
Suspecting that the algorithm was picking up extraneous saliency signatures from the egomotion of the camera (i.e., frame to frame motion due to camera motion boosted certain image features to have unusually high saliency scores), we ran the trial again with the motion channels disabled and saw a significant increase in performance (Figure 2, curve (b)), though not as good as the full saliency algorithm from a stationary camera. This clearly shows that the motion channels are rendered impotent by the image motion due to platform movement, and the overall detection results suffer as a consequence of false alarms that effectively swamp the other feature maps (e.g., intensity, color, orientation). This is likely due to the way that the saliency algorithm processes motion. By differencing the intensity maps of consecutive frames, the
saliency algorithm detects motion as changes in the intensity patterns of the image frames in various directions over time. However, this method only works locally and does not notice the bulk, consistent motion of all objects in a frame caused by a moving camera. Therefore, the saliency algorithm cannot differentiate between a moving object viewed by a stationary camera and a stationary object viewed by a moving camera, because all it sees are blobs that appear to move within the frame. By removing the motion channels from the saliency calculation, we eliminate a major source of noise, which provides the observed marginal improvement in the probability of detection. From this preliminary analysis, we infer that the moving platform ultimately causes the loss of effectiveness of the motion channels in the saliency algorithm. Since the motion processing in a saliency algorithm works on pairs of consecutive images, these images should be stabilized with respect to one another before saliency is computed; an image stabilization method that makes the scene appear stationary to the saliency algorithm is a likely solution to this problem. We applied the stabilization method described in Section 2.1 to the same VIVID videos and repeated the trials for the saliency algorithm with and without motion channels. These results are displayed as curves (c) and (d) in Figure 2. The benefit of stabilizing the video is immediately apparent; the stabilized result clearly outperforms its unstabilized analogue. What is interesting, however, is how closely the results for saliency on the stabilized and the unstabilized videos correlate with one another when the motion components are not used. This indicates that the static components of the saliency algorithm behave nearly identically in both cases, and it validates the hypothesis that the motion channels suffer when the video frames are not stabilized, which degrades the system performance.
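As a concrete illustration of the kind of frame-to-frame registration the argument calls for, one could align consecutive frames with a robust feature-based affine model before the saliency step. The OpenCV sketch below is a generic stabilizer, not necessarily the block-mode method of Section 2.1, and it omits error handling.

```python
import cv2
import numpy as np

def stabilize_to_previous(prev_gray, curr_gray):
    """Warp curr_gray so that the dominant (camera-induced) motion w.r.t. prev_gray is removed."""
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                       qualityLevel=0.01, minDistance=8)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good = status.ravel() == 1
    # Robustly estimate the global motion and invert it by warping toward the previous frame.
    M, _ = cv2.estimateAffinePartial2D(pts_curr[good], pts_prev[good], method=cv2.RANSAC)
    h, w = curr_gray.shape[:2]
    return cv2.warpAffine(curr_gray, M, (w, h))
```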
Fig. 3. Comparison of salient points using motion unstabilized (left) and stabilized (right) videos. The moving camera picks up on spurious high contrast areas on the ground (left), which disappear (right) when the video is stabilized prior to saliency processing.
Figure 3 shows the ROI provided by the saliency algorithm in unstabilized (left) and stabilized (right) input videos. All regions that exceed a given threshold in each image are defined as ROIs and denoted by a box. As can be seen, the most salient objects in the stabilized scene all correspond to vehicles, whereas the ROIs in the unstabilized video are more dispersed due to the platform motion. This validates our
hypothesis that, when the camera is moving, the saliency algorithm is swamped by spurious motion signals that prevent actual moving targets from being detected. In this case, patches of light-on-dark contrast on the ground appear to move in the unstabilized imagery and produce a stronger saliency signal than the moving vehicles in the scene (due to higher overall contrast). However, when the scene is stabilized prior to applying the saliency algorithm, these patches no longer appear to move and saliency is able to detect the vehicles. To quantify how the improved target detection translates into final object recognition performance, we next ran the classifier on the salient chips provided for the stabilized and unstabilized VIVID videos and summarized the results as ROC curves (Figure 4). Here the SVM classifier was trained on 3 target classes (vehicle, bike and pedestrian) plus the background class using samples from the 6 training sequences, and applied to the salient chips from the test sequences.
Fig. 4. Performance of the HMAX/CBCL-SVM based object recognition system with and without video stabilization. (a) System with unstabilized video based on ROIs provided by full saliency; (b) Stabilizing the video greatly improves the recognition system performance; (c) even when flicker and motion channels are not used by the saliency algorithm, video stabilization can still boost overall system performance. The horizontal axis measures the false positives per image (FPPI) (see the text accompanying Figure 2 for an explanation).
As can be seen from Figure 4, the system with video stabilization performs much better than it does without video stabilization (the performance is better if the ROC is towards the top and left, meaning higher recognition rate and lower false alarms). This shows that the better detection performance shown in Figure 2 translates to performance benefits in object recognition of the overall system.
4 Conclusion

The application of a saliency algorithm as a front end to an object recognition system can improve overall system performance. However, this advantage is greatly compromised when the camera used to capture the video is attached to a moving platform, due to image motion caused by platform movement. In fact, the motion processing portion of the saliency algorithm is then not only wasted, but also harmful to system performance. We have shown in this paper that employing an image stabilization process prior to the application of the saliency algorithm can restore the effectiveness of the motion channels of the saliency algorithm and achieve a significant improvement in performance for object detection and recognition. Furthermore, as a practical guideline, when video stabilization is unavailable or infeasible to implement, the saliency algorithm works better if its motion channels are disabled.
Retrieval of 3D Polygonal Objects Based on Multiresolution Signatures

Roberto Lam and J.M. Hans du Buf

Institute for Systems and Robotics (ISR), Vision Laboratory - University of the Algarve (ISE and FCT), 8005-139 Faro, Portugal
Abstract. In this paper we present a method for retrieving 3D polygonal objects by using two sets of multiresolution signatures. Both sets are based on the progressive elimination of an object's details by iterative processing of the 3D meshes. The first set, with five parameters, is based on mesh smoothing. This mainly affects an object's surface. The second set, with three parameters, is based on difference volumes after successive mesh erosions and dilations. Characteristic feature vectors are constructed by combining the features at three mesh resolutions of each object. In addition to being invariant to mesh resolution, the feature vectors are invariant to translation, rotation and size of the objects. The method was tested on a set of 40 complex objects with mesh resolutions different from those used in constructing the feature vectors. By using all eight features, the average ranking rate obtained was 1.075: 37 objects were ranked first and only 3 objects were ranked second. Additional tests were carried out to determine the significance of individual features and all combinations. The same ranking rate of 1.075 can be obtained by using some combinations of only three features.
1 Introduction and Related Work
The increasing availability of 3D models due to technological developments allows us to use increasingly complex illustrations. Three-dimensional digital scanners produce 3D models of real objects. CAD software can also produce 3D models, from complex pieces of machinery with lots of corners and edges to smooth sculptures. Very complex protein structures play an important role in pharmacology and related medical areas. The World Wide Web allows 3D models to be incorporated into websites and home pages. As a consequence of this trend, there is a strong interest in methods for recognition and retrieval of 3D objects [1,2]. Object recognition (matching) may be very time consuming because of all the variations that may occur: different position (object origin), rotation, size and also mesh resolution. Similarity analysis does not require precise shape comparisons, neither global nor local. Normally, this approach is based on computing a set of features or a feature vector FV of a query object and comparing it with the FVs of all objects in a database.
The FVs can be obtained by a variety of methods, from very simple ones (bounding box, area-volume ratio, eccentricity) to very complex ones (curvature distribution of sliced volume, spherical harmonics, 3D Fourier coefficients) [3,4,5]. The intrinsic nature of the objects may pose some constraints, and some methods may be more suitable, and faster, for the extraction of FVs than others. For example, methods based on spherical harmonics and 3D Fourier coefficients are not suitable for concave (non-star-shaped) objects, whereas other methods have problems with open (non-closed) objects. Some limitations can be solved by combining two or more methods. However, since many objects (mathematically, possibly an infinite number) can yield very similar FVs when only one method is applied, several methods are normally combined to achieve the best results. We mention the approach of [6], which is related to our own approach: they projected a 3D object onto 2D curvature maps. This is preceded by smoothing and simplification of the polygonal mesh, and final retrieval is based on comparing the 2D curvature maps. The theory of mathematical morphology (MM) arose in the middle of the 1960s [7,8]. Developed for geometric analyses of shapes and textures, it became increasingly important in 2D image processing and computer vision. Despite all theoretical developments and generalization to 3D, most MM work is still being applied to 2D image processing [8]. The work done in 3D is rather scarce and mostly limited to three-dimensional surfaces. Jackway [9] developed an approach for the recognition of 3D objects in range data through the matching of local surfaces. Lee et al. [10] analyzed the composition of 3D particle aggregates by processing one hemisphere of the particles. In this paper we also apply MM to recognition of 3D polygonal objects, but in combination with another method, i.e., mesh smoothing. The rest of this paper is organized as follows: Section 2 presents the proposed methods and Section 3 the experimental results. We conclude with a discussion in Section 4.
2 Overview of Our Approach
We use 40 objects of the AIM@SHAPE database [11]. Each one is represented by four different mesh resolutions. The models were downloaded in PLY format and they are 2-manifold, "watertight" (closed, without gaps and with regular meshes). Figure 1 shows some models and Table 1 lists all the objects and their mesh resolutions. The first three resolutions are used for creating the characteristic FV and the last resolution is used for testing in similarity search. In order to obtain invariance to scale (size) and translation, the models were normalized to the unit sphere after the object's origin was moved to the center of the sphere. Rotation invariance is achieved by the fact that our FV is global to the model, as proven in [12]. Invariance to mesh resolution is obtained by proper feature normalization, which is explained below. We apply two different methods which complement each other. Mesh smoothing affects the object's area (Section 2.1) and the dilation-erosion method affects the object's volume (Section 2.2).
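A minimal sketch of this normalization step, assuming the mesh vertices are available as an N×3 NumPy array; using the centroid as the object's origin is an assumption consistent with the description above.

```python
import numpy as np

def normalize_to_unit_sphere(vertices):
    """Move the object's origin (centroid) to the sphere center and scale to radius 1."""
    v = np.asarray(vertices, dtype=float)
    v = v - v.mean(axis=0)                    # translation invariance
    radius = np.linalg.norm(v, axis=1).max()  # farthest vertex defines the scale
    return v / radius                         # size invariance
```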
Fig. 1. Examples of models. From left to right: Elk, Mouse, DancingChildren, Dragon, Egea and RollingStage with increasing model resolutions.

Table 1. All 40 models with their mesh resolutions; the first three are used in resolution-invariant feature extraction, the last one is used in similarity search

N   Model            Resolutions          N   Model            Resolutions
1   Amphora          6.5; 7.5; 9.5; 8.0   21  Fish             6.0; 7.5; 9.9; 8.0
2   Bimba            6.0; 8.5; 9.5; 8.0   22  FishA            6.0; 7.5; 9.9; 7.0
3   Blade            6.0; 7.5; 9.9; 8.0   23  Grayloc          6.0; 7.5; 9.9; 7.8
4   Block            5.0; 6.5; 8.0; 8.5   24  GreekSculpture   6.5; 7.0; 7.7; 8.5
5   Bunny            6.5; 7.5; 9.9; 8.0   25  Horse            6.0; 7.5; 9.9; 8.0
6   CamelA           6.0; 7.5; 9.9; 7.8   26  IsidoreHorse     6.0; 7.5; 9.9; 7.0
7   Carter           6.0; 7.5; 9.9; 7.3   27  Kitten           6.0; 7.5; 9.9; 7.3
8   Chair            6.0; 7.5; 9.9; 6.9   28  Liondog          6.0; 7.5; 9.9; 8.0
9   Cow2             6.0; 7.5; 9.9; 8.9   29  Maneki           6.0; 8.8; 9.8; 7.5
10  Cow              6.0; 6.4; 9.9; 7.1   30  Moai             6.5; 8.5; 9.5; 9.7
11  Dancer           6.0; 7.5; 9.9; 7.7   31  Mouse            6.0; 7.5; 9.9; 7.8
12  DancingChildren  6.0; 7.5; 9.9; 6.8   32  Neptune          6.0; 7.5; 9.9; 7.6
13  Dente            6.0; 7.5; 9.9; 7.0   33  Pulley           6.0; 7.5; 9.9; 7.0
14  Dilo             6.0; 8.5; 9.6; 7.7   34  Ramesses         6.0; 7.5; 9.9; 8.0
15  Dino             6.0; 8.3; 9.7; 7.7   35  Rocker           6.0; 7.5; 9.9; 7.1
16  Dragon           6.0; 8.0; 9.5; 7.7   36  RStage           6.0; 7.0; 9.0; 9.5
17  Duck             6.0; 7.5; 9.9; 6.7   37  Screwdriver      6.0; 7.5; 9.9; 7.0
18  Egea             7.4; 7.9; 9.5; 8.7   38  Squirrel         6.0; 7.5; 9.9; 7.2
19  Elk              6.0; 7.5; 9.9; 7.9   39  Torso            6.0; 7.5; 9.9; 7.7
20  Eros             6.0; 7.5; 9.9; 6.5   40  Vaselion         6.0; 7.5; 9.9; 7.5

2.1 Mesh Smoothing
Mesh smoothing is usually used to reduce noise. The authors of [13] smoothed principal components for shape classification in 2D. In our work the main aim is related to iterative and adaptive (nonlinear) mesh smoothing in 3D. Smoothing in quasi-planar regions but not at sharp edges was used in [14] for reducing the number of vertices. Here we simply apply the linear version, which smooths the mesh at all vertices. It starts by eliminating very sharp object details, like protruding dents and bumps; after more iterations, fewer details remain. The sum of the displacements of all vertices, combined with the contraction ratio of the surface area, generates a quadratic function that characterizes the model quite well.
If $V_i$, with $i = 1, \ldots, N$, is the object's vertex list with associated coordinates $(x_i, y_i, z_i)$, the triangle list $T(V)$ can be used to determine the vertices at a distance of one, i.e., all direct neighbor vertices connected to $V_i$ by only one triangle edge. If the $n$ neighbor vertices of $V_i$ are $V_{i,j}$, with $j = 1, \ldots, n$, the centroid of the neighborhood is obtained by $\bar{V}_i = (1/n)\sum_{j=1}^{n} V_{i,j}$. Each vertex $V_i$ is moved to $\bar{V}_i$, with displacement $\bar{D}_i = \|V_i - \bar{V}_i\|$. Figure 2 shows a model and the influence of mesh smoothing. The total displacement is $D = \sum_{i=1}^{N} \bar{D}_i$. The entire procedure is repeated 10 times, because we are mainly interested in the deformation of the object at the start, when there still are many object details, and more iterations do not add useful information anymore. Hence, displacements are accumulated by $A_l = \sum_{m=1}^{l} D_m$ with $l = 1, \ldots, 10$. In order to obtain invariance to mesh size, in each iteration $m$ the displacement $D_m$ is corrected using

$$D_m := D_m \cdot \frac{NP_m \cdot N}{A_{10} \cdot S_m}, \qquad (1)$$

with $N$ the total number of vertices, $NP_m$ the number of participating vertices (in non-planar regions which contributed to the displacement), $S_m$ the surface of the object (sum of all triangles) after each smoothing step, and $A_{10}$ the final, maximum accumulated displacement after all 10 iterations. Then the curve of each object and each mesh resolution is further normalized by the total contraction ratio $C = S_{10}/S_0$ (final surface over original surface), and the three curves (10 data points each) are averaged over the three mesh resolutions. In the last step, the averaged $A_l$ is least-squares approximated by a quadratic polynomial in order to reduce the 10 parameters to 3. Figure 3 shows representative examples of the curves $A_l$. It should be stressed that, in contrast to the second method described below, no re-triangulation of the object's mesh is done after each iteration, i.e., the number of vertices and triangles remains the same.
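The core of this iteration can be sketched as follows; this simplified linear (umbrella) version operates on a vertex array and a triangle index list, assumes a closed manifold mesh, and omits the correction of Eq. (1) and the contraction-ratio normalization described above.

```python
import numpy as np
from collections import defaultdict

def smoothing_curve(vertices, triangles, iterations=10):
    """Return the accumulated displacements A_1..A_10 of linear neighborhood smoothing."""
    # Distance-one neighbors derived from the triangle list.
    nbrs = defaultdict(set)
    for a, b, c in triangles:
        nbrs[a].update((b, c)); nbrs[b].update((a, c)); nbrs[c].update((a, b))

    v = np.asarray(vertices, dtype=float).copy()
    accumulated, total = [], 0.0
    for _ in range(iterations):
        centroids = np.array([v[list(nbrs[i])].mean(axis=0) for i in range(len(v))])
        displacement = np.linalg.norm(v - centroids, axis=1).sum()  # D_m (uncorrected)
        v = centroids                                               # move every vertex
        total += displacement
        accumulated.append(total)                                   # A_l
    return accumulated
```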
Fig. 2. Mesh smoothing applied to IsidoreHorse model. From left to right: original and smoothed meshes after 3, 6 and 10 iterations.
2.2 Dilation and Erosion
As in the previous section (2.1) and in [15], the basic idea of this method is to characterize 3D objects by controlled elimination of detail. This is illustrated in 2D in Figure 4. The top of the figure shows a triangle and a square with the
Fig. 3. Characteristic curves after mesh smoothing of the Bimba and IsidoreHorse models
Fig. 4. Top: Erosion and dilation in 2D of equilateral triangle (left) and square (right) using a circle with radius r as structuring element. Bottom: Area β as a function of radius r of the structuring element, equilateral triangle (left) and square (right)
structuring element, a circle with radius r, on the corners of the original objects. The dilated objects are bigger (only the contours are shown) and the eroded objects (shown shaded) are smaller. The surface β between the two, as a function of the radius r, is shown at the bottom: the two curves are linear but have different slopes. This effect will be exploited below in the 3D case [16]. There are a few important issues when applying mathematical morphology to 3D objects. One is associated with the type of representation: voxel or mesh [17,18]. The voxel representation involves 3D arrays with, depending on the object's resolution, very big dimensions, although the voxels themselves are binary: object vs. background. An advantage is that many algorithms from mathematical morphology have been developed for 2D image processing, and these can easily be adapted to 3D. Polygonal meshes, on the other hand, have a more complex data structure. After applying the erosion and dilation operators, the new meshes must be determined, very close vertices can be collapsed, and self-intersecting facets must
be detected and removed. In our method we extend boundary extraction [8] from 2D to 3D. Due to the fact that we use polygonal meshes we can apply a similar solution. If $A^c = 1 \setminus A$ is the set outside $A$, then

$$\beta(A) = A^c \cap (A \oplus B) + A \cap (A \ominus B)^c \qquad (2)$$

is the sum of the expanded and shrunken volumes relative to the original volume, i.e., the difference volume. In order to limit distortions in the transformations, we use a sphere of which the radius $r$ is a function of edge length. To avoid inconsistencies between different mesh resolutions, we select $r = \hat{L}/20$, where $\hat{L}$ is an object's edge length with the maximum occurrence. This can be easily determined by filling a length histogram with 50 equal bins from $L_{min}$ to $L_{max}$ of each object. Dilations are obtained by displacing all vertices a distance $r$ (the radius) in the direction of the normal vector. Since normal vectors always point outside, this is $-r$ in the case of erosions. Both operators are applied in two successive steps. The first step is intended to obtain the volumes of the objects after an initial erosion/dilation process. Each operator is repeatedly applied until the first self-intersection occurs. In this step we do not remove any element of the mesh, neither vertex nor facet. In the second step we use the dilated (biggest) and the eroded (smallest) objects, generated in the first step, as a new starting point. The operators are repeatedly applied to the corresponding object: erosion to the smallest and dilation to the biggest object. After each erosion/dilation, we search the mesh for vertices that have a neighbor vertex in their vicinity, i.e., in the sphere with radius $r$ centered at the vertex being processed, $V_p$. If there is a candidate vertex, $V_c$, it must be connected to $V_p$ by at most 3 edges but it may not possess a direct edge to $V_p$. This restriction must be satisfied in order to keep the mesh 2-manifold. The search for the vertices with the shortest path from $V_p$ to $V_c$ is done by using Dijkstra's algorithm. Vertices $V_p$ and $V_c$ are merged by removing all edges and vertices, which causes a gap in the mesh, and then by inserting a new vertex, $V_f$, with coordinates equal to the average of the removed vertices. In the last step $V_f$ is connected to the vertices forming the gap; see Fig. 5.
Fig. 5. Merging neighboring vertices: before (left) and after (center). The triangles around vertex A will self-intersect during erosions, and those around B during dilations (right).
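A minimal sketch of the basic offset step (moving every vertex by ±r along its normal); per-vertex normals are approximated by averaging incident face normals under the assumption of consistent outward winding, and the self-intersection handling described above is deliberately omitted.

```python
import numpy as np

def vertex_normals(vertices, triangles):
    """Area-weighted average of incident face normals, normalized per vertex."""
    v, t = np.asarray(vertices, float), np.asarray(triangles, int)
    face_n = np.cross(v[t[:, 1]] - v[t[:, 0]], v[t[:, 2]] - v[t[:, 0]])  # length ~ 2*area
    n = np.zeros_like(v)
    for k in range(3):
        np.add.at(n, t[:, k], face_n)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def offset_mesh(vertices, triangles, r):
    """r > 0 dilates, r < 0 erodes (normals point outward on a closed mesh)."""
    return vertices + r * vertex_normals(vertices, triangles)
```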
The elimination of self-intersecting facets is also necessary in situations where the nearest vertex is outside the vicinity sphere, the structuring element. The right side of Fig. 5 shows two situations which both lead to a self-intersection. Elimination is done using the TransforMesh Library [19], without introducing any additional deformation. The application of a sphere as a structuring element to all vertices yields a smaller object in the case of erosion and a bigger one in the case of dilation. The Horse model, for example, will show discontinuities in the legs after repeated erosions; see Fig. 6. The small stumps and their volumes are excluded from the computation of the Horse's parameters. The same procedure is applied to the other models. According to Eq. 2, the difference volume is defined as dilated volume minus
Fig. 6. Horse model: original (left), after erosion (center) and dilation (right). Mesh resolutions of 6.0 (top) and 7.5 (bottom).
eroded volume, and this yields a linear function of the radius of the structuring element; see Fig. 7. After least-squares fitting by b0 + b1·r, the slope coefficient b1 reflects the complexity of the surface of the object. The coefficient b0 also reflects the complexity, but with emphasis on the capacity of the object to be eroded and dilated without self-intersections, i.e., the first step of the two-step process as described above.

2.3 Characteristic Signatures
The 40 models listed in Table 1 are used, each with four mesh resolutions. As explained before, the first three mesh resolutions are used for constructing the FV of a model, and the last one is used for testing. Each model is characterized by 8 parameters, 5 from the method described in Section 2.1 (surface A of original model after normalization to unit sphere; contraction ratio C after 10 iterations; 3 coefficients, a0 , a1 and a2 of the quadratic approximation of the smoothing curves); and 3 from Section 2.2 (volume V of original model after normalization
Fig. 7. Dilation-erosion function of Horse model (resolution 7.5) as function of radius
Fig. 8. Characteristic functions: mesh-smoothing function of the DancingChildren model (left) and dilation-erosion function of the Horse model
to the unit sphere; linear coefficients b0 and b1 of the approximated difference volume between the dilated and eroded surfaces after 10 iterations). The ten iterations used in both methods were chosen so that the representative functions still fit the models well. Figure 8 shows typical mesh-smoothing and dilation-erosion functions.
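Putting the two methods together, the eight-parameter feature vector could be assembled as sketched below; the quadratic and linear fits use NumPy's polyfit, the variable names mirror the parameters listed above, and this is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def build_feature_vector(A, C, smoothing_curve, V, radii, diff_volumes):
    """smoothing_curve: averaged A_l values (10 points); diff_volumes: beta(r) samples."""
    l = np.arange(1, len(smoothing_curve) + 1)
    a2, a1, a0 = np.polyfit(l, smoothing_curve, 2)   # quadratic approximation of A_l
    b1, b0 = np.polyfit(radii, diff_volumes, 1)      # linear dilation-erosion fit
    return np.array([A, C, a0, a1, a2, V, b0, b1])   # the 8-parameter signature
```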
3 Results
The FVs of the objects' test resolutions were compared with the FVs of the database, which were constructed by combining the three training resolutions. The objects were ranked by using the Euclidean distance between the FVs. Table 2 lists the results, starting with the object with the smallest distance, then the object with the next smallest distance, and so forth, until the fifth object. The average ranking rate $R = (1/40)\sum_{i=1}^{40} P_i$, where $P_i$ is the ranked position of object $i$, is 1.075. This means that the majority of objects is ranked at position 1 or 2, i.e., at the first positions. Indeed, Table 2 shows that 37 objects were ranked first and only 3 second, i.e., when all eight parameters are used. Concerning the objects ranked second, CamelA (6) was ranked after Horse (25), and RStage (36) was ranked after Carter (7). These are rather similar objects, i.e., animals and mechanical pieces, but Horse and Carter were correctly
Table 2. Ranked objects using all eight parameters. Only three objects (6, 9 and 36) were ranked second.

N   Model            Ranking          N   Model            Ranking
1   Amphora          1-31-16-29-2     21  Fish             21-10-22-3-34
2   Bimba            2-13-30-27-29    22  FishA            22-10-39-21-3
3   Blade            3-22-26-21-10    23  Grayloc          23-7-36-33-4
4   Block            4-18-17-28-36    24  GreekSculpture   24-25-8-10-9
5   Bunny            5-27-13-30-1     25  Horse            25-6-24-8-9
6   CamelA           25-6-24-8-15     26  IsidoreHorse     26-3-22-21-10
7   Carter           7-23-36-33-4     27  Kitten           27-5-30-13-2
8   Chair            8-25-6-24-9      28  Liondog          28-18-17-4-40
9   Cow2             39-9-22-10-3     29  Maneki           29-13-2-27-5
10  Cow              10-21-9-39-22    30  Moai             30-27-2-13-5
11  Dancer           11-14-32-15-37   31  Mouse            31-38-19-16-1
12  DancingChildren  12-19-20-29-31   32  Neptune          32-37-15-14-6
13  Dente            13-27-5-30-2     33  Pulley           33-23-7-36-4
14  Dilo             14-15-37-11-32   34  Ramesses         34-21-10-22-24
15  Dino             15-37-6-32-25    35  Rocker           35-30-27-26-5
16  Dragon           16-38-31-19-1    36  RStage           7-36-23-33-4
17  Duck             17-28-18-40-4    37  Screwdriver      37-15-32-6-25
18  Egea             18-17-28-4-40    38  Squirrel         38-19-31-40-16
19  Elk              19-12-38-31-40   39  Torso            39-9-10-22-21
20  Eros             20-12-29-5-15    40  Vaselion         40-38-19-12-31
ranked first. On the other hand, Cow2 (9) was ranked after Torso (39), but these are quite different objects, and Torso was correctly ranked first. We performed a few additional tests in order to study the significance of individual parameters and possible parameter combinations. Table 3 shows the average ranking rates of all 40 objects when each parameter is used individually. The best parameters are V (ranking rate of 1.75), b1 (1.8), A (2.0), a1 (2.5) and b0 (3.0). The discriminative power of the other three parameters is much poorer. We then did a sequential test. We took the best individual parameter V , and combined it with each of the other seven parameters. Using the best average ranking result, the best couple of parameters was selected and then combined with each of the remaining six parameters, and so on. This is not a full parameter search with all possible combinations, but it gives an impression of the most discriminative parameters. Table 4 lists the first five results. Using more than three parameters does not improve performance, i.e., there are always three objects ranked second. On the basis of Table 3 one might expect that the couple [V, b1 ] would be best, but Table 4 shows that the couple [V, A] performs better. However, the triplet [V, A, b1 ] includes the best three from Table 3. Similarly, the best quadruplet [V, A, b1 , a1 ] includes the best four and the quintuple [V, A, b1 , a1 , b0 ] the best five. The remaining parameters did not improve performance, but the set of only 40 objects may be too small to draw final conclusions, apart from the fact that the best result obtained with all eight parameters is equal to that obtained with only three parameters.
Table 3. Average ranking rates using individual parameters

              Smoothing                      Morphology
Parameter     A     C      a0    a1    a2    V      b0    b1
Ranking rate  2.0   11.7   6.4   2.5   8.9   1.75   3.0   1.8

Table 4. Average ranking rates obtained by a sequential combination of parameters; see text

Parameters        Ranking rate
[V]               1.75
[V,A]             1.2
[V,A,b1]          1.075
[V,A,b1,a1]       1.075
[V,A,b1,a1,b0]    1.075
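The retrieval step itself reduces to a nearest-neighbor ranking in feature space. Below is a sketch of the Euclidean ranking and the average ranking rate R, under the assumption that database and query FVs are stored as NumPy arrays in the same object order.

```python
import numpy as np

def ranking_rate(db_fvs, query_fvs):
    """db_fvs, query_fvs: (40, 8) arrays; row i of both belongs to object i."""
    positions = []
    for i, q in enumerate(query_fvs):
        dists = np.linalg.norm(db_fvs - q, axis=1)             # Euclidean distances
        order = np.argsort(dists)                              # ranked object indices
        positions.append(int(np.where(order == i)[0][0]) + 1)  # P_i (1-based position)
    return float(np.mean(positions))                           # average ranking rate R
```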
Finally, in order to further validate our approach we also tested two deformed objects; see Fig. 9. Object Bimba was deformed by applying the algorithm fBM (fractal Brownian Motion, from the Meshlab package [20]) to all its vertices. Object Bunny-iH exhibits the characters i and H on its left flank; Bunny-iH is part of the AIM@SHAPE database. Both objects were correctly matched (ranked first) with the original objects.
Fig. 9. Original models (left) and deformed ones (right), Bimba and Bunny-iH
4 Conclusions and Discussion
The tested signatures—at least three of them—appear to be robust due to their global nature. In addition, small and local deformations of the object’s meshes do not introduce significant modifications of the characteristic signatures, although more types of deformations must be tested with more than two objects. In general, the dataset of 40 objects tested here is too small to compute advanced performance measures as used in the SHREC contest. However, our correct recognition rate of 37/40 = 0.925 is better than the range between 0.45 and 0.70 as achieved in the SHREC contest of 2010 [21]. Therefore, in future
work the number of objects in our database should be increased such that the significance of individual parameters and the best combinations of these can be validated. In parallel, the method should be tested on other types of objects, such as 3D meshes of complex proteins. A practical problem is that some objects are not available with different mesh resolutions, while others are not 2-manifold or "watertight" and must be pre-processed. Another problem is that the elimination of disconnected parts after erosions (Fig. 6), which has been done manually here using Meshlab, must be automated. The latter problem does not only occur for, e.g., animals with legs, but can also be expected for protein structures. Acknowledgements. This work was supported by the FCT (ISR/IST plurianual funding) through the PIDDAC Program funds.
References 1. Bustos, B., Keim, D.A., Saupe, D., Schreck, T., Vranic, D.: Feature-based similarity search in 3D object databases. ACM Computing Surveys 37, 345–387 (2005) 2. Tangelder, J.W., Veltkamp, R.C.: A survey of content based 3D shape retrieval methods. Multimedia Tools Appl. 39, 441–471 (2008) 3. Saupe, D., Vranic, D.V.: 3D model retrieval with spherical harmonics and moments. In: Radig, B., Florczyk, S. (eds.) DAGM 2001. LNCS, vol. 2191, pp. 392–397. Springer, Heidelberg (2001) 4. Pang, M.-Y., Dai, W., Wu, G., Zhang, F.: On volume distribution features based 3D model retrieval. In: Pan, Z., Cheok, D.A.D., Haller, M., Lau, R., Saito, H., Liang, R. (eds.) ICAT 2006. LNCS, vol. 4282, pp. 928–937. Springer, Heidelberg (2006) 5. Sijbers, J., Dyck, D.V.: Efficient algorithm for the computation of 3D fourier descriptors. In: Proc. Int. Symp. on 3D Data Processing Visualization and Transmission, p. 640 (2002) 6. Assfalg, J., Bimbo, A.D., Pala, P.: Content-based retrieval of 3D models through curvature maps: a CBR approach exploiting media conversion. Multimedia Tools and Applications 31, 29–50 (2006) 7. Matheron, G.: Random sets and integral geometry. John Wiley & Sons, New York (1975) 8. Serra, J.: Introduction to mathematical morphology. Comput. Vision, Graphics and Image Processing 35, 283–305 (1986) 9. Jackway, P.T.: Morphological Scale-Space with Application to Three-Dimensional Object Recognition. PhD thesis, Queensland University of Technology (Australia), Supervisor-Boles, W. W. (1995) 10. Lee, J., Smith, M., Smith, L., Midha, P.: A mathematical morphology approach to image based 3D particle shape analysis. Machine Vision and Applications 16, 282–288 (2005) 11. AIM@SHAPE (2008), http://www.aimatshape.net 12. Vranic, D.: 3D Model Retrieval. PhD thesis, University of Leipzig (2004) 13. Glendinning, R.H., Herbert, R.A.: Shape classification using smooth principal components. Pattern Recognition Letters 24(12), 2021–2030 (2003) 14. Lam, R., Loke, R., du Buf, H.: Smoothing and reduction of triangle meshes. In: Proc. 10th Portuguese Computer Graphics Meeting, pp. 97–107 (2001)
15. Lam, R., du Buf, J.M.H.: Invariant categorisation of polygonal objects using multiresolution signatures. In: Proc. KDIR, pp. 168–173 (2009) 16. Lam, R., Hans du Buf, J.M.: Using mathematical morphology for similarity search of 3D objects. In: Vitri` a, J., Sanches, J.M., Hern´ andez, M. (eds.) IbPRIA 2011. LNCS, vol. 6669, pp. 411–419. Springer, Heidelberg (2011) 17. Campbell, R., Flynn, P.: A survey of free-form object representation and recognition techniques. Computer Vision and Image Understanding 81, 166–210 (2001) 18. Shih, F.: Object representation and recognition using mathematical morphology model. Journal of Systems Integration 1, 235–256 (1991) 19. Zaharescu, A., Boyer, E., Horaud, R.: TransforMesh: A topology-adaptive meshbased approach to surface evolution. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 166–175. Springer, Heidelberg (2007) 20. Cignoni, P., Corsini, M., Ranzuglia, G.: Meshlab: an open-source 3D mesh processing system. ERCIM News, 45–46 (2008) 21. Veltkamp, R.C., Giezeman, G.J., Bast, H., Baumbach, T., Furuya, T., Giesen, J., Godil, A., Lian, Z., Ohbuchi, R., Saleem, W.: Shrec 2010 track: Large scale retrieval. In: Proc. of the Eurographics/ACM SIGGRAPH Symp. on 3D Object Retrieval, pp. 63–69 (2010)
3D Facial Feature Detection Using Iso-Geodesic Stripes and Shape-Index Based Integral Projection

James Allen, Nikhil Karkera, and Lijun Yin

State University of New York at Binghamton
Abstract. Research on 3D face models relies on extraction of feature points for segmentation, registration, or recognition. Robust feature point extraction from pure geometric surface data is still a challenging issue. In this project, we attempt to automatically extract feature points from 3D range face models without texture information. Human facial surface is overall convex in shape and a majority of the feature points are contained in concave regions within this generally convex structure. These “feature-rich” regions occupy a relatively small portion of the entire face surface area. We propose a novel approach that looks for features only in regions with a high density of concave points and ignores all convex regions. We apply an iso-geodesic stripe approach to limit the search region, and apply the shape-index integral projection to locate the features of interest. Finally, eight individual features (i.e., inner corners of eye, outer corners of eye, nose sides, and outer lip corners) are detected on 3D range models. The algorithm is evaluated on publicly available 3D databases and achieved over 90% accuracy on average.
1 Introduction

Research in areas such as face recognition, expression analysis, emotional computing and other related areas is now increasingly focusing on using 3D models as a source of input. Such a representation has the benefit of overcoming issues arising from pose and lighting variations, from which 2D modalities inherently suffer [1][2]. 3D models show promise in characterizing the facial surface at a detailed level. Dynamic model sequences can also provide precise spatio-temporal information in the 3D space. However, such data (e.g., 3D scans) obtained by range systems is in a raw format, which is "blind" without information about facial structures. Consequently, information on functional structures for animation and recognition is completely lacking. Moreover, there is no existing point-to-point correspondence between the vertices of different scan models. Each capture generates a different number of vertices, which adds to the complexity of tracking 3D facial features (e.g., vertex correspondences) across 3D dynamic facial model sequences. In short, analyzing the original "raw" models automatically over time is a significant challenge, due to the large number of model points and the lack of correspondence across model sequences. In order to overcome these limitations, we address the issue of automatic detection of 3D feature points on geometric mesh models. To date, many researchers have applied various approaches to represent and use facial scans for face and facial expression analysis; for example, morphable models [3][4], vertex flow models [5], elastically deformable models [6], harmonic mapping
approach [7][8], and graph matching approach [9] for 3D dense data tracking and non-rigid surface registration. These methods have produced very impressive results. However, most of these approaches were based on the initialization of several initial feature points, which were provided mainly by manual selection. In this paper we present a simple yet efficient approach to detect 3D facial features on geometric mesh models. Eight features are automatically detected: the inner corners of the two eyes, the outer corners of the two eyes, the nose sides, and the lip corners. The human face is overall a convex structure, and feature-defining points are usually located within concave shapes on the face. These "feature-rich" regions occupy a relatively small portion of the entire face surface area. First of all, we segment the face model into several iso-geodesic stripes to limit the search regions of facial features. We calculate the geodesic space from the nose tip to all other vertices of the face model using Dijkstra's shortest path algorithm [12]. The geodesic space is normalized so that facial features are covered person-independently. Then, we compute the shape index of each vertex of the mesh models, and conduct a shape-index based integral projection to detect several "feature bands". Since the feature regions are limited to the iso-geodesic stripes, the intersection of the "feature bands" and the "iso-geodesic stripes" can further locate the feature points. We estimate those points by clustering the concave points within the intersection regions. Figure 1 shows the general diagram of the proposed algorithm.
Fig. 1. General diagram of feature detection on 3D geometric face models
In Section 2, the iso-geodesic strips segmentation and its unique coverage of facial features are introduced. Section 3 describes the shape-index based integral projection
approach for feature band detection. Section 4 shows how the features are estimated by a clustering approach. Experimental results on two 3D face databases are reported in Section 5, followed by a discussion and conclusion.
2 Iso-Geodesic Stripes Segmentation

A 3D facial surface can be decomposed into a set of iso-geodesic stripes. Iso-geodesic stripes are defined with reference to a Morse function, which is a smooth, real-valued function defined on the object surface [10, 15]. These stripes are obtained by measuring the geodesic distance of every surface point (i.e., every mesh vertex) to a fiducial (reference) point located on the nose tip. Iso-geodesic stripes are loci of surface points characterized by the same value of the Morse function, ranging from 0 to 1. Existing work shows that facial feature-rich areas such as eye corners, nose sides, and lip corners reside in these stripes regardless of subjects and expressions [10]. The Morse function was chosen in part because it allows us to use a global topological structure from the start and does not have to build up smaller defined topological structures [11]. Stripes with the same Morse value on different facial models cover similar facial areas and features.
Fig. 2. Iso-Geodesic Strips on a 3D model sequence
In order to compute the value of the Morse function, it is critical to select a reference point. Because the nose tip is relatively reliable in terms of expression variations, we choose it to be the reference point. We apply the approach introduced by Y. Sun et al. in [14] to estimate the pose vector of the facial model. We then rotate all the models to the frontal view. The reference point (nose tip) is determined by iterating through the facial model and finding the vertex with the greatest Z-axis value. Given the reference point, we calculate the geodesic distances for all vertices by using Dijkstra's algorithm [12]. Once these distances are calculated, the iso-geodesic stripes can be obtained. In order to make the iso-stripe description of the facial surface person-independent and expression-invariant, a normalization process is applied. The value of the Morse function at a generic point on the model surface is defined as the normalized geodesic distance of the point to the nose tip. Normalized values of the geodesic distance are obtained by dividing the geodesic distance by the Euclidean head-top-to-nose distance. This normalization guarantees invariance of Morse function values with respect to scaling of the face model. Furthermore, since the Euclidean (head-top to nose) distance is invariant to face expressions, this normalization factor does not bias values of the Morse function under expression changes.
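A compact way to compute the normalized Morse values is sketched below, using SciPy's sparse-graph Dijkstra on the edge-weighted mesh graph; the nose-tip and head-top vertex indices are assumed to be known, and the stripe quantization of the next paragraph is included for completeness.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def morse_values(vertices, triangles, nose_idx, headtop_idx):
    """Normalized geodesic distance of every vertex to the nose tip."""
    v, t = np.asarray(vertices, float), np.asarray(triangles, int)
    edges = np.vstack([t[:, [0, 1]], t[:, [1, 2]], t[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)     # one entry per undirected edge
    w = np.linalg.norm(v[edges[:, 0]] - v[edges[:, 1]], axis=1)
    n = len(v)
    graph = coo_matrix((w, (edges[:, 0], edges[:, 1])), shape=(n, n))
    geo = dijkstra(graph, directed=False, indices=nose_idx)
    norm = np.linalg.norm(v[headtop_idx] - v[nose_idx])   # Euclidean head-top-to-nose distance
    return geo / norm

def stripe_labels(morse, n_stripes=5, stripe_len=0.2):
    """Quantize Morse values into iso-geodesic stripes of length 0.2."""
    return np.minimum((morse / stripe_len).astype(int), n_stripes - 1)
```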
Once values of the Morse function are computed for every surface point, iso-geodesic stripes can be identified. For this purpose, the range of Morse function values is quantized into n intervals c1, . . . , cn. Accordingly, n level-set stripes are identified on the model surface, the i-th stripe corresponding to the set of surface points on which the value of the Morse function falls within the limits of interval ci. In this work, the length of the stripes is set to 0.2 for the best coverage of all the features of interest. In general, the iso-geodesic stripes are invariant to subjects, expressions, and their scales. Figure 2 shows an example of iso-geodesic stripes on a 3D facial expression sequence. As shown in this figure, the second stripe (pink color) always contains the nose sides. The third stripe (orange color) always contains both the right and left inner eye points and both corners of the lips. The fifth stripe (greenish color) always contains the outer corners of the two eyes.
3 Feature Band Detection - Shape-Index Based Integral Projection

3.1 Face Model Concave Features by Shape Index

Shape index is a quantitative measure of the shape of a surface at a point [12][17]. It gives a numerical value to a shape, thus making it possible to mathematically compare shapes and categorize them. For our algorithm it is used to classify a shape as concave or non-concave. The shape index is defined as follows:

$$S = \frac{1}{2} - \frac{1}{\pi} \arctan\left(\frac{k_2 + k_1}{k_2 - k_1}\right) \qquad (1)$$
where k1 and k2 are the principal (minimum and maximum) curvatures of the surface, with k2 >= k1. With this definition, all shapes can be mapped onto the range [0.0, 1.0]. Every distinct surface shape corresponds to a unique shape index value, except the planar shape. Points on a planar surface have an indeterminate shape index, since k1 = k2 = 0. The shape index is computed for each point on the model. We use a cubic polynomial fitting approach to compute the eigenvalues of the Weingarten matrix [12], resulting in the minimum and maximum curvatures (k1, k2). To visualize the shape indices on the model, we transform the shape index values in the range [0.0, 1.0] to a grey-scale map ranging from black to white. We treat a surface point as a concave point if the shape-index value is under a certain threshold (e.g., 0.6). This value can effectively eliminate convex shapes as well as shapes that are not concave enough to be considered part of a distinct feature region. As shown in the example in Figure 4 (right), most points of the model are convex-like, while the features of interest are located in the darker areas (concave points). This fact allows us to eliminate a large portion of the surface and focus on the small areas for feature detection.

3.2 Identifying Feature Regions – Feature Bands

In order to identify the eight features on the facial mesh model, we limit the search to three regions (eye, nose, and mouth) that form three feature bands. To do so, we project
the model onto the X-Y plane. Then we divide the face into a set of very thin, equal width, horizontal bands. The number of bands is fixed (e.g., 100 bands). In each band, we conduct an integral projection by counting concave points within the band. Then a projection curve is plotted. As shown in Figure 3 (a), highly concave regions of the face model are represented as high-energy regions in the curve and vice versa.
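The two ingredients of this step can be sketched as follows, assuming the principal curvatures k1 and k2 have already been estimated per vertex (e.g., by the cubic-fitting approach mentioned above); the 0.6 concavity threshold and the 100 bands follow the text, while everything else is illustrative.

```python
import numpy as np

def shape_index(k1, k2):
    """Shape index in [0, 1] from principal curvatures (Eq. 1); planar points map to 0.5."""
    k1, k2 = np.minimum(k1, k2), np.maximum(k1, k2)     # enforce k2 >= k1
    return 0.5 - (1.0 / np.pi) * np.arctan2(k2 + k1, k2 - k1)

def horizontal_projection(vertices, concave_mask, n_bands=100):
    """Count concave vertices in equal-width horizontal bands of the X-Y projection."""
    y = vertices[:, 1]
    counts, band_edges = np.histogram(y[concave_mask], bins=n_bands,
                                      range=(y.min(), y.max()))
    return counts, band_edges

# Example usage with assumed per-vertex curvature arrays k1, k2 and vertex array verts:
# si = shape_index(k1, k2)
# concave = si < 0.6
# curve, edges = horizontal_projection(verts, concave)
```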
Fig. 3. (a) Integral projection curve of concave points against (b) the corresponding face model. (c) Original curve versus (d) thresholded curve.
As one can observe from the curve, the eye, nose and mouth regions appear as three distinct high-energy peaks. Our goal is to isolate these peaks from the curve. First of all, we apply a low-pass filter to eliminate noise from the curve. The filter is designed as {0, -0.5, 0, 1, 2, 1, 0, -0.5, 0}. Then we isolate the peaks by applying a threshold and shaving off the low-value samples. The threshold is set as a percentage of the maximum energy of the curve; from experiments, 30% is sufficient for this purpose. Figure 3 (c-d) shows the curve after thresholding, with several groups of samples. Each group constitutes a section. Among those isolated sections, we choose the peak sections with the highest energy and exclude the rest. In the end, we extract three peaks from the curve corresponding to the eye, nose and mouth regions, and construct the three feature bands, as shown in Figure 4.
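The filtering and peak isolation described above could look like the following sketch; the kernel coefficients and the 30% threshold are taken from the text, while the band-curve input is assumed to be a 1-D NumPy array.

```python
import numpy as np

KERNEL = np.array([0.0, -0.5, 0.0, 1.0, 2.0, 1.0, 0.0, -0.5, 0.0])

def extract_feature_bands(projection_curve, n_peaks=3, frac=0.30):
    """Return index ranges of the n_peaks highest-energy peak sections."""
    smoothed = np.convolve(projection_curve, KERNEL, mode="same")
    above = smoothed >= frac * smoothed.max()            # shave off low-value samples
    sections, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            sections.append((start, i)); start = None
    if start is not None:
        sections.append((start, len(above)))
    # Keep the sections with the highest total energy (eye, nose, mouth).
    sections.sort(key=lambda s: smoothed[s[0]:s[1]].sum(), reverse=True)
    return sorted(sections[:n_peaks])
```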
Fig. 4. Extracted feature bands from highest energy peaks in three sections
4 Identification of Feature Points

Given the extracted feature bands and iso-geodesic stripes, we can find their intersections and limit the feature search to those intersection regions. For example, the intersection between the 2nd stripe and the nose band determines the two nose-side regions. The intersection between the 3rd stripe and the eye band determines the two regions of the inner eye corners. Similarly, the regions of the outer eye corners and the lip corners can be located from the pair (5th stripe and eye band) and the pair (3rd stripe and mouth band). To further limit the search regions and remove the influence of noise, we apply an additional operation: a vertical integral projection of each feature band. The count of concave points is performed in the vertical direction, resulting in three curves, one each for the eye band, nose band, and mouth band (see Figure 5).
Fig. 5. Integral projection histogram (curves): Top for eye band; Middle for nose band, and the bottom for mouth band
From each curve of vertical projection, two extreme points (left and right of the curve) could indicate the positions of corner features (e.g., lip corners, nose sides, etc.) Similar to the curve processing (horizontal integral projection) in the previous section, we also apply a thresholding approach to locate the feature positions. Noise in these curves is generally due to concave regions that are away from the main feature region. For example for the mouth region it could be the presence of dimples on the cheek. These portions therefore create distinct, low energy peaks in the curve. A “distinct peak” is defined as one which is bordered by zero-value samples on either edge. To remove this noise we eliminate any distinct peaks whose energy is lower than a percentage of the total energy of the curve (e.g., 10% is used for this purpose).
In this way, the locations of the corner features are estimated. By combining the feature bands and iso-geodesic stripes, the feature locations are further narrowed into smaller regions, allowing us to refine the feature positions by a concave-point clustering approach, which is described next.

4.1 Concave Points Clustering

Within the search area, we search for all concave points and group them into separate sets of connected components. A conventional recursive clustering algorithm is used. An initial seed concave point is randomly picked. Then, the connected concave points are searched recursively until all the connected points have been grouped in a set. The algorithm continues by picking another seed point from the ungrouped points, and a new round of search is carried out to group a second set of concave points. This procedure is repeated until all the concave points are grouped. Finally, the largest set of concave points is taken as the feature set, and its weighted center is estimated as the feature center. Figure 6 shows an example of the eight detected features on a 3D face model. Note that since the search regions have been limited to small areas, the search process is very efficient for small sets of concave points.
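A sketch of this clustering step: connected components of concave vertices are grown over the mesh adjacency (iteratively rather than with deep recursion, which behaves equivalently), and the centroid of the largest component is returned; the triangle-list adjacency construction and non-empty candidate set are assumptions.

```python
import numpy as np
from collections import defaultdict, deque

def largest_concave_cluster(vertices, triangles, candidate_idx):
    """candidate_idx: indices of concave vertices inside one intersection region."""
    nbrs = defaultdict(set)
    for a, b, c in triangles:
        nbrs[a].update((b, c)); nbrs[b].update((a, c)); nbrs[c].update((a, b))

    candidates, visited, best = set(candidate_idx), set(), []
    for seed in candidates:
        if seed in visited:
            continue
        component, queue = [], deque([seed])
        visited.add(seed)
        while queue:                                  # breadth-first flood fill
            v = queue.popleft()
            component.append(v)
            for n in nbrs[v]:
                if n in candidates and n not in visited:
                    visited.add(n); queue.append(n)
        if len(component) > len(best):
            best = component
    return np.asarray(vertices)[best].mean(axis=0)    # estimated feature center
```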
Fig. 6. 3D feature points (in red dots marked with estimated intersection regions) in two views
5 Experiments

We conducted feature detection experiments on 3D face databases [14][16]. Figure 7 shows some samples of feature detection on 3D model sequences with different expressions. Feature bands are marked in light purple, and feature points in red. This sample illustrates a depression in the area between the mouth and the chin while the angry expression is performed. In general, the algorithm performed well for the features detected on the nose sides, while some false detections occurred in the mouth and eye areas. In addition to the above subjective evaluation, we conducted an objective evaluation, in which we calculated the error between the detected feature points and the corresponding manually picked points on the face scans. We manually selected 8 key points as the ground truth in the areas of the mouth, eyes, and nose. After randomly
Fig. 7. Samples of detected features (red dots) on 3D facial expression sequences
selecting 200 models, we conducted a quantitative measurement as follows. We define a reference entity for each feature under examination. For example, the eye entity (Re) is the eye width (the distance between the two corners of an eye). The mouth entity (Rm) is the mouth width (the distance between the two corners of the mouth). The nose entity (Rn) is the nose width (the distance between the two sides of the nose). The absolute difference (D) is defined as the distance between the detected feature and the ground-truth feature. Therefore, the error of a feature is measured as the ratio of the absolute difference to the feature entity. Such a percentage measurement shows the relative error of the feature points. For example, the error of the eye features is Ee = De/Re, the error of the nose features is En = Dn/Rn, and the error of the mouth features is Em = Dm/Rm. Table 1 shows the average relative error of the detected features on 200 models.

Table 1. Relative measurement: relative errors (average) of detected feature points
Feature Point              (Average) Relative Error
Eye inner corner (left)    4.1%
Eye inner corner (right)   3.7%
Eye outer corner (left)    8.7%
Eye outer corner (right)   9.8%
Nose corner (left)         3.2%
Nose corner (right)        3.9%
Mouth corner (left)        5.8%
Mouth corner (right)       7.1%
In general, the eye corners show more error than the other features due to mesh noise, a relative lack of mesh detail, or confusion with eyebrow meshes in those areas. The outer eye corner may not be distinctly concave, and the presence of eyebrows in this region confuses the algorithm to a certain extent. Also, when the face is projected onto a 2D plane, the points near the edges get a compressed representation, which makes it difficult to analyze the characteristics of those regions.
6 Conclusions and Future Work

This paper presents a novel yet efficient approach for automatically detecting 3D feature points on 3D range face models. The algorithm presented takes into
consideration iso-geodesic stripes and the integral projection of concave points. The feature determination is based on the intersection of feature regions (i.e., feature bands, iso-geodesic stripes, and positions from the vertical integral projection). Among the test data, over 90% accuracy has been achieved on average in detecting eight features on 3D face models. Our future work consists of developing more robust algorithms to detect more feature points and improving the current approach to address more expression variations. We will also consider using a shape-index accumulation approach to improve the performance of feature band detection both vertically and horizontally. We will also test the algorithms on a larger volume of datasets (e.g., the FRGC 2.0 dataset). Acknowledgement. This material is based upon work supported in part by NSF (IIS-1051103, IIS-0541044), NYSTAR, and AFRL.
Hybrid Face Recognition Based on Real-Time Multi-camera Stereo-Matching
J. Hensler, K. Denker, M. Franz, and G. Umlauf
University of Applied Sciences Constance, Germany
Abstract. Multi-camera systems and GPU-based stereo-matching methods allow for a real-time 3d reconstruction of faces. We use the data generated by such a 3d reconstruction for a hybrid face recognition system based on color, accuracy, and depth information. This system is structured in two subsequent phases: geometry-based data preparation and face recognition using wavelets and the AdaBoost algorithm. It requires only one reference image per person. On a data base of 500 recordings, our system achieved detection rates ranging from 95% to 97% with a false detection rate of 2% to 3%. The computation of the whole process takes around 1.1 seconds.
1 Introduction
In recent years, 3d face recognition has become an important tool in many biometric applications. These systems are able to achieve high detection rates. However, there is one major drawback: the overall recognition process, including 3d reconstruction and face recognition, takes several seconds to several minutes. This time is unacceptable for biometric systems, e.g. security systems, credit card verification, access control or criminal detection. In order to speed up this process, a multi-camera stereo-matching system has been developed that can generate a high-resolution depth image in real-time [1]. Here, we use such a system (shown in Figure 1) for face recognition. A typical recording of this system is shown in Figure 2. Since most computations are done on the GPU, the system needs an average computation time of 263 milliseconds for one high resolution depth image (see [1]). In this paper, we show that the quality of these depth images is sufficiently high for 3d face recognition in the context of an access control system. An access control system requires a high detection rate at a low computation time. Hence, the recognition algorithm combines three different types of information obtained from the multi-camera stereo-matching system: a depth image (Figure 2(b)), a color image (Figure 2(a)), and a 3d reconstruction quality image (Figure 2(c)). Our 3d face recognition algorithm is structured in two subsequent phases (Figure 3): the data preparation phase (Section 3) and the face recognition phase (Section 4). In the data preparation phase the face data is segmented from the background in the color and depth images. Then, the 3d face data is transformed into frontal position by an optimized iterative closest point (ICP) algorithm.
Fig. 1. The multi-camera stereo-matching system used in this paper generates one depth image from four camera images
(a) Color image
(b) Depth image
(c) Quality image
Fig. 2. A typical recording of the multi-camera stereo-matching system. Bright pixels in the quality image depict regions with poor variation in the depth image.
Regions with poor quality are improved by a hole-filling algorithm. The face recognition phase uses an AdaBoost classifier based on histogram features that describe the distribution of the wavelet coefficients of the color and depth images.
2 Related Work
Similar to 2d face recognition, 3d face recognition methods can be divided into global and local approaches. Global methods recognize the whole face at once, while local approaches separate features of the face and recognize these features independently. A global approach is used in [2]. After a data preparation using symmetry- and nose-tip detection, an eigenface-based recognition is computed on the normalized depth images. For eigenfaces [3], a principal component analysis (PCA) is applied to the images from a face data base to compute basis-images. These basis-images are linearly combined to generate synthetic face images. Morphable models are parametric face models that yield a realistic impression and are used for 3D face synthesis [4]. In [5] these models are used for face recognition. The morphable model is fitted to a photograph and a distance of the model parameters is used for recognition. Fitting the morphable model takes several
minutes. A fast modification of this method is presented in [6]. A morphable model is computed only for the training faces. For the recognition, a support vector machine (SVM) is used to compare synthetic images of face components from the morphable model with face components extracted from photographs. SVM-based face recognition methods, such as [6–8], need a large training data base. The SVM is trained using several hundred positive and negative example data sets. To speed up the training of the SVM, the data is reduced to a set of facial features. Because of this reduction, these methods are local. An ICP algorithm similar to our data preparation phase is used in [9]. After a pre-matching using facial features, ICP is used to get a precise fit of the test data to a reference face. Differences of surface points on both data sets are used for recognition. Here, a PCA is used to reduce the dimension of the search space, and a Gaussian mixture model is used for the final recognition.

Fig. 3. The structure of the 3d face recognition system: data acquisition (color, depth, and quality images), data preparation (background separation, hole filling, alignment), and face recognition (wavelet analysis, χ² or GGD, AdaBoost)
3 Data Preparation
The data preparation phase gets as input the color, depth, and quality images as computed by a system like the one presented in [1]. For face recognition it is necessary to separate the regions in the images that contain information of the face from irrelevant, background regions. In an access control system, we assume that the face is the object closest to the camera. Thus, the points of the face are identified in the depth image to separate the face from the background in the color and quality images. The quality image contains information about the faithfulness of the 3d reconstruction. Low quality values characterize regions with a large instability in the depth image. Thus, these regions are removed from the 3d face model, leaving holes. These holes are filled with a moving least squares approach fitting a polynomial surface of up to degree four to the points around the hole [10]. Although the depth image contains a complete 3d model of the face after hole filling, its affine position relative to the camera is unknown.
Fig. 4. (a) ICP fit of a 3d mannequin head model (white points) to an incomplete 3d model, (b) aligned color image, and (c) depth image after the hole-filling
To align the 3d face model, we fit it to a mannequin head model in frontal position using an iterative closest point (ICP) algorithm [11]. For each point on both models the nearest point on the other model is computed. Then, a global affine transformation minimizing the distance of these point-pairs is computed. This affine transformation is applied to the 3d face model and the procedure is repeated until the changes become small enough. For the 3d models in our application, with more than 200,000 data points, the ICP algorithm is sped up as in [12] (see the sketch after the list):
– Point-pairs are computed only for a random subset of points.
– To compute the point-pairs a kd-tree is used.
– Outliers are rejected by a point-to-point distance threshold.
– For the first few iterations point-to-point distances are used. Later the algorithm uses point-to-plane distances.
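To make the accelerated alignment concrete, below is a minimal Python sketch of an ICP loop with the listed speedups: a random subset of points per iteration, a kd-tree for correspondence search, and rejection of outliers by a distance threshold. It is not the authors' implementation: it estimates a rigid (rotation plus translation) transform with the standard SVD/Kabsch solution and uses point-to-point distances throughout, whereas the paper fits a general affine transform and switches to point-to-plane distances in later iterations. The function names, sample size, and rejection threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping points P onto Q (Kabsch/SVD)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

def icp_align(src, ref, iters=30, sample=2000, reject=20.0, seed=0):
    """Align the 3d face points `src` to the mannequin points `ref` (both N x 3 arrays)."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(ref)                    # kd-tree over the reference model
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        idx = rng.choice(len(src), size=min(sample, len(src)), replace=False)
        moved = src[idx] @ R.T + t         # current estimate applied to a random subset
        d, j = tree.query(moved)           # nearest-neighbour point-pairs
        keep = d < reject                  # outlier rejection by distance threshold (units assumed)
        if keep.sum() < 3:
            break
        dR, dt = best_rigid_transform(moved[keep], ref[j[keep]])
        R, t = dR @ R, dR @ t + dt         # compose the incremental update
        if np.linalg.norm(dR - np.eye(3)) < 1e-6 and np.linalg.norm(dt) < 1e-6:
            break                          # changes have become small enough
    return R, t
```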
A resulting 3d model after ICP alignment is shown in Figure 4(a) for a 3d model without hole filling. The white points show the mannequin model. After the alignment, the color and depth images are also aligned with the computed affine transformation, see Figures 4(b) and 4(c). Further results of the complete data preparation phase for three depth images of the same person are shown in Figure 5. These images show that the data preparation is robust against different positions of the person relative to the camera, different rotations of the head, and different facial expressions.
4 Face Recognition
The face recognition phase is based on the aligned and completed depth and color images. First, a 2d wavelet transform is applied to both the depth and the color image. This transform generates a series of smaller images, called sub-bands, using a bank of low- and high-pass filters. Depending on the choice of the filters, one obtains different types of wavelets. We tested eight wavelets: Quadratic mirror filter (QMF) wavelets of size 5, 9 and 13, Daubechies wavelets of size 2, 3 and 4, and bi-orthogonal CDF wavelets of size 5/3 and 9/7.

Fig. 5. Result of the data preparation phase: Three different depth images of the same person aligned to a frontal position (aligned color/depth image in resp. right column)

The structure of the wavelet-transformed images is shown in Figure 6, where L and H refer to low-pass or high-pass filtering in either horizontal or vertical direction. The number refers to the level (octave) of the filtering. At each level, the low-pass sub-band (LL) is recursively filtered using the same scheme. The low-frequency sub-band LL contains most of the energy of the original image and represents a down-sampled low-resolution version. The higher-frequency sub-bands contain detail information of the image in horizontal (LH), vertical (HL) and diagonal (HH) directions. The distribution of the wavelet coefficient magnitudes in each sub-band is characterized by a histogram. Thus, the entire recording is represented by a feature vector that consists of the histograms of all sub-bands of the depth and the color image. Note that the wavelet coefficients of each sub-band are uncorrelated. Hence, it makes sense to train individual classifiers for each sub-band (referred to as weak classifiers), which are subsequently combined into a strong classifier by the AdaBoost algorithm. Our weak classifiers are simple thresholds on a similarity metric between sub-band histograms. We tested two types of similarity metrics: (1) the χ²-metric for histograms, and (2) the Kullback-Leibler (KL) divergence of generalized Gaussian density (GGD) functions fitted to the histograms.
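To illustrate the sub-band decomposition and the histogram features, the sketch below applies a plain 2D Haar transform recursively to the LL band and builds one normalized histogram of coefficient magnitudes per sub-band. The Haar filter stands in for the eight filters actually tested, the fixed bin count replaces the bin-size rule of Section 4.1, and the band naming and function names are simplifying assumptions.

```python
import numpy as np

def haar_level(img):
    """One level of a 2D Haar-like transform: returns (LL, LH, HL, HH) sub-bands."""
    a = np.asarray(img, dtype=float)
    lo = (a[:, 0::2] + a[:, 1::2]) / 2.0        # horizontal low-pass
    hi = (a[:, 0::2] - a[:, 1::2]) / 2.0        # horizontal high-pass
    LL = (lo[0::2, :] + lo[1::2, :]) / 2.0
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return LL, LH, HL, HH

def subband_histograms(img, levels=3, bins=32):
    """Feature vector: one normalized histogram of |coefficients| per sub-band."""
    feats = []
    ll = np.asarray(img, dtype=float)
    for _ in range(levels):
        ll, lh, hl, hh = haar_level(ll)          # recurse on the LL band
        for band in (lh, hl, hh):
            h, _ = np.histogram(np.abs(band), bins=bins)
            feats.append(h / h.sum())
    h, _ = np.histogram(np.abs(ll), bins=bins)   # final low-frequency band
    feats.append(h / h.sum())
    return feats
```

Applied to both the aligned depth image and the color image, a decomposition of this kind yields the per-recording feature vector described above.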
Fig. 6. The sub-band labeling scheme for a three level 2D wavelet transformation
4.1 χ²-Metric
The distribution of the wavelet coefficients of each sub-band is represented in a histogram. In order to find the optimal bin size for the histograms we used the method of [13], according to which the optimal bin size h is given by

h = 2(Q_{0.75} - Q_{0.25}) / \sqrt[3]{n},   (1)

where Q_{0.25} and Q_{0.75} are the 1/4- and 3/4-quantiles and n is the number of recordings in the training data base. The χ²-metric computes the distance d between two sub-band histograms H_1 and H_2 with N bins as

d(H_1, H_2) = \sum_{i=1}^{N} (H_1(i) - H_2(i))^2 / (H_1(i) + H_2(i)).   (2)

4.2 KL Divergence between Generalized Gaussian Density Functions
As an alternative to the χ²-metric, we tested a generalized Gaussian density (GGD) based method [14]. This method fits an individual GGD function to the coefficient distribution of each sub-band of the wavelet transform. The optimal fit is obtained by maximizing the likelihood using the Newton-Raphson method [14–16]. The distance between two GGD functions is estimated by the Kullback-Leibler divergence [17].

4.3 The AdaBoost Algorithm
The concept of boosting algorithms is to combine multiple weak classifiers to yield a strong classifier that solves the decision problem. The idea is that it is often easier to find several simple rules for a decision than one complex rule. The AdaBoost algorithm uses a training data set to build a strong classifier out of weak classifiers that solve binary decisions. For this purpose, the algorithm needs weak classifiers with a success rate of better than 50% on the training data, with independent errors. Then, the AdaBoost algorithm can be shown to improve the error rate by computing an optimal weight for each weak classifier. Let y_i = h_i(x) denote the output of the i-th of the M weak classifiers for the input x, and α_i the weight of h_i(x) generated by the AdaBoost algorithm. Then, the strong classifier is given by [18]

H(x) = sign( \sum_{i=1}^{M} α_i h_i(x) ).   (3)
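The sketch below shows how the pieces could fit together in code: the χ² distance of Eq. (2) compared against a per-sub-band threshold as a weak classifier, the strong classifier of Eq. (3), and a standard discrete AdaBoost weight update. The paper does not spell out its exact boosting variant or how thresholds are chosen, so the training routine, the small epsilon guarding empty bins, and all names are illustrative assumptions.

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-12):
    """Chi-square distance between two sub-band histograms (Eq. 2); eps avoids 0/0."""
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def weak_classify(dist, threshold):
    """+1 = 'same person', -1 = 'different person', from one sub-band distance."""
    return 1 if dist <= threshold else -1

def strong_classify(dists, thresholds, alphas):
    """AdaBoost strong classifier H(x) = sign(sum_i alpha_i h_i(x))  (Eq. 3)."""
    s = sum(a * weak_classify(d, th) for d, th, a in zip(dists, thresholds, alphas))
    return 1 if s >= 0 else -1

def adaboost_weights(decisions, labels, rounds):
    """Standard discrete AdaBoost: decisions[i][n] is the +/-1 output of weak
    classifier i on training example n; labels[n] is the true +/-1 label."""
    decisions, labels = np.asarray(decisions), np.asarray(labels)
    w = np.full(len(labels), 1.0 / len(labels))
    chosen, alphas = [], []
    for _ in range(rounds):
        errs = [np.sum(w * (d != labels)) for d in decisions]
        i = int(np.argmin(errs))
        e = min(max(errs[i], 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - e) / e)
        chosen.append(i)
        alphas.append(alpha)
        w *= np.exp(-alpha * labels * decisions[i])   # re-weight misclassified examples
        w /= w.sum()
    return chosen, alphas
```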
5 Results
For training and testing we collected a data base of approximately 500 depth images from 40 different persons. For some persons the images were taken at different times, with different lighting, different positions with respect to the camera
system, different facial expressions (open/closed mouth, smiling/not smiling, open/closed eyes) and different facial details (glasses/no glasses). Some example images are shown in Figure 7.

Fig. 7. Example images from our data base used for training and testing of the AdaBoost algorithm

The results of our recognition system are shown in the receiver operating characteristic (ROC) diagrams in Figure 9 and Table 1. The system was tested with different wavelet transform levels and different wavelet filters. Note that, if the weak classifiers are too strong or too complex, boosting might fail to improve the recognition, cf. [19]. An indicator for this behavior is a quick decrease of the error rate in the training phase. The error rate in the training phase compared to the number of weak classifiers is illustrated in Figure 8. Here, in the first wavelet level the error rate starts very low and strong classifiers improve relatively slowly. At wavelet level three the error rate starts higher and the boosting finds more weak classifiers to improve the error rate more effectively. Hence, a more robust and more reliable result is achieved in the third level of the wavelet decomposition.

Table 1. Results with our approach after 3-fold cross validation with different wavelet transformation levels and wavelet filters

filter   level=1   level=2   level=3
qmf5     0.9831    0.9898    0.9898
qmf9     0.9848    0.9897    0.9884
qmf13    0.9817    0.9890    0.9895
daub2    0.9798    0.9877    0.9892
daub3    0.9843    0.9859    0.9898
daub4    0.9877    0.9873    0.9891
cdf53    0.9847    0.9893    0.9914
cdf97    0.9836    0.9900    0.9912

Mean     0.9837    0.9886    0.9898
Std      0.0023    0.0015    0.0010
Fig. 8. Classification error versus number of weak classifiers at level one and three of the wavelet decomposition
Fig. 9. ROC curves for different wavelet transformation levels. At each level the four sub-bands LH, HL, HH, and LL for the depth (D ) and color (C ) images and their combination with AdaBoost (3d face) are shown.
Table 1 shows that the choice of wavelet filter clearly influences the result. The best result is achieved with the cdf53/cdf97 filter and wavelet transformation at level three. χ²-histogram comparison and GGD fitting yield similar results. Since the former is computationally more efficient, we use this metric in the current version of our system for faster response times. The recognition results are shown in Figure 9. Detection rates between 95% and 97% at a low false positive rate of 2% to 3% are obtained at the point of the minimal overall error of the ROC curve. The AdaBoost combination (3d face) of all sub-bands yields the best decision at levels two and three. At wavelet level four, the sub-bands become too small and the final AdaBoost classifier is not effective. For the presented results, we use the FireWire camera system from [1]. Color images and depth maps from this system have a resolution of 1392 × 1032 pixels. Currently the overall recognition time is 1.086 seconds with the χ²-metric. This includes the 3d reconstruction (263 ms [1]), the data preparation (731 ms), and
the face recognition (χ² method, level 3: 92 ms). Most of the time is consumed by the data preparation, which takes approximately 65% of the overall time. We are working on further improvements to the ICP algorithm, e.g., finding a better initial guess.
6 Conclusion and Future Work
Our analysis shows that the proposed system has satisfactory face recognition performance that is competitive with other systems, cf. [20]. A special advantage of our system is that it requires only one single reference depth image per person. Other systems often need more than one reference image without obtaining better ROC curves than ours, e.g. [7, 8]. Since the quality of the 3d model, colors, and shadows in the 2D images critically depend on the lighting of the faces, we expect that the performance of the current system can be significantly improved by controlling the lighting conditions. All computations take about one second, which is acceptable for a biometric system. This computation time also allows for taking several subsequent images to improve the detection rate. However, we are still working on various optimizations, especially for the data preparation phase, that will further reduce processing time. Acknowledgements. This work was supported by AiF ZIM Project KF 2372101SS9. We thank the students and employees of the HTWG Konstanz for providing the data in our face data base.
References 1. Denker, K., Umlauf, G.: Accurate real-time multi-camera stereo-matching on the gpu for 3d reconstruction. Journal of WSCG 19, 9–16 (2011) 2. Pan, G., Han, S., Wu, Z., Wang, Y.: 3D face recognition using mapped depth images. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 175– 181 (2005) 3. Turk, M., Pentland, A.: Eigenfaces for recognition. Cognitive Neuroscience 3, 71–86 (1991) 4. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: SIGGRAPH 1999, pp. 187–194 (1999) 5. Blanz, V., Romdhani, S.: Face identification across different poses and illuminations with a 3d morphable model. In: Int’l. Conf. on Automatic Face and Gesture Recognition, pp. 202–2007 (2002) 6. Weyrauch, B., Huang, J., Heisele, B., Blanz, V.: Component-based face recognition with 3d morphable models. In: Workshop on Face Processing in Video, pp. 1–5 (2003) 7. Lee, Y., Song, H., Yang, U., Shin, H., Sohn, K.: Local feature based 3D face recognition. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 909–918. Springer, Heidelberg (2005) 8. Lee, J., Kuo, C., Hus, C.: 3d face recognition system based on feature analysis and support vector machine. In: IEEE TENCON 2004, pp. 144–147 (2004)
9. Cook, J., Ch, V., Sridharan, S., Fookes, C.: Face recognition from 3d data using iterative closest point algorithm and Gaussian mixture models. In: 2nd Int’l. Symp. 3D Data Processing, Visualization, and Transmission, pp. 502–509 (2004) 10. Wang, J., Oliveira, M.: A hole-filling strategy for reconstruction of smooth surfaces in range images. In: SIBGRAPI 2003, pp. 11–18 (2003) 11. Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 239–256 (1992) 12. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: 3dim, p. 145. IEEE Computer Society, Los Alamitos (2001) 13. Freedman, D., Diaconis, P.: On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields 57, 453–476 (1981) 14. Lamard, M., Cazuguel, G., Quellec, G., Bekri, L., Roux, C., Cochener, B.: Content based image retrieval based on wavelet transform coefficients distribution. In: 29th IEEE Conf. of the Engineering in Medicine and Biology Society, pp. 4532–4535 (2007) 15. Varanasi, M., Aazhang, B.: Parametric generalized Gaussian density estimation. J. of the Acoustical Society of America 86, 1404 (1989) 16. Do, M., Vetterli, M.: Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Trans. on Image Processing 11, 146– 158 (2002) 17. Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951) 18. Hensler, J., Blaich, M., Bittel, O.: Improved door detection fusing camera and laser rangefinder data with AdaBoosting. In: 3rd Int.’l Conf. on Agents and Artificial Intelligence, pp. 39–48 (2011) 19. Schapire, R.: A brief introduction to boosting. In: International Joint Conference on Artificial Intelligence, vol. 16, pp. 1401–1406 (1999) 20. Bowyer, K., Chang, K., Flynn, P.: A survey of approaches and challenges in 3d and multi-modal 3d+2d face recognition. Computer Vision and Image Understanding 101, 1–15 (2006)
Learning Image Transformations without Training Examples
Sergey Pankov
Harik Shazeer Labs, Palo Alto, CA 94301
Abstract. The use of image transformations is essential for efficient modeling and learning of visual data. But the class of relevant transformations is large: affine transformations, projective transformations, elastic deformations, ... the list goes on. Therefore, learning these transformations, rather than hand coding them, is of great conceptual interest. To the best of our knowledge, all the related work so far has been concerned with either supervised or weakly supervised learning (from correlated sequences, video streams, or image-transform pairs). In this paper, on the contrary, we present a simple method for learning affine and elastic transformations when no examples of these transformations are explicitly given, and no prior knowledge of space (such as ordering of pixels) is included either. The system has only access to a moderately large database of natural images arranged in no particular order.
1 Introduction
Biological vision remains largely unmatched by artificial visual systems across a wide range of tasks. Among its most remarkable capabilities are the aptitude for unsupervised learning and efficient use of spatial transformations. Indeed, the brain's proficiency in various visual tasks seems to indicate that some complex internal representations are utilized to model visual data. Even though the nature of those representations is far from understood, it is often presumed that learning them in an unsupervised manner is central to biological neural processing [1] or, at the very least, highly relevant for modeling neural processing computationally [2–4]. Likewise, it is poorly understood how the brain implements various transformations in its processing. Yet it must be clear that the level of learning efficiency demonstrated by humans and other biological systems can only be achieved by means of transformation-invariant learning. This follows, for example, from the observation that people can learn to recognize objects fairly well from only a small number of views. Covering both topics (unsupervised learning and image transformations) at once, by way of learning transformations without supervision, appears interesting to us for two reasons. Firstly, it can potentially further our understanding of unsupervised learning: what can be learned, how it can be learned, what are its strengths and limitations. Secondly, the class of transformations important for representing visual data may be too large for manual construction. In addition
to transformations describable by a few parameters, such as affine, the transformations requiring infinitely many parameters, such as elastic, are deemed to be important [5]. Transformations need not be limited to spatial coordinates; they can involve the temporal dimension or color space. Transformations can be discontinuous, can be composed of simpler transformations, or can be non-invertible. All these cases are likely to be required for efficient representation of, say, an animal or person. Unsupervised learning opens the possibility of capturing such diversity. A number of works have been devoted to learning image transformations [6–11]. Other works were aimed at learning perceptual invariance with respect to the transformations [12–14], but without explicitly extracting them. Often, no knowledge of space structure was assumed (such methods are invariant with respect to random pixel permutations), and in some cases the learning was termed unsupervised. In this paper we adopt a more stringent notion of unsupervised learning, by requiring that no ordering of the image dataset be provided. In contrast, the authors of the cited references considered some sort of temporal ordering: either sequential (synthetic sequences or video streams) or pairwise (grouping original and transformed images). Obviously, a learning algorithm can greatly benefit from temporal ordering, just as ordering of pixels opens the problem to a host of otherwise unsuitable strategies. Ordering of images provides explicit examples of transformations. Without ordering, no explicit examples are given. It is in this sense that we talk about learning without (explicit) examples. The main goal of this paper is to demonstrate learning of affine and elastic transformations from a set of natural images by a rather simple procedure. Inference is done on a moderately large set of random images, and not just on a small set of strongly correlated images. The latter case is a (simpler) special case of our more general problem setting. The possibility of inferring even simple transformations from an unordered dataset of images seems intriguing in itself. Yet, we think that dispensing with temporal order has a wider significance. Temporal proximity of visual percepts can be very helpful for learning some transformations but not others. Even the case of 3D rotations will likely require generation of hidden parameters encoding higher-level information, such as shape and orientation. That will likely require processing a large number of images off-line, in a batch mode, incompatible with temporal proximity. The paper is organized as follows. A brief overview of related approaches is given in Section 2. Our method is introduced in Section 3. In Section 4 the method is tested on synthetic and natural sets of random images. In Section 5 we conclude with a discussion of limitations and possible extensions of the current approach, outlining a potential application to the learning of 3D transformations.
2 Related Work
It is recognized that transformation invariant learning, and hence transformations themselves, possess great potential for artificial cognition. Numerous systems,
attempting to realize this potential, have been proposed over the last few decades. In most cases the transformation-invariant capabilities were built-in. In the context of neural networks, for example, translational invariance can be built in by constraining weights of connections [15, 16]. Some researchers used natural image statistics to infer the underlying structure of space without inferring transformations. For example, ideas of redundancy reduction applied to natural images, such as independent component analysis or sparse features, lead to unsupervised learning of localized retinal receptive fields [17] and localized oriented features, both in spatial [18] and spatio-temporal [19] domains. As we said, transformation (or transformation-invariant) learning has so far been implemented by taking advantage of temporal correlation in images. In Refs. [12–14] transformation-invariant learning was achieved by incorporating delayed response to stimuli into Hebbian-like learning rules. By explicitly parametrizing affine transformations with continuous variables it was possible to learn them first to linear order in a Taylor expansion [6] and then non-perturbatively as a Lie group representation [7, 8, 11]. In the context of energy-based models, such as Boltzmann machines, transformations can be implemented by means of three-way interactions between stochastic units. The transformations are inferred by learning interaction strengths [9, 10]. In all these cases the corresponding algorithms are fed with training examples (of possibly several unlabeled types) of transformations. Typically, images do not exceed 40 × 40 pixels in size. Below we demonstrate that image transformations can be learned without supervision, and without temporal ordering of training images. We consider both synthetic and natural binary images, achieving slightly better results for the synthetic set. Transformations are modeled as pixel permutations in 64 × 64 images. We see many possible modifications to our algorithm enabling more flexible transformation representation, more efficient learning, larger image sizes, etc. These ideas are left for future exploration. In the current manuscript, our main focus is on showing the feasibility of the proposed strategy in its basic incarnation.
3 Learning Transformations from Unordered Images
The basic idea behind our algorithm is extremely simple. Consider a pair of images and a transformation function. Introduce an objective function characterizing how well the transformation describes the pair, treating it as an image-transform pair. Minimize the value of the objective function across a subset of pairs by modifying the subset and the transformation incrementally and iteratively. The subset is modified by finding better-matching pairs in the original set of images, using fast approximate search. We found that a simple hill climbing technique was sufficient for learning transformations in relatively large 64 × 64 images. Below we describe the algorithm in more detail.
3.1 Close Match Search
Let S be a set of binary images of size L × L. We sometimes refer to S as the set of random images. The images are random in the sense that they are drawn at random from a much larger set N , embodying some aspects of natural image statistics. For example, N could be composed of: a) images of a white triangle on black background with integer-valued vertex coordinates (|N | = L3 /3! images), b) L × L patches of (binarized) images from the Caltech-256 dataset [20]. We will consider both cases. Notice that our definition of S implies that it needs to be sufficiently large to contain pairs of images connectable by a transformation of interest. Otherwise such transformation cannot be learned. To learn a transformation at L = 64 we will need |S| to be in the order of 104 − 105 , with the number of close match searches in the order of 105 − 106 . Clearly, it is crucial to employ some efficient search technique. In a wide class of problems a significant speedup can be achieved by abandoning exact nearest neighbor search in favor of approximate nearest neighbor search, with little loss in quality of performance. Our problem appears to belong to this class. Therefore, approximate algorithms, such as best bin first [21] or locality sensitive hashing (LSH) [22], are potential methods of choice. LSH seems especially suitable thanks to its ability to deal with high-dimensional data, like vectors of raw pixels. On the flip side, LSH requires estimation of optimal parameters, which is typically done with an implicit assumption that the query point is drawn from the same distribution as the data points. Not only is that not the case here, the query distribution itself changes in the course of the algorithm run. Indeed, in our case the query is the image transform under the current estimate of the transformation. It gradually evolves from a random permutation, to something approximating a continuous 2D transformation. To avoid these complications we opt for storing images in binary search trees, while also creating multiple replicas of the tree to enhance performance in the spirit of LSH. Details are given below, but first we introduce a few notations. Let a L×L binary image be represented by a binary string x ≡ x1 ...xL2 , where xi encodes the color (0=black, 1=white) of the pixel in the i-th position (under some reference ordering of pixels). Let o be an ordering of pixels defined as a permutation relative to the reference ordering. Given o, the image is represented by the string x(o) ≡ xo(1) ...xo(L2 ) . We will refer to an image and its string representation interchangeably, writing xI (o) to denote an image I. Let B(o) be a binary search tree that stores images I ∈ S according to (lexicographic) order of xI (o). Rather than storing one image per leaf, we allow a leaf to contain up to m images (that is any subtree containing up to m images is replaced by a leaf). We construct l versions of the tree data structure, each replica with a random choice of oi , i = 1, ..., l. This replication is intended to reduce the possibility of a good match being missed. A miss may happen if a mismatching symbol (between a query string and a stored string) occurs too soon when the tree is searched. Alternatively, one could use a version of A* search, as in the
best bin first algorithm, tolerating mismatched symbols. However, our empirical results suggest that the approach benefits from tree replication, possibly because information arrives from a larger number of pixels in this case. To find a close match to an image I, we search every binary tree B(o_i) in the usual way, using x_I(o_i) as query string. The search stops at a node n if: a) n is a leaf-node, or b) the search cannot proceed further (n lacks an appropriate branch). All the images from the subtree rooted at n are returned. In the final step we compute the distance to every returned candidate and select the image closest to I. In short, in our close match search algorithm we use multiple binary search trees, with distinct trees storing images in distinct random orderings of pixels. The described approximate search yields a speedup of |S|/ml over the exact nearest neighbor search. For the values m = 5 and l = 10 that we used in our experiments (see Section 4), the speedup was about 10²–10³.
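A minimal Python sketch of the close match search is given below. To keep it short, the explicit binary search trees are replaced by sorted key arrays: each of the l replicas reads an image's pixels in its own random ordering, and a query inspects only a small bucket of lexicographic neighbours per replica before an exact Hamming comparison, playing the role of the m-image leaves. Images are assumed to be bit-packed Python integers; the class name, parameters, and the bucket heuristic are illustrative assumptions, not the authors' data structure.

```python
import random
from bisect import bisect_left

def hamming(a, b):
    return bin(a ^ b).count("1")

class ApproxMatcher:
    def __init__(self, images, n_pixels, l=10, bucket=5, seed=0):
        rng = random.Random(seed)
        self.images = images
        self.bucket = bucket
        self.orders = [rng.sample(range(n_pixels), n_pixels) for _ in range(l)]
        # one sorted (permuted key, image index) array per replica
        self.keys = [sorted((self._key(img, o), i) for i, img in enumerate(images))
                     for o in self.orders]

    @staticmethod
    def _key(img, order):
        # the image's bits read in this replica's pixel ordering
        return tuple((img >> p) & 1 for p in order)

    def query(self, img):
        """Return (closest stored image, Hamming distance) among the inspected candidates."""
        best, best_d = None, float("inf")
        for o, keyed in zip(self.orders, self.keys):
            pos = bisect_left(keyed, (self._key(img, o),))
            lo, hi = max(0, pos - self.bucket), min(len(keyed), pos + self.bucket)
            for _, i in keyed[lo:hi]:
                d = hamming(img, self.images[i])
                if d < best_d:
                    best, best_d = self.images[i], d
        return best, best_d
```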
3.2 Transformation Optimization
We define image transformation T as a permutation of pixels t. That is, T x = x(t). Despite obvious limitations of this representation for describing geometric transformations, we will demonstrate its capacity for capturing the essence of affine and elastic transformations. To be precise, our method in the current formulation can only capture a volume-preserving subset of these transformations. But removing this limitation should not be too difficult (see Section 5 for some discussion). We denote a pair of images as (I, I') or (x_I, x_{I'}). The Hamming distance between strings x and x' is defined as d(x, x') ≡ \sum_i (x_i - x'_i)^2. The objective function d_T, describing how well a pair of images is connected by the transformation T, is defined as:

d_T(I, I') ≡ d(T x_I, x_{I'}).   (1)

Thus, the objective function uses the Hamming distance to measure dissimilarity between the second image and the transform of the first image. We will be minimizing d_T across a set of pairs, which we call the pair set and denote it P. The objective function D_T over P is defined as:

D_T ≡ \sum_{p ∈ P} d_T(p),   (2)
where we used a shorthand notation p for a pair from the pair set. We refer to the minimization of DT while P is fixed and T changes as the transformation optimization. We refer to the minimization of DT while T is fixed and P changes as the pair set optimization. In the transformation optimization phase the algorithm attempts to minimize DT by incrementally modifying T . A simple hill climbing is employed: a pair of elements from t are exchanged at random, the modification is accepted if DT
does not increase. A transformation modification affects every transformed string x(t) from P. However, there is an economical way of storing the pair set that makes computation of D_T particularly fast. Consider the first images of all pairs from P. Consider a matrix whose rows are these images. Let x_i be the i-th column of this matrix. Define x'_i similarly, by arranging the second images in the same order as their pair counterparts. The objective function expressed through these vector notations then reads:

D_T = \sum_{i=1}^{L^2} (x_{t(i)} - x'_i)^T (x_{t(i)} - x'_i).   (3)
If the i-th and j-th elements of t are exchanged, then the corresponding change (ΔD_T)_{ij} in the objective function reads:

(ΔD_T)_{ij} = 2 (x_{t(i)} - x_{t(j)})^T (x'_i - x'_j),   (4)
which involves computing four terms of the form x_a^T x_b. For a binary image, as in our case, the vectors are binary strings and their dot products can be computed efficiently using bitwise operations. Notice also that the vectors in Eq. (3) are unchanged throughout the transformation optimization phase; only t is updated. A transformation optimization phase followed by a pair set optimization phase constitutes one iteration of the algorithm. There are n_t attempted transformation modifications per iteration.
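A sketch of this incremental update with bit-packed columns follows: each column x_i and x'_i from Eq. (3) is stored as a Python integer holding one bit per pair, so every dot product in Eq. (4) reduces to a popcount of a bitwise AND. The accompanying hill-climbing loop accepts a swap whenever D_T does not increase. Names and the random-number interface are illustrative.

```python
import random

def dot(a, b):
    """Dot product of two binary columns stored as bit-packed integers."""
    return bin(a & b).count("1")

def delta_D(t, i, j, cols, cols_p):
    """Change in D_T if t[i] and t[j] are exchanged (Eq. 4).

    cols[k]   : bit-column of pixel k over the first images of all pairs
    cols_p[k] : bit-column of pixel k over the second images (x'_k)
    """
    a, b = cols[t[i]], cols[t[j]]
    u, v = cols_p[i], cols_p[j]
    return 2 * (dot(a, u) - dot(a, v) - dot(b, u) + dot(b, v))

def optimize_transformation(t, cols, cols_p, n_t, rng=random):
    """One transformation optimization phase: n_t attempted random exchanges."""
    L2 = len(t)
    for _ in range(n_t):
        i, j = rng.randrange(L2), rng.randrange(L2)
        if delta_D(t, i, j, cols, cols_p) <= 0:   # accept if D_T does not increase
            t[i], t[j] = t[j], t[i]
```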
3.3 Pair Set Optimization
The goal of the pair set optimization is twofold. On one hand, we want P to contain pairs that minimize D_T. On the other hand, we would like to reduce the possibility of getting stuck at a local minimum of D_T. To achieve the first goal, we update P by adding new pairs, ranking all pairs according to d_T and removing the pairs with highest d_T. To add a new pair (I, I') we pick image I at random from S, then search for I' as a close match to T x_I. To achieve the second goal, we add stochastic noise to the process by throwing out random pairs from the pair set. We denote by n_n and n_r the number of newly added pairs and the number of randomly dropped pairs respectively, both per iteration. After n_n pairs are added and n_r pairs are dropped, we remove n_n − n_r pairs with the highest d_T, so that the number of pairs |P| in the pair set remains unchanged.
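A matching sketch of one pair set optimization phase is shown below. It assumes images are bit-packed integers, reuses an approximate matcher like the one sketched in Section 3.1 for the close match search, and keeps |P| constant by adding n_n pairs, dropping n_r at random, and pruning the n_n − n_r pairs with the highest d_T. All names are illustrative.

```python
import random

def hamming(a, b):
    return bin(a ^ b).count("1")

def apply_T(t, img):
    """T x = x(t): bit i of the transform is bit t[i] of the image."""
    out = 0
    for i, src in enumerate(t):
        if (img >> src) & 1:
            out |= 1 << i
    return out

def optimize_pair_set(pairs, images, t, matcher, n_n, n_r, rng=random):
    for _ in range(n_n):                                        # add n_n new pairs
        first = rng.choice(images)
        second, _ = matcher.query(apply_T(t, first))            # close match to T x_I
        pairs.append((first, second))
    for _ in range(n_r):                                        # stochastic noise: drop random pairs
        pairs.pop(rng.randrange(len(pairs)))
    pairs.sort(key=lambda p: hamming(apply_T(t, p[0]), p[1]))   # rank by d_T (Eq. 1)
    del pairs[len(pairs) - (n_n - n_r):]                        # remove the n_n - n_r worst pairs
```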
3.4 Summary of the Algorithm
We briefly summarize our algorithm in Alg. 3.1. First, T and P are randomly initialized; then the procedure minimizeD(T, P) is called. It stochastically minimizes D_T by alternating between transformation optimization and pair set optimization for a total of n_i iterations.
Algorithm 3.1. minimizeD(T, P)
  for 1 to n_i do
    for 1 to n_t do
      (i, j) ← (rand(L²), rand(L²))
      exchange(i, j, T)
      if deltaD(i, j, T, P) > 0
        then exchange(i, j, T)
    addPairs(n_n, P)
    dropPairs(n_r, P)
    removePairs(n_n − n_r, P)
Calls to other procedures should be self-explanatory in the context of the already provided description: rand(n) generates a random integer in the interval [1, n], exchange(i, j, T ) exchanges the i-th and j-th elements of t, deltaD(i, j, T, P) computes (ΔDT )ij according to Eq.(4); finally, addPairs(n, P), dropPairs(n, P) and removePairs(n, P) adds random, drops random, and removes worst performing (highest dT ) n pairs respectively, as explained in subsection 3.3. As is often the case with greedy algorithms, we cannot provide guarantees that our algorithm will not get stuck in a poor local minimum. In fact, due to the stochasticity of the pair set optimization, discussing convergence itself is problematic. Instead, we provide convincing empirical evidence of the algorithm’s efficacy by demonstrating in the next section how it correctly learns a diverse set of transformations.
4 Results
We tested our approach on two image sets: a) synthetic set of triangles, b) set of natural image patches. These experiments are described below.

4.1 Triangles
Edges and corners are among the commonest features of natural scene images. Therefore a set of random triangles is a good starting point for testing our approach. The set S is drawn from the set N of all possible white triangles on black background, whose vertex coordinates are restricted to integer values in the range [0, L). For convenience, we additionally restricted S to contain only images with at least 10% of minority pixels. This was done to have better-balanced search trees, and also to increase the informational content of S, since little can be learned from little-varying images. Our goal was merely to demonstrate that this approach can work, therefore we did not strive to find the best possible parameters of the algorithm. Some parameters
were estimated¹, and some were found by a bit of trial and error. The parameters we used were: L = 64, m = 5, l = 10, |S| = 30000, |P| = 200, n_t = 10000, n_n = 10, n_r = 1, n_i = 3000.

¹ A rough estimate of set sizes goes as follows. Say we want to infer a transformation at a resolution of ε pixels. A random triangle will have a match in S within this resolution if |S| ≥ (L/ε)³. The transformation will be represented by P down to the required resolution if |P| ≥ L/ε. An even more hand-waving estimate of the algorithm loop sizes goes as follows. To ensure the incremental character of changes we need n_t ≪ L⁴ and n_n ≪ |P|. To counter the threat of poor local minima we need P to be renewed many times, but not too fast, so n_i ≫ |P|/n_r and n_r ≪ n_n. These estimates should be viewed as no more than educated guesses.

We want to show that the algorithm can learn without supervision multiple distinct transformations that are representative of S. The simplest strategy is to generate transformations starting from random T and P, eliminating samples with higher D_T to minimize the chance of including solutions from poor local minima. For more efficiency, compositions of already learned transformations can be used as initial approximations to T. Compositions can also be chosen to be far from learned samples. We found that for L = 64 poor solutions occur rarely, in less than approximately 10% of cases. By poor we mostly mean a transformation that appears to have a singularity in its Jacobian matrix. We chose to generate about three quarters of the transformations from non-random initial T, setting n_i = 1000 in such cases. Half of all generated samples were kept. In this way the algorithm learned about fifty transformations completely without supervision. All learned transformations looked approximately affine. Selected representative examples (for better quality, additionally iterated with n_i = 5000, |P| = 300 and |S| = 100000) are shown in Fig. 1(a). Since the human eye is very good at detecting straight parallel lines, we deemed it sufficient to judge the quality of the learned affine transformations by visual inspection of the transforms of appropriate patterns. The transformations are visualized by applying them to an L × L portrait picture and checkerboard patterns with check sizes 32, 16, 8, 4 and 2. Since the finest checkerboard pattern is clearly discernible, we conclude that the achieved resolution is no worse than 2 pixels. With our choice of representing transformations as pixel permutations it is difficult to expect a much better resolution. Other consequences of this choice are: a) all the captured transformations are volume preserving, b) there are white-noise areas that correspond to pixels that should be mapped from outside of the image in a proper affine transformation. Nonetheless, this representation does capture most of the aspects of affine transformations. To better illustrate this point we plot in Fig. 1(b) the values of various parameters of all the learned examples. The parameters of an affine transformation ξ' = Aξ + b are computed using

A = C_{ξ'ξ} (C_{ξξ})^{-1},    b = μ_{ξ'} − A μ_ξ,   (5)

where C and μ are the covariance matrix and the mean: C_{ab} = ⟨ab⟩ − μ_a μ_b and μ_a = ⟨a⟩. The averaging ⟨·⟩ is weighted by a Gaussian with standard deviation σ = 0.1L centered at (L/2, L/2). The weighting is needed because our representation cannot capture an affine transformation far from the image center. We further parametrize A in terms of a consecutively applied scaling S, transvection Λ, and rotation R. That is, A = RΛS, where
deviation σ = .1L centered at (L/2, L/2). The weighting is needed because our representation cannot capture an affine transformation far from the image center. We further parametrize A in terms of a consecutively applied scaling S, transvection Λ and rotation R. That is A = RΛS where: sx 0 1λ cos θ − sin θ S= , Λ= , R= . (6) 0 sy 01 sin θ cos θ
The parameters s_x, s_y, λ and θ expressed in terms of A are: s_x = \sqrt{A_{11}^2 + A_{21}^2}, s_y = Det(A)/s_x, λ = (A_{11} A_{12} + A_{21} A_{22})/(s_x s_y) and θ = atan2(A_{21}, A_{11}), where Det(A) = A_{11} A_{22} − A_{21} A_{12}. From Fig. 1(b) we see that the parameter values are evenly distributed over certain ranges without obvious correlations. Unexplored regions of the parameter space correspond to excessive image distortions, with not many images in S connectable by such transformations at reasonable cost. Also, |Det(A)| across all transformations was found to be 0.998 ± 0.004, validating our claim of volume preservation.
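A small helper implementing this parametrization is sketched below; it evaluates the closed-form expressions for s_x, s_y, λ, and θ from a 2×2 matrix A under the decomposition A = RΛS. The function name and the nested-list matrix format are assumptions.

```python
import math

def affine_params(A):
    """Scale, transvection and rotation parameters of A = R @ Lam @ S  (Eqs. 5-6).

    A is a 2x2 matrix given as [[A11, A12], [A21, A22]].
    """
    det = A[0][0] * A[1][1] - A[1][0] * A[0][1]
    sx = math.hypot(A[0][0], A[1][0])                           # sqrt(A11^2 + A21^2)
    sy = det / sx
    lam = (A[0][0] * A[0][1] + A[1][0] * A[1][1]) / (sx * sy)
    theta = math.atan2(A[1][0], A[0][0])
    return sx, sy, lam, theta
```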
4.2 Natural Image Patches
In the second experiment we learned transformations from a set of natural images, derived from the Caltech-256 dataset [20]. The original dataset was
converted to binary images using k-means clustering with k = 2. Non-overlapping L × L patches with a minority pixel fraction of at least 10% were included in N. We had |N| ≈ 500000. We used the following algorithm parameters: L = 64, m = 5, l = 10, |S| = 200000, |P| = 1000, n_t = 10000, n_n = 20, n_r = 1, n_i = 5000. Natural images are somewhat richer than the triangle set; consequently, the transformations we learned were also richer. A typical transformation looked like a general elastic deformation, often noticeably differing from an affine transformation. White-noise areas were much smaller or absent, while the resolution was lower, at about 3 pixels. A typical example is shown in Fig. 1(c).
5 Discussion and Conclusion
In this paper we have demonstrated conceptual feasibility of learning image transformations from scratch: without image set or pixel set ordering. To the best of our knowledge learning transformations from unordered image dataset has never been considered before. Our algorithm, when applied to natural images, learns general elastic transformations, of which affine transformations are a special case. For the sake of simplicity we chose to represent transformations as pixel permutations. This choice restricted transformations by enforcing volume conservation. In addition, it adversely affected the resolution of transformations. We also limited images to binary form, although the learned transformations can be applied to any images. Importantly, we do not see any reason why our main idea would not be applicable in the case of a general linear transformation acting on continuously-valued pixels. In fact, the softness of continuous representation may possibly improve convergence properties of the algorithm. We plan to explore this extension, expecting it to capture arbitrary scaling transformations and to increase the resolution of learned transformations. Images that we considered were relatively large by standards of the field. For even larger images chances of getting trapped in a poor local minimum increase. To face this challenge we can propose a simple modification. Images should be represented by a random subset of pixels. Learning should be easy with a small initial size of the subset. In this way one learns a transformation at a coarse grained level. Pixels then are gradually added to the subset, increasing the transformation resolution, until all pixels are included. Judging from our experience, this modification will allow tackling much larger images. It seems advantageous for the efficiency of neural processing to factor high dimensional transformations, such as affine transformations, into more basic transformations. How the learned random transformations can be used to that end is another interesting problem. In our view, 3D rotations Rη → η can be learned in a similar fashion as we learned affine transformations, with orientations η playing role of pixels in the current work. The problem however is much harder since we do not have direct access to hidden variables η. Indirect access is provided through projected transformations A(R, η), where set of A is presumed to have been learned (apart
from its dependence on the arguments R and η). We believe that the presence of multiple orientations in a given image and multiple images should constrain R and A sufficiently for them to be learnable. To conclude, we consider the presented idea of unsupervised learning of image transformation novel and valuable, opening new opportunities in learning complex transformations, possibly tackling such difficult cases as projections of 3D rotations. Acknowledgments. We gratefully acknowledge many useful discussions with Noam Shazeer and Georges Harik.
References 1. Barlow, H.B.: Unsupervised learning. Neural Computation 1, 295–311 (1989) 2. Hinton, G.E., Sejnowski, T.J. (eds.): Unsupervised Learning: Foundations of Neural Computation. Computational Neuroscience. MIT Press, Cambridge (1999) 3. Zemel, R.: A Minimum Description Length Framework for Unsupervised Learning. PhD thesis, University of Toronto (1993) 4. Oja, E.: Unsupervised learning in neural computation. Theoret. Comput. Sci. 287, 187–207 (2002) 5. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24, 509–522 (2002) 6. Rao, R.P.N., Ballard, D.H.: Localized receptive fields mediate transformationinvariant recognition in the visual cortex. In: Univ. of Rochester (1997) 7. Rao, R., Ruderman, D.L.: Learning lie groups for invariant visual perception. In: Advances in Neural Information Processing Systems 11, pp. 810–816. MIT Press, Cambridge (1999) 8. Miao, X., Rao, R.P.N.: Learning the lie groups of visual invariance. Neural Computation 19, 2665–2693 (2007) 9. Memisevic, R., Hinton, G.E.: Unsupervised learning of image transformations. In: Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 10. Memisevic, R., Hinton, G.E.: Learning to represent spatial transformations with factored higher-order boltzmann machines. Neural Computation 22, 1473–1492 (2010) 11. Sohl-Dickstein, J., Wang, J.C., Olshausen, B.A.: An unsupervised algorithm for learning lie group transformations. CoRR abs/1001.1027 (2010) 12. F¨ oldi´ ak, P.: Learning invariance from transformation sequences. Neural Computation 3, 194–200 (1991) 13. Wallis, G., Rolls, E., Foldiak, P.: Learning invariant responses to the natural transformations of objects. In: Proceedings of 1993 IEEE International Conference on Neural Networks (ICNN 1993), IEEE/INNS, Nagoya, Japan, vol. 2, pp. 1087–1090. Oxford U (1993) 14. Stringer, S.M., Perry, G., Rolls, E.T., Proske, J.H.: Learning invariant object recognition in the visual system with continuous transformations. Biological Cybernetics 94, 128–142 (2006) 15. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193–202 (1980)
16. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In: Touretzky, D. (ed.) Advances in Neural Information Processing Systems 2 (NIPS 1989). Morgan Kaufmann, Denver (1990) 17. Atick, J.J., Redlich, A.N.: Convergent algorithm for sensory receptive field development. Neural Computation 5, 45–60 (1993) 18. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996) 19. van Hateren, J.H., Ruderman, D.L.: Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings. Biological Sciences The Royal Society 265, 2315–2320 (1998) 20. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007) 21. Beis, J.S., Lowe, D.G.: Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In: Conference on Computer Vision and Pattern Recognition, pp. 1000–1006 (1997) 22. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Atkinson, M.P., Orlowska, M.E., Valduriez, P., Zdonik, S.B., Brodie, M.L. (eds.) Proceedings of the Twenty-fifth International Conference on Very Large Databases, pp. 518–529. Morgan Kaufmann Publishers, Edinburgh (1999)
Investigation of Secondary Views in a Multimodal VR Environment: 3D Lenses, Windows, and Mirrors
Phanidhar Bezawada Raghupathy and Christoph W. Borst
University of Louisiana at Lafayette
Abstract. We investigate secondary view techniques in a multimodal VR environment for dataset exploration and interpretation. Secondary views, such as 3D lenses or mirrors, can present alternative viewpoints, different filtering options, or different data sets. We focus on 3D views showing surface features that are hidden in a main view. We present different view techniques, including new variations, and experimentally compare them. Experiment subjects marked paths on a geological dataset in a manner that required a secondary view for a portion of each path. We compared passive to interactive (reach-in) views, rotated to mirrored presentations, and box vs. window shapes. We also considered two types of path complexity arising from surface contact geometry impacting force feedback, as the level of lateral guidance provided by the contact geometry may impact relative effectiveness of different view techniques. We show several differences in task times, error rates, and subjective preferences. Best results were obtained with an interactive box shape.
1 Introduction
We investigate different ways of presenting a secondary view in a multimodal (multisensory) 3D environment and compare them for a path tracing task. We focus on secondary views displaying dataset regions hidden from the main view. In contrast to viewpoint or rendering changes in a main view, secondary views allow users to maintain a preferred view configuration and to simultaneously manage multiple projections, associated datasets, or filtering options (analogously to 2D windows). Understanding tradeoffs between different techniques and parameters will benefit VR-based scientific exploration applications such as geological interpretation. Although various researchers considered secondary views for VR (summarized in Section 2), there has been little evaluation of their effectiveness. Even when present, such evaluations have not directly compared the various view techniques in VR. The main contributions of this paper are:
• We describe 3D secondary views, including variations not previously considered (e.g., a "reach-in" mirror, as opposed to a view-only mirror).
• We experimentally compare different secondary views and show:
  o Reaching in is very important: secondary views should include 3D interaction with viewed objects, not merely provide visuals.
  o Users prefer 3D boxes to window-like view shapes.
  o For marking areas hidden from the main view: there can be differences between mirrored and rotated view presentations, depending on other factors such as task motion direction and hand orientation.
  o In a multimodal interface, there is interaction between view effects and task difficulty related to contact geometry and force feedback.
2 Related Works Various researchers considered secondary views, calling them windows [2, 3, 4, 5, 8], boxes [6], lenses [7], or mirrors [9, 10, 11, 12,13]. Viega et al. [7] extended 2D lenses into 3D “Volumetric Lenses”, where presentation in a box differed from surrounding view. Fuhrmann and Groller [6] refer to a similar concept as “Magic Box”. Borst et al. [8] describe it more generally as a “Volumetric Window” for managing multiple viewpoints. For simplicity, we call these views 3D boxes (Fig. 1). We focus on views that show hidden sides of objects by presenting rotated or mirrored views. Grosjean and Coquillart presented a “Magic Mirror” [9] analogous to a real mirror. Eisert et al. [10] and Pardy et al. [11] used virtual mirrors for augmented reality. König et al. [12] presented magic mirrors for volume visualization. Bichlmeier et al. [13] described a virtual mirror to reflect only virtual objects in augmented world. In this paper, we call these mirrored window views. We introduce “reach-in” mirrors and include an alternative called rotated window view (Fig. 1). Some techniques like World-in-miniature [1], tunnel window [4] and SEAMS [2] provide interaction or reaching in for manipulating distant objects or navigating between different virtual worlds. In our work, we use reach-in to surfaces that were already reachable without secondary views but that can’t be seen in the main view. Elmqvist and Tsigas [14] classified many techniques for 3D occlusion management, including techniques affecting the main view. For example, Flasar and Sochor [15] compared navigation (active) techniques for manipulating objects behind obstacles. In our work, we focus on more passive techniques that avoid affecting the main view, with results more aimed at understanding 3D windowing approaches. Numerous studies show that force feedback can affect performance. Typically these compare force feedback to no feedback, force-only to visual-only feedback etc.,
or haptic constraints to no haptic constraints. Some researchers, such as Faeth et al. [16], have used force feedback to aid operations on geological terrains. In our work, differences in force feedback are considered as they arise from different contact surface geometries and as effects may interact with view type.
Fig. 2. Left: VR system with force stylus and mirror-based 3D display, a common setup for colocated visuals and force feedback. Right: User’s view of terrain dataset and secondary view.
3 Implementation Notes

Our multimodal environment (Fig. 2) renders visual and force feedback. Secondary views in 3D box shapes are rendered using techniques described in [8]. Secondary views with a window shape are rendered using standard stencil-buffer mirror techniques, instead of texture mapping [9], to preserve depth and support reaching in. Both box and window shapes auto-orient based on head position and a point of interest (POI). The POI can depend on context. For example, it may be the position of a pointer so that the view follows the pointer. In our experiments, we define a fixed POI as the center of a bounding box of the path being traced. This keeps the path centered and visible in the secondary view.

3D box and window views differ in the way content is seen and changes. 3D box content depends only on the POI and centers it in the box with constant box-relative orientation, related to traditional volumetric view rendering. For the window shape, however, different content can be seen depending on the pose of the window and the POI, related to the usual way of rendering mirrors (although auto-orientation ensures that experiment paths are always centered in that view as well). For a 3D box, a rotated view involves rotating box content 180 degrees around a local box-centered and box-aligned vertical axis, and a mirrored view reflects the rotated view on a local horizontal axis. For window shapes, a rotated view rotates the original scene 180 degrees around a window-centered-and-aligned vertical axis, and a mirrored view is obtained by reflecting the original scene about the window plane.

We automated view position to address manual placement bias for the experiment. Placement involves constraints with respect to a fixed reference coordinate system. Considering a fixed right-handed frame with the X-axis being the VR display’s rightward
axis and the Z-axis being its forward-facing axis (towards the user), a reasonable (though not necessarily optimal) position for the 3D box center (x, y, z − (depth of 3D box)/2) and the window center (x, y, z) can be calculated as follows:

• x = x coordinate of the center of a bounding volume of the path.
• y = highest y coordinate of a point on the surface with respect to the reference coordinate system, plus (height of secondary view)/2, plus a small positive offset.
• z = smallest z coordinate (farthest from the user’s head) of the path, minus the depth (z size) of a bounding volume of the path.
The Z offset makes the reach distance of window and 3D box approximately equal. The small Y offset moves the secondary view above the terrain. For force rendering, we use a simple penalty-based method: force magnitude is proportional to stylus tip distance below a mesh surface, and force direction is the interpolated surface normal at the surface point directly above the stylus tip.
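As an illustration, the placement rule and the penalty-based force response can be sketched as follows (a minimal Python/NumPy sketch); the function names, the stiffness constant, and the bounding-box representation are illustrative assumptions, not taken from the authors' system.

```python
import numpy as np

def place_secondary_view(path_bbox_min, path_bbox_max, terrain_top_y,
                         view_height, view_depth, y_offset=0.02):
    """Place the secondary view above the terrain and behind the hidden path.

    path_bbox_min/max: axis-aligned bounds of the hidden path in the display
    frame (X rightward, Y up, Z toward the user). Returns the window center
    and the 3D-box center."""
    x = 0.5 * (path_bbox_min[0] + path_bbox_max[0])              # center of path bounds
    y = terrain_top_y + 0.5 * view_height + y_offset              # just above the terrain
    z = path_bbox_min[2] - (path_bbox_max[2] - path_bbox_min[2])  # behind the path
    window_center = np.array([x, y, z])
    box_center = window_center - np.array([0.0, 0.0, 0.5 * view_depth])
    return window_center, box_center

def penalty_force(tip_pos, surface_height, surface_normal, stiffness=600.0):
    """Penalty-based force: magnitude proportional to the penetration depth of
    the stylus tip below the mesh surface, direction along the interpolated
    surface normal at the point directly above the tip."""
    depth = surface_height - tip_pos[1]        # positive when the tip is below the surface
    if depth <= 0.0:
        return np.zeros(3)                     # no contact, no force
    n = surface_normal / np.linalg.norm(surface_normal)
    return stiffness * depth * n
```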
4 Experiment Methods

We conducted a within-subjects study comparing secondary views based on task time and error count (dependent variables) for a path tracing task. We also included a subjective preference session in which users compared certain conditions, switching a variable freely and indicating preference. For the objective portion, the independent variables, which produce 16 level combinations (conditions), are:

1. Reach mode (Reach-in, No reach-in)
2. Transform (Rotated, Mirrored)
3. Shape (3D box, Window)
4. Geometric guidance (With, Without)
When in Reach-in mode, a user reaches interactively into the secondary view to trace a path section, but otherwise the secondary view is used just for visual reference. The different levels of transform and shape were discussed in Section 3. Path tracing may be supported by surrounding geometry (with geometric guidance) or not (without geometric guidance). For example, tracing along a valley or crevice results in lateral force-feedback cues that may help keep the stylus positioned along the path, while tracing along a flat portion or over ridges lacks this guidance.

Path tracing task: The task is representative of annotative marking for dataset interpretation. Although interpreters typically mark along features not yet marked, subjects traced an existing marked path to reduce cognitive and domain-specific aspects. Part of each path was visible in the main view, but the remaining part was visible only in the secondary view. We were interested primarily in performance for tracing the hidden portion, including any time taken to transition between views. The study used four paths, shown in Fig. 3. Of the four paths, two have geometric guiding features in their hidden portions (the left two of Fig. 3) and the others do not. All paths had the same hidden portion length at the scale presented to subjects.
Hypotheses: Based on prior experience, we hypothesized that each independent variable was important and would impact performance or subjective preferences. We expected that reaching in improves speed and accuracy due to more direct interaction, even though transitioning between views involves extra time. We expected that the 3D box view would be preferred for visual appearance but did not know if this would be reflected in performance. We speculated that the mirrored view would perform better than the rotated view due to user familiarity with real-world mirrors. We expected that geometries producing lateral force guidance would be easier to trace. Finally, we expected interactions, i.e., more notable effects when the task was difficult with respect to certain variables. For example, guidance would be more important when not reaching in.
Fig. 3. Paths in the study (viewpoint differs from the experiment viewpoint for clarity). “S” and “E” were added to the figure to show start and end points. Circles were added to show the transition point beyond which the path was no longer visible in the main view from the subject’s perspective. All hidden portions have the same length.
4.1 Apparatus

We used a mirror-based “fish tank” display as shown in Fig. 2 to co-locate haptic and visual feedback. Its horizontal mirror reflects a monitor image so users move a stylus directly in a virtual space below the mirror. Monitor resolution was 1024 x 768 with a 100 Hz refresh rate, divided into left/right frames by CrystalEyes glasses. Head position was tracked with an Ascension Minibird tracker synchronized to the monitor refresh to reduce jitter. A Sensable Phantom Premium 1.5 provided stylus input and force feedback. The host machine was a standard Dell graphics workstation.

4.2 Subjects

24 subjects participated. 19 were male and 5 were female. Ages ranged from 21 to 38, with an average of 26. 22 subjects were right-handed and 2 were left-handed. 8 subjects reported previous exposure to VR, 11 reported moderate to high experience with video games, and 5 reported minimal or no experience with video games. Most subjects were students from computer science and engineering programs.

4.3 Main Experiment

The main experiment consisted of three sessions:

1. Practice (8 practice trials)
2. Session 1 (4 practice trials, 16 experimental trials)
3. Session 2 (4 practice trials, 16 experimental trials)
Experiment duration (including the subjective preference session) was typically 35-40 minutes. After Session 1, subjects were given a two-minute pause.

4.3.1 Procedure

Per trial, subjects traced a path (Fig. 4) starting from a blue dot initially indicated by a blue arrow. The arrow vanished once the blue dot was contacted. The subject then traced the path, and the contacted portion of the path turned black as it was traced. When the subject reached a pink-colored mark on the path, they switched focus to the secondary view, as the pink mark denoted the point after which the path was only visible in the secondary view. At that point, the subject either reached into the secondary view (reach-in) or used it only as a visual reference (no reach-in), as directed by an arrow. The arrow disappeared once the subject reached in (reach-in) or passed the pink mark (no reach-in). The subject then traced the remainder of the path to the end, denoted by a red dot, completing the trial. Additionally, whenever the subject moved off the path, recoloring stopped until the subject returned to the point where they left the path (the threshold for both is 1.5 mm from the point on the path). Thus, there was no way to trace the path without moving through every point along it. Subjects were told to trace the path “as quickly as is comfortable”. A counter at the end of the virtual stylus indicated elapsed time.
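As an illustration of this trial logic, the sketch below shows one way the recoloring and error rule could be implemented; the path sampling, data structures, and function name are assumptions rather than the authors' implementation.

```python
import numpy as np

def update_trace(tip_pos, path_points, progress, off_path, threshold=0.0015):
    """Advance path-tracing progress (threshold of 1.5 mm, in meters).

    path_points: ordered (N, 3) array sampling the path.
    progress: index of the last point already recolored.
    off_path: True while the subject has left the path.
    Returns (new_progress, new_off_path, error_increment)."""
    error = 0
    target = path_points[min(progress + 1, len(path_points) - 1)]
    if off_path:
        # Recoloring resumes only at the point where the path was left.
        if np.linalg.norm(tip_pos - path_points[progress]) <= threshold:
            off_path = False
    elif np.linalg.norm(tip_pos - target) <= threshold:
        progress = min(progress + 1, len(path_points) - 1)   # recolor next point
    else:
        # Leaving the path counts as one error and freezes recoloring.
        nearest = np.min(np.linalg.norm(path_points - tip_pos, axis=1))
        if nearest > threshold:
            off_path = True
            error = 1
    return progress, off_path, error
```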
Fig. 4. Different stages of path tracing. From left to right: before starting, at transition for no reach-in, at transition for reach-in, and after transition for no reach-in.
4.3.2 Condition Order and Randomization

The order of the 16 conditions in each of Sessions 1 and 2 was randomized with the following constraints. We minimized switching of reach mode by requiring the first 8 trials of a session to be either all reach-in or all no-reach-in cases (random per subject). Two practice trials reminded subjects of the reach mode after each switch. Within each resulting reach-in and no-reach-in set, the first 4 trials were all either rotated or mirrored cases (random per reach mode block). Within each rotated and mirrored set, the first 2 trials were all either 3D box views or window views (random per transform block). Within each resulting box or window view, there was one path with geometric guidance and one without (random per shape block). The two remaining paths appeared in the corresponding conditions during the other session.
4.4 Subjective Preference Experiment

In each of five preference trials following the main experiment, subjects indicated a preference after tracing a path and switching between techniques. Specifically, subjects compared reach-in to no-reach-in cases (under randomized transform and shape), rotated reach-in to mirrored reach-in (shape randomized), 3D box reach-in to window reach-in (transform randomized), and no-reach-in versions of the latter two. Trial order was randomized per subject. Subjects could repeatedly trace the path and freely switch between the relevant techniques by clicking a stylus button. Preference was indicated by a box click followed by a confirmation click.
5 Results and Discussion

5.1 Main Experiment Results

Task time was calculated as the amount of time taken to trace the hidden part of the path (including transition time). Error count was calculated as the number of times a subject moved off the path. Figures 5 and 6 summarize task time and error count means. We analyzed results with a four-way repeated-measures ANOVA per dependent variable, with Bonferroni correction for post-hoc tests.

Task time: Subjects traced paths faster when reaching in to secondary views than when using them only as a visual reference (F(1, 23) = 51.002, p < .001), with time averaging 56% shorter. Subjects were faster with 3D box shapes than with window shapes (F(1, 23) = 26.319, p < .001), averaging 15% shorter task time. We detected no statistically significant overall effect of transform on task time (F(1, 23) = 0.202, p = .657). Lateral geometric guidance improved task times (F(1, 23) = 31.849, p < .001), with an average 34% task time reduction over other paths (same path lengths). There were significant reach mode × guidance and transform × guidance interactions, (F(1, 23) = 14.790, p < .001) and (F(1, 23) = 9.504, p = .005), respectively. We investigated interactions with reduced-variable ANOVAs at fixed levels of variables of interest. The increase in task time from guidance to no guidance averaged 20% for reach-in compared to 68% for no-reach-in, indicating guidance was especially important in the more difficult case of no reach-in. For the transform × guidance interaction, mean task time for the mirrored view was shorter with geometric guidance and longer without geometric guidance when compared to the rotated view.

Error count: All independent variables affected error count. Subjects stayed on the path better when reaching in to a secondary view than when it was only a visual reference (F(1, 23) = 113.26, p < .001), with error count averaging 45% smaller. 3D box shapes were better than window shapes (F(1, 23) = 16.691), with error count averaging 12% smaller. Overall, rotated views produced fewer errors than mirrored views (F(1, 23) = 9.986, p = .004), averaging 13% lower. Geometric guidance reduced errors (F(1, 23) = 34.779, p < .001) by an average of 25%. There were significant reach mode × guidance and transform × guidance interactions, (F(1, 23) = 18.049, p < .001) and (F(1, 23) = 20.456, p < .001), respectively. We investigated these as we did for the task time interactions. Again, guidance was more
important in the no-reach case: there was a significant effect of geometric guidance for no-reach-in (F (1, 23) = 92.7, p < .001) but not for reach-in mode (F (1, 23) = .772, p = .389). And, transform was more important for no-guidance cases: there was a significant effect of transform for no-guidance cases (F (1, 23) = 26.095, p < .001), but not for guidance cases (F (1, 23) = 0.087, p = .771).
Fig. 5. Task time means and standard error bars for the 16 conditions
Fig. 6. Error count means and standard error bars for the 16 conditions
5.2 Subjective Preference Experiment Results

For each subjective preference question, each subject was given a score of zero or one depending on the technique selected. We used one-parameter two-tailed z-tests to detect significant differences in mean score from 50% (the no-preference score).

For reach mode: Significantly, all 24 subjects preferred reach-in to no reach-in. For transform: Significantly, 17 subjects preferred the mirrored to the rotated transform in no-reach-in cases (Z(24) = 2.041, p = .041). For reach-in cases, there was no statistically significant preference, with 12 subjects preferring each technique.
For shape: Significantly, 20 subjects preferred 3D box shapes to window shapes in reach-in cases (Z(24) = -3.326, p < .001) and 17 subjects preferred 3D box shapes to window shapes in no-reach-in cases (Z(24) = -2.041, p = .041).

5.3 Discussion

Our hypotheses are largely supported by the results, except that performance measures did not consistently favor the mirrored over the rotated transform. The most promising secondary view is a reach-in 3D box. Regarding transform type (mirrored vs. rotated), subjective preference and objective results differ, and there may be other factors to consider. For example, in some applications, interpreters may want to see a view that preserves the “handedness” of data, which is violated by mirrored views but not by rotated views.

Even though reaching in to a secondary view requires additional transition time, both task time and error count were still reduced significantly by reaching in (averaging 56% and 45%, respectively), and subjective preference results unanimously supported reach-in mode over no-reach. There are two aspects that make the task more difficult without reaching in: the interaction is less direct (not co-located), and the secondary view that is being used is flipped along some axis with respect to the required hand motion (for both mirrored and rotated cases).

3D box shapes were better than window shapes in terms of task time, error count, and subjective preference. Note the 3D box technique provides more consistent content, while window versions are more sensitive to position and orientation (Section 3). This makes placing windows more difficult, as there can be substantial deviation in viewed path orientation and depth with relatively small window position and orientation changes. Although we believe our window placements were good and well-matched to the 3D box versions, we cannot be sure they were optimal, and this illustrates the problem of sensitivity to placement. In real applications, there can be aspects preventing ideal placement, such as occluding objects or differences between ideal locations for reaching in (for comfortable depth) and ideal placement for viewed content orientation (so visuals match hand motion). Sensitivity to viewpoint is also a problem for head-tracked VR.

The lateral guidance from certain geometric features helps performance, but the extent depends on other view parameters. Paths with guidance averaged 34% faster task times and 25% fewer errors (both significant). Geometric guidance had a stronger influence when subjects could not reach in to the secondary view (i.e., when the task was more difficult).

Subjectively, subjects preferred mirrored over rotated views in no-reach mode but showed no preference in reach-in mode. Objective performance measures contrast with this by showing the rotated view had a lower error count. We believe that factors not explicitly studied affect performance results. For example, the position and direction of paths and the handedness of subjects (affecting pen tilt) may be influential.
6 Conclusions and Future Work

We discussed secondary views in a multimodal environment to overcome visual constraints (hidden features), and we compared different secondary views based on reach
mode, transform, and shape. Our study confirmed that a 3D box view with reach-in interaction was the best of the considered secondary views for a hidden path tracing task, and that a mirrored view appeals to users when not reaching in. Surface geometry impacts performance, particularly when users do not reach in: features that result in good lateral force cues help users overcome the indirect nature of no-reach interaction. For real terrain marking applications, the presence of such features hinges on the specific interpretation task, so it is important to optimize other view parameters for tasks where these features are lacking. Even though a mirrored secondary view was preferred based on subjective comparisons, the performance of mirrored and rotated views should be further studied with careful consideration of path orientations and of right- and left-handed subjects.
References

1. Stoakley, R., Conway, M., Pausch, R.F.: Virtual Reality on a WIM: Interactive Worlds in Miniature. In: CHI, pp. 265–272 (1995)
2. Schmalstieg, D., Schaufler, G.: Sewing Worlds Together with SEAMS: A Mechanism to Construct Complex Virtual Environments. Presence, 449–461 (1999)
3. Robertson, G.G., Dantzich, M.V., Robbins, D.C., Czerwinski, M., Hinckley, K., Risden, K., Thiel, D., Gorokhovsky, V.: The Task Gallery: a 3D window manager. In: CHI, pp. 494–501 (2000)
4. Kiyokawa, K., Takemura, H.: A Tunnel Window and Its Variations: Seamless Teleportation Techniques in a Virtual Environment. In: HCI International (2005)
5. Ware, C., Plumlee, M., Arsenault, R., Mayer, L.A., Smith, S.: GeoZui3D: Data Fusion for Interpreting Oceanographic Data. OCEANS, 1960–1964 (2001)
6. Fuhrmann, A.L., Gröller, E.: Real-time techniques for 3D flow visualization. IEEE Visualization, 305–312 (1998)
7. Viega, J., Conway, M., Williams, G.H., Pausch, R.F.: 3D Magic Lenses. In: ACM Symposium on User Interface Software and Technology, pp. 51–58 (1996)
8. Borst, C.W., Baiyya, V.B., Best, C.M., Kinsland, G.L.: Volumetric Windows: Application to Interpretation of Scientific Data, Shader-Based Rendering Method, and Performance Evaluation. In: CGVR, pp. 72–80 (2007)
9. Grosjean, J., Coquillart, S.: The Magic Mirror: A Metaphor for Assisting the Exploration of Virtual Worlds. In: SCCG, pp. 125–129 (1999)
10. Eisert, P., Rurainsky, J., Fechteler, P.: Virtual Mirror: Real-Time Tracking of Shoes in Augmented Reality Environments. In: ICIP (2), pp. 557–560 (2007)
11. Pardhy, S., Shankwitz, C., Donath, M.: A virtual mirror for assisting drivers. In: IV, pp. 255–260 (2000)
12. König, A., Doleisch, H., Gröller, E.: Multiple Views and Magic Mirrors - fMRI Visualization of the Human Brain. In: SCCG, pp. 130–139 (1999)
13. Bichlmeier, C., Heining, S.M., Feuerstein, M., Navab, N.: The Virtual Mirror: A New Interaction Paradigm for Augmented Reality Environments. IEEE Trans. Med. Imaging, 1498–1510 (2009)
14. Elmqvist, N., Tsigas, P.: A Taxonomy of 3D Occlusion Management Techniques. In: VR, pp. 51–58 (2007)
15. Flasar, J., Sochor, J.: Manipulating Objects Behind Obstacles. In: HCI (14), pp. 32–41 (2007)
16. Faeth, A., Oren, M., Harding, C.: Combining 3-D geovisualization with force feedback driven user interaction. In: GIS (2008)
Synthesizing Physics-Based Vortex and Collision Sound in Virtual Reality Damon Shing-Min Liu, Ting-Wei Cheng, and Yu-Cheng Hsieh Computer Science Department, National Chung Cheng University 168 University Road, Chiayi, Taiwan {damon,ctw98m,hych98m}@cs.ccu.edu.tw
Abstract. We present an integrated system for synthesizing realistic physically based sounds from rigid-body dynamic simulations. Our research endeavor is twofold, including vortex sound simulation and collision sound simulation. We synthesize vortex sound from moving objects by modeling the air turbulence produced by rapid object movements. We precompute sounds determined by different flow velocities, and later use a lookup-table scheme to retrieve the precomputed data for further synthesis. We also compute a modal model from prerecorded impact sounds to synthesize variations of collision sounds on the fly. Compared to using multiple prerecorded clips to provide sound variations, our system consumes less memory and can be further accelerated using SIMD instructions. Furthermore, we utilize OpenAL for fast hardware-accelerated propagation modeling of the synthesized sound. Keywords: Virtual reality, sound synthesis, audio rendering, physics simulation, interactive audio.
1 Introduction
One of the ultimate goals in computer graphics and virtual reality (VR) research is simulating virtual environments as realistically as possible. Besides visual simulation, audio simulation has also become an important component, and it provides information that cannot be conveyed by visual scenes alone. Therefore, many methods have been developed to automatically generate sound from the corresponding visual effects. These approaches can be roughly divided into two groups: sound propagation and sound generation. Sound propagation approaches focus on handling spatial sound effects relative to the surrounding geometry of the sounding object. Sound generation approaches, on the other hand, focus on automatically generating realistic sound from physical phenomena.

The goal of synthesizing sound in VR environments is producing realistic sound that corresponds with the visual scene. Some computer graphics techniques, such as rigid-body simulation or ray-tracing methods, are employed in the audio system to synthesize sound from the corresponding visual scene. Spatial sound can be simulated using ray-tracing methods to calculate sound propagation paths. Collision sound is usually synthesized by retrieving information from rigid-body simulations, such as the positions of collision points and the magnitudes of impact forces.
In this paper, we present an integrated method for synthesizing the vortex sound and collision sound generated by moving and colliding objects. We assume that sound from moving objects is caused by turbulence due to the object movements [1]. We also generate collision sound by extracting representative frequencies from prerecorded clips. The rest of our paper is organized as follows. In Section 2 we describe how our method is related to previous work. In Section 3 we give an overview of our techniques and describe their implementation. We then discuss results in Section 4, and conclude with ideas about future work in Section 5.
2 Related Work
In computer graphics, methods for synthesizing sound can be classified into two groups. One is simulating sound propagation, and the other is generating artificial sound. In the first group, sound reflection and absorption due to surrounding objects are simulated. These studies focus on varying sound according to the surrounding geometry [2] [3]. Methods in the second group, to which our research is related, compute sound waves generated by a vibrating source.

One of the most popular research topics in computer-generated sound is rigid-body collision sound simulation. These sound synthesis methods are typically based on physical phenomena. O’Brien et al. [4] deformed models and analyzed surface vibrations to generate sound in 2001. They proposed using modal analysis to synthesize sound [5] in the following year. Modal analysis models the sound of an object as a summation of sinusoids generated by oscillators, each of which oscillates independently. The object’s stiffness, damping, and mass matrices determine the frequencies of these independent modes. Chadwick et al. [6] simulated the non-linear sound of thin shells based on modal analysis. They used linear modal analysis to generate a small-deformation displacement basis, and then coupled the modes together using nonlinear thin-shell forces. Picard et al. [7] proposed a new synthesis approach for complex contact sound. Their approach reuses the visual texture of objects as a discontinuity map to create audible position-dependent variations during continuous contacts. In 2010, Ren et al. [8] enhanced complex contact synthesis by proposing a new three-level surface representation describing objects, visible surface bumpiness, and microscopic roughness to model surface contacts.

A more interesting topic is vortex sound, which is ubiquitous in reality. A basic theory of vortex sound was established by Lighthill in 1952 [9]. Dobashi et al. [10] proposed using Curle’s model [11] to synthesize sound produced by air turbulence. They modeled the behavior of air flow past a static object by analyzing the incompressible Navier-Stokes equations numerically, and computed Curle’s model using the pressure caused on the object’s surface. Dobashi presented another method [12] for creating sound from turbulent phenomena, such as fire, in the following year. He considered that the complex motion of vortices in a turbulent field would lead to vibrations, thereby producing sound. His method simulates vortex sound by computing vorticity distributions using computational fluid dynamics. Dobashi used time-domain methods to simulate vortex sound, which need shorter time steps to
simulate fluid dynamics. Although Dobashi used a sound texture map to address this issue, it needs extra memory to store the information. Our research, instead, proposes a frequency-domain synthesis method to simulate vortex sound, which shortens the time needed to compute the final output. This method achieves even better performance by reducing the fluid-dynamics simulation rate from auditory rates to visual rates.
3 Vortex Sound Simulation
We assume that aerodynamic sound is produced by vortices in the region surrounding the object [1]. We divide this neighborhood region into many small pieces, which we call cells, using mesh construction, and we consider each cell an independent sound source whose frequency is related to the cell velocity and the width scale of the vortex. Summing all sources produces the resulting sound. In order to compute the source conditions, we first simulate the fluid dynamics.

3.1 Fluid Simulation
Our system simulates fluid dynamics in every time interval. Before the fluid simulation, we set boundary conditions to ensure the fluid dynamics correspond with the object motions. We set the velocity fields to the same speed as, but opposite direction of, the object’s movement. We then translate the velocities of the boundary faces from the world coordinate system to an object-related coordinate system; subsequently we simulate the fluid dynamics. The fluid simulation receives position matrices at each time interval from the physics module and holds the previous matrices in order to compute the fluid boundary conditions. The first step of the boundary setting is to calculate the flow velocities passing a boundary face. The velocity of the boundary is described as

$$V_{BF} = \frac{M_{T+\Delta t}\,P - M_{T}\,P}{\Delta t} \,, \qquad (1)$$
where $V_{BF}$ is the velocity of a boundary face in the world coordinate system, $P$ is the position of the boundary face’s center in the object coordinate system, $M_{T+\Delta t}$ is the current transform matrix of the object in the world coordinate system, $M_{T}$ is the previous transform matrix of the object, and $\Delta t$ is the time interval. We translate from the world coordinate system to the object coordinate system by multiplying the inverse matrix $M_{T+\Delta t}^{-1}$ with the velocity of the flow passing the boundaries. The boundary velocity is computed using Equation (2), where $I$ is the identity matrix and $V_{B}$ is the boundary velocity in the object coordinate system:

$$V_{B} = M_{T+\Delta t}^{-1} V_{BF} = \frac{\left(I - M_{T+\Delta t}^{-1} M_{T}\right) P}{\Delta t} \,. \qquad (2)$$
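A minimal sketch of this boundary-condition computation is given below, assuming 4x4 homogeneous object-to-world transforms and homogeneous face-center coordinates; the interface is illustrative rather than the authors' implementation.

```python
import numpy as np

def boundary_velocity(M_prev, M_curr, P, dt):
    """Boundary-face velocity for the fluid boundary conditions (Eqs. 1-2).

    M_prev, M_curr: 4x4 homogeneous object-to-world transforms at T and T + dt.
    P: boundary-face center in object coordinates, homogeneous (x, y, z, 1).
    Returns the face velocity in world coordinates (Eq. 1) and in object
    coordinates (Eq. 2)."""
    P = np.asarray(P, dtype=float)
    V_BF = (M_curr @ P - M_prev @ P) / dt    # Eq. (1), world frame
    M_inv = np.linalg.inv(M_curr)
    V_B = (P - M_inv @ M_prev @ P) / dt      # Eq. (2): (I - M_curr^-1 M_prev) P / dt
    return V_BF[:3], V_B[:3]                 # note: V_B equals (M_inv @ V_BF)[:3]
```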
After setting the boundaries, we simulate fluid dynamics with incompressible flow. The equation of an incompressible flow is described as:
$$\frac{\partial U}{\partial t} + (U \cdot \nabla)U - \nabla \cdot (\mu \nabla U) = -\nabla p \,, \qquad (3)$$
where $\mu$ is the viscosity of the flow, $p$ is the pressure of the flow, and $U$ is the velocity of the flow. We use OpenFOAM [13] to solve this equation and obtain a velocity field of the fluid. We then translate the velocity field from the object coordinate system back to the world coordinate system, and send the velocity field to the sound-generating module.

3.2 Source Frequencies
In 1878, Strouhal performed an experiment with a circular cylinder and found the relationship

$$f d / v = St \,, \qquad (4)$$

where $f$ is the frequency, $v$ is the speed of the flow, and $d$ is the diameter of the cylinder [14]. $St$ is called the Strouhal number and is about 0.2 for the circular cylinder. The theory of vortex sound gives the relationship between eddy frequency and eddy width scale [1]:

$$f \sim v / l \,, \qquad (5)$$
where $f$ is the frequency, $v$ is the velocity, and $l$ is the eddy width scale. We assume that each cell is an independent source and that sound is produced by summing sinusoids from all cells. We also assume that the cell frequency is proportional to the cell’s velocity and inversely proportional to the width scale of the intersection of the object model and the vorticity surface. The vorticity surface is a surface perpendicular to the eddy surface whose normal is the vorticity of the flow. The cell velocity is obtained from the fluid simulation, and the width scale is obtained by the method shown in Fig. 1. First, we compute the vorticity for each cell and find the vortex surface. We intersect the vortex surface with the object model and find the width scale in this intersection. The width-scale direction is perpendicular to the vorticity and the cell center velocity. We compute the width-scale estimation vector as the cross product of the vorticity and the cell velocity and then find the width scale.
Fig. 1. Schematic of width scale assumption
3.3 Source Amplitudes
According to the theory of vortex sound [1], the sound radiation from an acoustically compact body in high-Reynolds-number turbulent flow can be described as

$$p(\mathbf{x},t) \approx \frac{-\rho_0\, x_j}{4\pi c_0 |\mathbf{x}|^2}\, \frac{\partial}{\partial t} \int (\boldsymbol{\omega} \times \mathbf{v})\!\left(\mathbf{y},\, t - \frac{|\mathbf{x}|}{c_0}\right) \cdot \nabla Y_j(\mathbf{y})\, d^3\mathbf{y} \,, \quad j = 1, 2, 3, \qquad (6)$$

where $p$ is the pressure the listener receives, $\rho_0$ is the density of the propagation medium, $c_0$ is the speed of sound, $\mathbf{x}$ is the listener’s position, $\boldsymbol{\omega}$ is the vorticity, $\mathbf{v}$ is the flow velocity, $\mathbf{y}$ is any point in the turbulent region, $j$ is the axis index, and $Y_j$ is called the Kirchhoff vector, which is equal to the velocity potential of an incompressible flow passing the object with unit speed in the $x_1$-, $x_2$-, and $x_3$-axis directions at large distances from the object. We regard the cells described above as independent sources, and every cell has uniform vorticity and velocity within the cell. Observing Equation (6), we assume that the pressure magnitude is mainly determined by the inner product of vorticity and velocity. The amplitude of a cell source is described as

$$A = V \cdot |\boldsymbol{\omega} \cdot \mathbf{v}| \,, \qquad (7)$$
where A is the amplitude, V is the cell’s volume, w is vorticity, and v is the cell’s velocity.
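The per-cell source parameters of Eqs. (5) and (7) can be sketched as follows; this is a sketch in which the proportionality constant of Eq. (5) is omitted (set to 1), the product in Eq. (7) follows the equation as printed, and the extraction of the width scale from the vorticity-surface intersection is assumed to happen elsewhere.

```python
import numpy as np

def cell_source(cell_velocity, cell_vorticity, cell_volume, width_scale):
    """Per-cell source frequency (Eq. 5) and amplitude (Eq. 7)."""
    v = np.asarray(cell_velocity, dtype=float)
    w = np.asarray(cell_vorticity, dtype=float)
    speed = np.linalg.norm(v)
    frequency = speed / width_scale if width_scale > 0 else 0.0   # f ~ v / l
    amplitude = cell_volume * abs(np.dot(w, v))                   # A = V * |w . v|
    return frequency, amplitude

def width_scale_direction(cell_velocity, cell_vorticity):
    """Direction along which the width scale is measured: perpendicular to
    both the vorticity and the cell-center velocity (their cross product)."""
    d = np.cross(cell_vorticity, cell_velocity)
    n = np.linalg.norm(d)
    return d / n if n > 0 else d
```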
4 Rigid-Body Sound Synthesis
Object collisions generate vibrations throughout the entire object; these vibrations propagate through the air as pressure waves, which we perceive as sound. Assuming that pressure waves caused by impact forces travel through an object, we can use a generalized damped wave equation to model the vibrations:

$$\frac{\partial^2 w}{\partial t^2} - k\,\frac{\partial w}{\partial t} = v^2\left(\frac{\partial^2 w}{\partial x^2} + \frac{\partial^2 w}{\partial y^2} + \frac{\partial^2 w}{\partial z^2}\right) , \qquad (8)$$
where $w(x, y, z, t)$ is the pressure of a point of the voxelized object at position $(x, y, z)$ in Cartesian coordinates and time $t$; $v$ is the wave velocity and $k$ is the damping factor. Using this model, we are able to define the value and location of an impact in the simulation. Compared to the spring-mass system developed by Raghuvanshi et al. [15] or other FEM systems [4][5][16], using a wave-equation-based model simplifies the vibration model, and it can be further accelerated by finite differencing the PDE [17]. Despite these advantages, it ignores shear forces caused by adjacent units in real objects. Ignoring such forces influences the final result and can lead to large errors compared with sound recorded from real objects.

4.1 Modal Synthesis
Lloyd et al. [18] exploited an alternative approach to extract characteristic sinusoids of a recorded sound clip and vary their amplitude to synthesize sound, which is
similar to the spectral modeling synthesis (SMS) [19] approach. These methods model the frequency spectrum of a sound rather than the underlying physics that generated it. We combine this idea with the frequency-domain synthesis method proposed by Bonneel et al. [20] to generate contact sound on the fly. Mathematically, the synthesized signal $x$ can be computed as

$$x(t) = \sum_{m=1}^{M} g_m \sin\!\left(2\pi f_m t + \phi_m\right) , \qquad (9)$$

where $M$ is the number of modes and $g_m$, $f_m$, and $\phi_m$ are the gain, frequency (Hz), and initial phase of each mode $m$, respectively. Using the Discrete Fourier Transform (DFT), we are able to analyze sound signals from a different perspective. We find it hard to distinguish the characteristics of most audio data in their waveform, since real signals are complicated and often vary in time. In contrast, their Fourier coefficients, which represent the frequency response, are easy to identify. For a complex array $X$ of size $N$, we can obtain a complex array $Y$ using the DFT:

$$Y_k = \sum_{n=0}^{N-1} X_n\, e^{-i 2\pi k n / N} \,. \qquad (10)$$
The inverse DFT (IDFT) of the array $Y$ can be described as

$$X_n = \sum_{k=0}^{N-1} Y_k\, e^{i 2\pi k n / N} \,. \qquad (11)$$
We use Fast Fourier Transform (FFT) algorithms to speed up the DFT operations; once we transform data from the time domain to the frequency domain, we can identify the concentration of frequencies in the signal from its spectrum. Although it is easier to reveal significant frequencies from our input data in its spectrum, we still face a problem: since real data are often not periodic, we get large errors if we simply apply the inverse FFT to the signal, and we lose the variation of frequencies over time in our data. In signal processing, a process called the Short-Time Fourier Transform (STFT) is usually used to obtain information in both the time and frequency aspects. Usually the STFT is performed by applying a window function $W$ to the input signal and performing the FFT with that window:

$$Y(m,k) = \sum_{n=0}^{N-1} X_n\, W_{n-m}\, e^{-i 2\pi k n / N} \,. \qquad (12)$$
If we use a rectangular window, where $W_n \equiv 1$, Equation (12) becomes

$$Y(m,k) = \sum_{n=0}^{N-1} X_n\, e^{-i 2\pi k n / N} \,, \qquad (13)$$
which is called the Rec-STFT. This special case of the STFT is equivalent to applying the FFT to the original signal. Using the Rec-STFT, we can simply break the original signal into chunks without convolving it with other window functions, which saves computational time at the cost of a slight quality loss in our results. Since our computations are highly data-parallel, our system can be further accelerated using SIMD instruction sets when performing the Rec-STFT. Our method is closely related to that of Bonneel et al. [20], which performs frequency-domain synthesis with audio frames of 1024 samples and 50% overlap to avoid long synthesis times. According to Bonneel et al., this leads to a 5 to 8 times speedup compared with traditional time-domain synthesis. Since the result of the Rec-STFT is divided into small chunks, we overlap the IFFT results of adjacent chunks by 50% and add the overlapping parts together to reconstruct the final signal. Such an operation usually produces very noticeable clicks caused by discontinuities at frame boundaries. To avoid this artifact, a window function is used to blend adjacent frames; we applied a Hann window, given by

$$w(n) = 0.5\left(1 - \cos\frac{2\pi n}{N-1}\right) , \qquad (14)$$

where $N$ is the width of the window and $n$ is an integer with $0 \le n \le N-1$. The amplitude of the sound is determined by the magnitude of the external force striking the object. We integrated our system with the Bullet Physics Engine, a professional open-source library, to handle object collisions and the calculation of external forces.
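A compact sketch of this synthesis step follows (Eqs. (9) and (14)); the 1024-sample frames and 50% overlap come from the text above, while the sampling rate, the origin of the frequency-domain chunks, and the function names are assumptions.

```python
import numpy as np

def synthesize_modes(gains, freqs, phases, duration, sr=44100):
    """Time-domain modal synthesis of Eq. (9): x(t) = sum_m g_m sin(2*pi*f_m*t + phi_m)."""
    t = np.arange(int(duration * sr)) / sr
    return sum(g * np.sin(2.0 * np.pi * f * t + p)
               for g, f, p in zip(gains, freqs, phases))

def overlap_add(chunks, frame=1024):
    """Reconstruct a signal from 50%-overlapping frames (e.g., per-chunk IFFT
    results), blended with a Hann window (Eq. 14) to suppress clicks at
    frame boundaries."""
    hop = frame // 2
    n = np.arange(frame)
    window = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (frame - 1)))  # Hann window
    out = np.zeros(hop * (len(chunks) - 1) + frame)
    for i, c in enumerate(chunks):
        out[i * hop:i * hop + frame] += window * c[:frame]
    return out
```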
5 Experimental Results
We have developed an integrated system that simulates both vortex sound and collision sound, and we have built several demonstrative examples on a typical PC using an Intel Core i5-680 CPU, 6 GB of memory, and an Nvidia GT-210 GPU.
Fig. 2. Demonstration of ten bowling pins struck by different objects. We shoot bowling pins by using a teapot (a) and a candle (b) model.
The combination of contact and vortex sound makes it possible to generate a compelling virtual reality environment. We provide three prerecorded clips, as shown in Table 1, to simulate collision sounds. Using our method, we are able to analyze our input with little precomputation time and to build a modal model corresponding to the original input. Our synthesis method requires only a short time during the simulation, making our work suitable for interactive systems. On the other hand, we synthesize vortex sound by simulating fluid behavior from recorded object traces. As shown in Table 2, the execution time of those experiments is proportional to the number of cells used to simulate vortex sound. We trade off the quality of our output by reducing the number of cells, without perceptible degradation. Furthermore, we reduce our simulation rate from 44,100 Hz to 30 Hz, which substantially reduces the precomputation time, allowing us to generate vortex sound in a reasonable time frame.

Table 1. Statistics of contact sound synthesis

Sound type   Wave file size (bytes)   Modeling time   Mode count   Synthesis time per collision
Metal        45,128                   18 ms           110          2 ms
Chime        77,800                   39 ms           285          3 ms
Wood         8,744                    6 ms            20           1 ms
Table 2. Statistics of aerodynamics sound synthesis

Model    Faces    Cells   Simulation period   CFD time    Synthesis sound time
Teapot   1,020    18k     16.3 s              8 m 23 s    2 m 45 s
Candle   1,662    26k     16.5 s              13 m 19 s   8 m 52 s
6 Conclusion
We have presented an integrated system for synthesizing realistic vortex sound and collision sound from rigid-body dynamic simulations. We use OpenFOAM to synthesize vortex sound from moving objects by modeling the air turbulence produced by rapid object movements. We also use modal synthesis to synthesize collision sounds in real time, with little loss in perceived sound quality. Our approach saves memory, is easy to implement, and takes advantage of existing hardware acceleration. We plan to improve our system by providing a more precise calculation of the width scale to determine vortex cell frequencies more accurately, by simulating the aerodynamics in real time, and by speeding up the modal synthesis through LOD methods in the synthesis step.

Acknowledgments. We thank the members of the Innovative Computing and Visualization Laboratory for providing many innovative and inspiring discussions.
References

1. Howe, M.S.: The Theory of Vortex Sound. Cambridge University Press, Cambridge (2003)
2. Funkhouser, T., Carlbom, I., Elko, G., Pingali, G., Sondhi, M., West, J.: A Beam Tracing Approach to Acoustic Modeling for Interactive Virtual Environments. In: Proc. of ACM SIGGRAPH, pp. 21–32 (1998)
3. Taylor, M.T., Chandak, A., Antani, L., Manocha, D.: RESound: Interactive Sound Rendering for Dynamic Virtual Environments. In: 17th International ACM Conference on Multimedia, pp. 271–280 (2009)
4. O’Brien, J., Cook, P., Essl, G.: Synthesizing Sounds from Physically Based Motion. In: SIGGRAPH 2001 Conference Proceedings, pp. 529–536 (2001)
5. O’Brien, J.F., Chen, C., Gatchalian, C.M.: Synthesizing Sounds from Rigid-Body Simulations. In: SIGGRAPH 2002, pp. 175–182 (2002)
6. Chadwick, J.N., An, S.S., James, D.L.: Harmonic Shells: A Practical Nonlinear Sound Model for Near-Rigid Thin Shells. ACM Trans. Graph., 1–10 (2009)
7. Picard, C., Tsingos, N., Faure, F.: Synthesizing Contact Sounds between Textured Models. In: Fifth Workshop on Virtual Reality Interaction and Physical Simulation (2008)
8. Ren, Z., Yeh, H., Lin, M.C.: Synthesizing Contact Sounds between Textured Models. In: IEEE Virtual Reality Conference, VR 2010, pp. 139–146 (2010)
9. Lighthill, M.J.: On Sound Generated Aerodynamically: I. General Theory. Proc. Royal Society London A221, 564–587 (1952)
10. Dobashi, Y., Yamamoto, T., Nishita, T.: Real-time Rendering of Aerodynamic Sound using Sound Textures based on Computational Fluid Dynamics. In: ACM TOG 2003, pp. 732–740 (2003)
11. Curle, N.: The Influence of Solid Boundaries Upon Aerodynamic Sound. Proceedings of Royal Society London A211, 569–587 (1953)
12. Dobashi, Y., Yamamoto, T., Nishita, T.: Synthesizing Sound from Turbulent Field using Sound Textures for Interactive Fluid Simulation. In: Eurographics 2004, pp. 539–546 (2004)
13. OpenFOAM, http://www.openfoam.com
14. Strouhal, V.: Ueber eine besondere Art der Tonerregung. Ann. Phys. Chem. (Wied. Ann. Phys.) 5, 216–251 (1878)
15. Raghuvanshi, N., Lin, M.C.: Interactive Sound Synthesis for Large Scale Environments. In: ACM SIGGRAPH Symp. on Interactive 3D Graphics and Games (I3D), pp. 101–108 (2006)
16. Chaigne, A., Doutaut, V.: Numerical Simulations of Xylophones. I. Time Domain Modeling of the Vibrating Bars. J. Acoust. Soc. Am. 101(1), 539–557 (1997)
17. Smith, G.D.: Numerical Solution of Partial Differential Equations: Finite Difference Methods, 2nd edn. Oxford University Press, Oxford (1978)
18. Lloyd, D.B., Raghuvanshi, N., Govindaraju, N.K.: Sound Synthesis for Impact Sounds in Video Games. In: ACM Proceedings I3D 2011: Symposium on Interactive 3D Graphics and Games, pp. 55–62. ACM Press, New York (2011)
19. Serra, X., Smith, J.: Spectral Modeling Synthesis: a Sound Analysis/Synthesis based on a Deterministic plus Stochastic Decomposition. Computer Music Journal 14, 12–24 (1990)
20. Bonneel, N., Drettakis, G., Tsingos, N., Delmon, I.V., James, D.: Fast Modal Sounds with Scalable Frequency-Domain Synthesis. ACM Transactions on Graphics (SIGGRAPH Conference Proceedings) 27(3) (2008)
BlenSor: Blender Sensor Simulation Toolbox Michael Gschwandtner, Roland Kwitt, Andreas Uhl, and Wolfgang Pree Department of Computer Sciences, University of Salzburg, Austria {mgschwan,rkwitt,uhl}@cosy.sbg.ac.at,
[email protected]
Abstract. This paper introduces a novel software package for the simulation of various types of range scanners. The goal is to provide researchers in the fields of obstacle detection, range data segmentation, obstacle tracking or surface reconstruction with a versatile and powerful software package that is easy to use and allows them to focus on algorithmic improvements rather than on building the software framework around it. The simulation environment and the actual simulations can be efficiently distributed with a single compact file. Our proposed approach facilitates easy regeneration of published results, thereby highlighting the value of reproducible research.
1 Introduction
Light Detection and Ranging (LIDAR) devices are the key sensor technology in today’s autonomous systems. Their output is used for obstacle detection, tracking, surface reconstruction or object segmentation, just to mention a few. Many algorithms exist which process and analyze the output of such devices. However, most of those algorithms are tested on recorded (usually not publicly available) sensor data, and algorithmic evaluations rely on visual inspection of the results, mainly due to the lack of an available ground truth. Nevertheless, ground truth data is the key element to produce comparative results and facilitate a thorough quantitative analysis of the algorithms. Some authors tackle that problem by implementing their own sensor simulations, but most home-brewed approaches follow unrealistic simplifications, just using subdivision methods to generate point clouds for instance. The software we propose in this article represents an approach to tackle that shortcoming: we provide a unified simulation and modeling environment which is capable of simulating several different types of sensors, carefully considering their special (physical) properties. This is achieved by integrating the simulation tool directly into Blender (http://www.blender.org), a 3-D content creation suite. With this combination it is possible to model the test scenarios with arbitrary level of detail and immediately simulate the sensor output directly within the modeling environment. The BlenSor toolkit (http://www.blensor.org) is completely integrated within Blender (see Fig. 1a) and does
Fig. 1. The sensor simulation interface is a part of the Blender GUI. It can be used just like any other feature of Blender: (a) every sensor has different parameters which can easily be modified and are stored in a .blend file; (b) example of a simple scan simulation. Single scans can be directly viewed and manipulated (and even analyzed) within Blender.
not require any custom scripts or tedious editing of configuration files to adjust the sensors. Yet, it is possible to access the underlying scanning functionality from custom code in case researchers want to modify the core functionality. The strong focus on offline data creation for algorithm development and testing allows BlenSor to concentrate on usability and features. BlenSor does not need to satisfy any external dependencies, for instance to enable compatibility with robotics frameworks. The output is either i) written to a file (in a format explained in Section 3.6) or ii) added as a mesh within the sensor simulation. This facilitates direct interaction with the simulated (i.e., scanned) data. Even though realtime capabilities have been left out on purpose, the simulation can be used together with Blender’s physics engine, thus enabling the simulation of complex scenarios with physical interaction of objects.
2 Previous Work
In [1], Dolson et al. generate range data for depth map upsampling by means of a custom OpenGL simulation. In [4], Meissner et al. simulate a four-layer laser range scanner using the ray-casting mechanism of the Blender game engine. Although this is a fast and straightforward way of simulating a laser range scanner, it comes with the disadvantage of having to cope with the restricted functionality of the game engine (e.g. limited set of materials, scalability issues, restrictions induced by graphics hardware, etc.). Bedkowski et al. [3] implement a custom simulation environment which provides an approximation of a laser scan performed by an LMS SICK 200. Their simulation, however, does not consider laser noise, and it is only a simulator: it requires external modeling tools to create the scene that is then simulated. To the best of our knowledge, the most advanced simulation system is proposed by Echeverria et al. [2]. The authors provide an approach for realtime robotics simulation (named MORSE) using Blender as the underlying simulation environment. It supports several robotics frameworks
and is meant for simulating the robots and studying their interaction with the environment. The sensors, particularly the LIDAR types, are just a means to an end for the simulation rather than the core component itself. In addition, the simulation of the sensors is relatively limited in terms of physical correctness, i.e., there is no noise or reflection modeling, and no Time-of-Flight camera is available either.
3 Sensor Simulation
Compared to robot simulation software [2,7], BlenSor focuses on the simulation of the sensors themselves rather than the interaction of sensor-equipped robots with the environment. In fact, we are able to care a lot more about specific sensor properties, since there are no realtime constraints. Such properties are, for example, a realistic noise model, physical effects like reflection, refraction and reflectivity, and sophisticated casting of rays that do not just describe a circle around the scanning center. The simulation accuracy can be increased with simple changes to the sensor code if features that are not yet available are required. The implementation details of the various sensor types in the following sections describe the simulation state at the time of writing. Due to the strong focus on offline simulation, we are able to simulate scenarios with a higher degree of detail than what is currently possible with existing robot simulators (e.g., MORSE [2]).

3.1 Scanning Principle
All sensors simulated by BlenSor basically rely on the fact that the speed of light is finite and that light is at least partially reflected from most surfaces. To be more specific, the measured reflection is affected by i) the traveling distance of the emitted light, ii) the amount of light arriving at the sensor and iii) the concrete measurement time. In general, one or more rays of light are emitted from a range measurement device in the form of a light pulse. The rays travel along straight lines to a potential object. Once the rays hit an object, a fraction of the light gets reflected back to the sensor, some part gets reflected in different directions, and another part may pass through the object (in the case of transparent materials) in a possibly different direction. This is in fact closely related to ray-tracing techniques in computer graphics. Thus the modification of a ray-tracing program to match the sensor characteristics seems just natural. Although Blender provides an interface to cast rays from within the Python programming language, the functionality is limited and runtime performance inevitably suffers due to the computational demand of simulating a huge number of laser rays. BlenSor tackles this problem by patching the Blender codebase to provide a way to cast several rays simultaneously. It also allows Python code to access the material properties of the faces that are hit by the rays. For increased efficiency, reflection is handled directly within Blender. Using this interface, sensors developed with the Python interface can set up an array of ray directions and hand the actual ray-casting over to the patched Blender core. Then, a raytree is built by Blender to allow efficient ray-casting.
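As an illustration, the per-scan ray directions for a rotating multi-beam scanner can be precomputed as below and then handed to the patched core for batched casting (the casting call itself is not shown); the default vertical field of view used here is an assumption for illustration, not a value taken from the paper.

```python
import numpy as np

def scan_directions(yaw_steps=2000, pitch_angles_deg=np.linspace(-24.8, 2.0, 64)):
    """Unit ray directions for one full rotation of a 64-beam rotating scanner.

    Returns an array of shape (yaw_steps * 64, 3); the batched ray cast against
    the scene is performed elsewhere (e.g., by the patched Blender core)."""
    yaws = np.linspace(0.0, 2.0 * np.pi, yaw_steps, endpoint=False)
    pitches = np.deg2rad(np.asarray(pitch_angles_deg))
    yaw_grid, pitch_grid = np.meshgrid(yaws, pitches, indexing="ij")
    dirs = np.stack([np.cos(pitch_grid) * np.cos(yaw_grid),
                     np.cos(pitch_grid) * np.sin(yaw_grid),
                     np.sin(pitch_grid)], axis=-1)
    return dirs.reshape(-1, 3)
```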
Fig. 2. Simulated features of different sensor types: (a) Backfolding effect of Time-of-Flight cameras; (b) Objects with low reflectivity (here: an object at 50 meter distance); (c) Totally reflecting surfaces, which cause points to appear farther away.
This modification processes all rays (and calculates reflections if needed) and returns the distances of the hits as well as the objectID for each ray. Eventually, the sensor code calculates sensor-dependent noise and other physical features. This is described in the following sections.

3.2 Rotating LIDAR
A rotating LIDAR has a sensor/emitter unit rotating around the center of gravity and thus creates a 360° scan of the environment. As a representative of this class of sensor, BlenSor implements a Velodyne HDL-64E S2 scanner. This sensor can detect objects with a (diffuse) reflectivity of 10% (= $r_{lower}$) at a distance of 50 meters (= $d_{lower}$) and objects with a (diffuse) reflectivity of 80% (= $r_{upper}$) at a distance of 120 meters (= $d_{upper}$). As already mentioned, the amount of light reflected back to the sensor depends on the distance of the object. The decrease in reflected light is compensated within the scanner electronics by lowering the threshold during the scan interval. Unfortunately, this process cannot be exactly reproduced by BlenSor, since the information about the threshold adaptation is not available from the manufacturer. It is however possible to approximate this process by means of linear interpolation of the minimum required reflectivity. We use the 10% and 80% marks listed in the data sheet of the sensor. Objects closer than 50 meters are detected as long as their reflectivity is > 0%. Objects at a distance (dist) between 50 meters and 120 meters are detected if their reflectivity is at least $r_{min}(dist)$, according to Eq. (1). These values can be easily adapted by the user if an empirical evaluation of the sensor provides different results than the information from the manufacturer, or if the user wants to simulate a different environment like haze or fog. As this effect is calculated on a per-ray basis, it is even possible that a single object is only partially visible if it has a low reflectivity and is far away from the scanner (cf. Fig. 2b).

$$r_{min}(dist) = r_{lower} + \frac{(r_{upper} - r_{lower}) \cdot dist}{d_{upper} - d_{lower}} \qquad (1)$$
Once all rays have been cast, we have to impose sensor-specific errors on the clean measurements ($dist_{real}$). Our error model currently consists of two parts:
first, a distance bias ($noise_{bias}$) for each of the 64 laser units. This bias remains the same in each rotation, but its noise characteristics can be changed by the user. Experiments with a real Velodyne HDL-64E S2 revealed that the reported z-distance of a plane normal to the laser’s z-axis may differ by up to 12 centimeters for any two laser units (combinations of a laser and a detector). This is close to the actual numbers provided in the sensor fact sheets. The second part of our error model accounts for the fact that each single measurement ($dist_{noisy}$) is subject to a certain noise as well. Thus a per-ray noise ($noise_{ray}$) is applied to the distance measurements. The final (noisy) distance is formally given by

$$dist_{noisy}(yaw, pitch_i) = dist_{real}(yaw, pitch_i) + \epsilon_{bias,i} + \epsilon_{ray} \qquad (2)$$

with $\epsilon_{bias,i} \sim \mathcal{N}(0, \sigma_{bias})$ and $\epsilon_{ray} \sim \mathcal{N}(0, \sigma_{ray})$, where $\mathcal{N}(\mu, \sigma)$ denotes a Normal distribution with mean $\mu$ and variance $\sigma$.
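A sketch of the reflectivity test of Eq. (1) and the error model of Eq. (2) is given below; the default parameter values follow the text, Eq. (1) is reproduced as printed, and the function names are illustrative rather than part of BlenSor's actual code.

```python
import numpy as np

def r_min(dist, r_lower=0.1, r_upper=0.8, d_lower=50.0, d_upper=120.0):
    """Minimum reflectivity needed for a return at distance `dist` (Eq. 1);
    closer than d_lower, any reflectivity > 0 produces a return."""
    if dist <= d_lower:
        return 0.0
    return r_lower + (r_upper - r_lower) * dist / (d_upper - d_lower)

def noisy_distance(dist_real, laser_bias, sigma_ray, rng=None):
    """Apply the two-part error model of Eq. (2): a fixed per-laser bias plus
    zero-mean Gaussian per-ray noise."""
    rng = rng or np.random.default_rng()
    return dist_real + laser_bias + rng.normal(0.0, sigma_ray)

# The per-laser biases are drawn once per simulation, e.g. for 64 laser units:
# biases = np.random.default_rng().normal(0.0, sigma_bias, size=64)
```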
3.3 Line LIDAR
As representative for the Line LIDAR type sensors BlenSor implements a hybrid scanner that can be best described as a combination of an Ibeo LUX and a SICK LMS sensor with a few modifications. According to the fact sheet of the Ibeo LUX sensor it can detect obstacles with a (diffuse) reflectivity of 10% up to 50 meter and has an average scanning distance of about 200 meter. The basic principle of measuring distances is described in Section 3.2. A Line LIDAR, however, implements a slightly different method to direct the rays. In contrast to the Velodyne HDL-64E S2 scanner, the line scanner has fixed laser emitters which fire at a rotating mirror. Depending on the position angle of the mirror, the rays are reflected in different directions. The measurement itself is the same as most other laser-based time of flight distance measurement systems. We highlight the fact that the rotating mirror does not only affect the yaw angle of the laser beams but also the pitch angle. In its initial position (i.e. yaw is 0◦ ) the mirror reflects the rays at the same yaw angle and with the same pitch angle between the rays as they are emitted by the lasers (cf. Fig. 3a). When the yaw angle of the mirror is in the range [0◦ , 90◦ ], the rays have a yaw and pitch angle which is different from the angles when emitted by the lasers (cf. Fig. 3b). Finally, when the mirror reaches a yaw angle of 90◦ , the pitch angle of all lasers becomes the same. The former pitch angle between the lasers has become the yaw angle between the lasers (cf. Fig. 3c). The noise model for the measurements is the same as in Section 3.2 due to the same scanning principle. 3.4
3.4 Time-of-Flight (ToF) Camera
In contrast to the LIDAR sensors of Sections 3.2 and 3.3, a ToF camera does not need a narrow focused beam of light for its measurements. Consequently, ToF cameras do not use lasers to emit the light pulse. Instead, the whole scene is illuminated at once and the Time-of-Flight is measured with a special type
(a) α = 0◦
(b) α ∈ [0◦ , 90◦ )
(c) α = 90◦
Fig. 3. The pitch and yaw angles of the outgoing rays are affected by the yaw angle α of the mirror as it rotates. Only in the mirror's initial position are the angles of the rays unaffected.
of imaging sensor. Compared to the LIDAR sensors, a ToF camera has the advantage of a substantially higher resolution, however at the cost of a limited measurement distance. In terms of simulation, a ToF camera does not differ much from the other sensors, though. The sensor has a per-ray noise but a higher angular resolution. While LIDAR sensors take a full scanning cycle (i.e. rotation) until they scan the same part of the environment again, subsequent scans of a ToF camera cover the same part of the environment. This may lead to ambiguities in the distance measurements: a signal emitted during one scan may be received in the subsequent scan, causing a wrong result. This effect is called backfolding: objects at a certain distance may appear closer than they really are (cf. Fig. 2a). Backfolding can be enabled in BlenSor, which causes all distance measurements in the upper half of the maximum scanning distance to be mapped into the lower half according to

dist_backfolding = dist_real,                      if dist_real < max_distance / 2
                   dist_real − max_distance / 2,   else.                             (3)
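A minimal sketch of the backfolding mapping in Eq. (3); the function name and the element-wise application to a range image are illustrative, not BlenSor's actual interface.

```python
import numpy as np

def backfold(dist_real, max_distance):
    """Map ranges in the upper half of the scanning range into the lower half (Eq. 3)."""
    dist = np.asarray(dist_real, dtype=float)
    return np.where(dist < max_distance / 2.0, dist, dist - max_distance / 2.0)

# Example: with a 10 m maximum range, a wall at 7.5 m folds back to 2.5 m.
print(backfold([2.0, 7.5, 9.9], max_distance=10.0))   # -> [2.  2.5 4.9]
```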
3.5 Reflection
A special property of all supported sensor types is the total reflection of rays. If a ray hits a reflecting surface it does not immediately produce a measurement. Instead, the ray is reflected at the intersection point with the object and may hit another object at a certain distance. The ray might get reflected again, or not hit an object within the maximum scanning range. Figure 2c illustrates the case when several rays reflected from an object hit another object with a reflectivity above the necessary measurement threshold. As a result, the measured points appear farther away than the object because the rays did actually travel a greater distance. The sensor, however, does not know this fact and consequently projects a virtual object behind the real one.
3.6 Ground Truth
An important advantage of BlenSor is the ease with which the ground truth for the simulated scenes can be generated. BlenSor currently supports two output possibilities:
1. The information about the real distance of a ray and the object identifier of the hit object is stored along with the clean and noisy real-world data. Every measurement consists of 12 data fields: the timestamp of the measurement, the yaw and pitch angles, the measured distance, the noisy distance, the x, y and z coordinates of the measured point (i.e. clean data), the coordinates of the noisy point, and the objectID of the object that was hit (see the sketch below).
2. BlenSor extends the Blender functionality to facilitate exporting of a floating point depth map, rendered at an arbitrary resolution. This depth map can then be used as ground truth for many algorithms that work on 2.5D data, such as the work of Dolson et al. [1].
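As an illustration of the first output option, the 12 fields listed above can be represented by a simple record type; the field names and ordering here are inferred from the description and are not BlenSor's actual file format.

```python
from dataclasses import dataclass

@dataclass
class ScanMeasurement:
    """One per-ray measurement with clean and noisy values plus ground-truth object ID."""
    timestamp: float
    yaw: float
    pitch: float
    distance: float         # clean range
    distance_noisy: float   # range after the sensor error model
    x: float; y: float; z: float                     # clean 3-D point
    x_noisy: float; y_noisy: float; z_noisy: float   # noisy 3-D point
    object_id: int
```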
4
Building a Simulation
To build a static or dynamic scene for sensor simulation, we can rely on the standard tools of Blender. Any object can be added to the simulation and objects can be imported from other .blend files. This resembles the situation of a 3-D modeling artist building a scene. Technically, there is no limit on the level of scene detail (except RAM, of course), but too much detail will result in considerable simulation times. Some material properties (for example the diffuse reflection parameter) have an impact on the sensor simulation. The materials can be distributed through .blend files and we have already made some available on the BlenSor website. This enables other researchers to reuse the materials in their own simulations. In BlenSor, the cameras are the placeholders for the actual sensor devices. Once the scene has been modeled and animated, the user selects a camera that is going to act as the sensor, adjusts its physical properties and eventually simulates the scanning process. No editing of configuration files or any manipulation of scripts is necessary. The simulation is started and configured directly from the camera settings panel. If the simulation is run in single scan mode, the user has the option to add the ground truth and/or the noisy real-world data to the scene (cf. Fig. 1b). This allows for a direct visual verification of the simulation. The scene can be easily adjusted and scanned again. Different scans can coexist in BlenSor, thus allowing a direct comparison of different sensor parameters as well as of the scene itself.
4.1 Using the Physics Engine
Physics simulation is possible through the internal physics engine of Blender. BlenSor can simulate any scene that can also be rendered. In order to simulate physical processes, we just need to set up the physics simulation and record the animation data while the physics simulation is running. This has the advantage
that the physics simulation needs to be run only once, while the actual sensor simulation can be run as many times as necessary without having to recalculate the physics.
4.2 Exporting Motion Data
To facilitate quantitative analysis of algorithms it is necessary to know the exact position and orientation of all (or at least several) objects in the simulation. The data of the objects can be exported as a text file describing the state of an object over the scan interval. The user can choose between exporting all, or only a selection of the objects in the scene. Exporting only selected objects may be beneficial for large and complex scenes. To export only selected objects the user literally selects one or more objects within Blender and calls the Export Motion Data functionality which was added by BlenSor.
5
Experimental Results
Our first experimental results in Fig. 4a show a crossing scene with four cars. The car closest to the camera also marks the position of the sensor. To demonstrate the strength of BlenSor, we use the Velodyne HDL-64E S2 sensor to scan the scene. Figure 4b shows the scene scanned with MORSE, and Fig. 4c shows the scene scanned with BlenSor. Compared to the BlenSor results, it is clearly visible that MORSE uses only a rudimentary simulation of the sensor. As a matter of fact, this is no real surprise since the primary focus of MORSE is on realtime simulation of whole robots and less on accurate simulation of sensors with all their properties. The BlenSor result, in contrast, shows a much denser scan and a noise level similar to what we would expect with a real Velodyne HDL-64E S2 sensor. It is also important to note that the pitch angles of the laser sensors used by Velodyne are not evenly spaced. Relying on an exemplary calibration file provided by Velodyne, we distribute the pitch angles correctly. In our second experiment, illustrated in Fig. 5, we scan a fairly complex scene with 237000 vertices. The terrain has been modified by a displacement map to resemble an uneven surface (e.g. an acre). Even though the scene is quite complex, the scanning time for a single simulation interval (in this case 40 ms) is still between 4.9 and 12.8 seconds (see Table 1 for details). Scanning was done on an
(a) Rendered scene
(b) Sim. using MORSE
(c) Sim. using BlenSor
Fig. 4. Simulation of a simple scene with MORSE and BlenSor using the implemented Velodyne HDL-64E S2 sensor
(a) Velodyne scan (b) Ibeo scan (c) ToF camera scan (d) Rendered scene (e) Ground truth
Fig. 5. Simulation of a scene with a large number of vertices. The scene consists of a rough terrain, simulating an acre, with a near collision of two cars. The figures in the top row show the simulated sensor output of BlenSor, the figures in the bottom row show the rendered scene (i.e. the camera view) as well as the ground truth (i.e. a 2000 × 2000 high-resolution depth map).

Table 1. Processing time in seconds of different sensors in a complex scene

Velodyne     Ibeo LUX     Time-of-Flight   Depthmap
8.462 [s]    4.943 [s]    5.290 [s]        11.721 [s]
Intel Core i5 2.53 GHz machine with 3 GB of RAM running a Linux 2.6.31-14 kernel. The average memory usage over the scan is 228 MB.
5.1 Reproducibility
One of the key motivations for developing BlenSor was to allow full reproducibility of research results. BlenSor stores all sensor settings in a .blend file. Further, the raw scan data can be provided as well in order to allow other researchers to make comparative studies without having to run the simulation again. Storing all needed information in one compact file makes it extremely easy to share the simulation setup, and it enables other researchers to easily modify, adapt or extend the scenarios.
5.2 Scalability
Although sensor simulation is usually a resource intensive task, smaller scenes are rendered almost in realtime by BlenSor. Larger and/or more complex scenes may
require substantially more processing time, though. To cope with this problem, BlenSor is designed to allow distribution of the .blend file to multiple hosts by splitting the simulated time interval into corresponding sub-intervals. Since the parts are non-overlapping, each host (or thread) can work on its specific sub-interval. Since we do not make use of GPU processing power (which is usually the case for simulators that rely on a game engine), we can also run several instances of the simulation on a multi-core machine at the same time.
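A minimal sketch of this partitioning scheme: the simulated time interval is split into non-overlapping sub-intervals, one per host or worker process. The helper below is illustrative; BlenSor's own distribution mechanism may differ.

```python
def split_interval(t_start, t_end, n_workers):
    """Split [t_start, t_end) into n_workers contiguous, non-overlapping sub-intervals."""
    step = (t_end - t_start) / n_workers
    return [(t_start + i * step, t_start + (i + 1) * step) for i in range(n_workers)]

# Example: a 10 s simulation distributed over 4 hosts, each scanning 2.5 s of the scene.
print(split_interval(0.0, 10.0, 4))   # [(0.0, 2.5), (2.5, 5.0), (5.0, 7.5), (7.5, 10.0)]
```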
6
Conclusion
In this article we introduce a software tool for reproducible research in range data processing. Due to the strong linkage between simulation and modeling, creation of ground truth data is very simple. In fact, BlenSor considerably simplifies the simulation of otherwise untestable scenarios (e.g. crashes). At the time of writing, all implemented sensor types already produce data that closely resembles the output of real sensors. We hope that this software encourages reproducible research in the respective fields and simplifies the distribution of test data for comparative studies. There is also good reason to believe that the functionality of BlenSor allows more researchers to develop algorithms for range scanner data without having to possess the physical sensor. Future work on BlenSor will include support for the mixed-pixel error [5,6], refraction and, of course, additional sensors (e.g. Hokuyo and SICK sensors).
References
1. Dolson, J., Baek, J., Plagemann, C., Thrun, S.: Upsampling range data in dynamic environments. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, CA, USA, pp. 1141–1148 (2010)
2. Echeverria, G., Lassabe, N., Degroote, A., Lemaign, S.: Modular open robots simulation engine: Morse. In: Proceedings of the IEEE Conference on Robotics and Automation (ICRA 2010), Shanghai, China (2011)
3. Kretkiewicz, M., Bedkowski, J., Mastowski, A.: 3D laser range finder simulation based on rotated LMS SICK 200. In: Proceedings of the EURON/IARP International Workshop on Robotics for Risky Interventions and Surveillance of the Environment, Benicassim, Spain (January 2008)
4. Meissner, D., Dietmayer, K.: Simulation and calibration of infrastructure based laser scanner networks at intersections. In: Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2010), San Diego, CA, USA, pp. 670–675 (2010)
5. Huber, D., Tang, P., Akinci, B.: A comparative analysis of depth-discontinuity and mixed-pixel detection algorithms, Los Alamitos, CA, USA, pp. 29–38 (2007)
6. Gregorio-Lopez, E., Sanz-Cortiella, R., Llorens-Calveras, J., Rosell-Polo, J.R., Palacin-Roca, J.: Characterisation of the LMS200 laser beam under the influence of blockage surfaces. Influence on 3D scanning of tree orchards. Sensors 11(3), 2751–2772 (2011)
7. Vaughan, R.: Massively multi-robot simulation in Stage. Swarm Intelligence 2(2), 189–208 (2008)
Fuzzy Logic Based Sensor Fusion for Accurate Tracking Ujwal Koneru, Sangram Redkar, and Anshuman Razdan Arizona State University
Abstract. Accuracy and tracking update rates play a vital role in determining the quality of Augmented Reality (AR) and Virtual Reality (VR) applications. Applications like soldier training, gaming, simulations and virtual conferencing need high-accuracy tracking with an update frequency above 20 Hz for an immersive experience of reality. Current research techniques combine more than one sensor, such as cameras, infrared sensors, magnetometers and Inertial Measurement Units (IMUs), to achieve this goal. In this paper, we develop and validate a novel algorithm for accurate positioning and tracking using inertial and vision-based sensing techniques. The inertial sensing utilizes accelerometers and gyroscopes to measure rates and accelerations in the body-fixed frame and computes orientations and positions via integration. The vision-based sensing uses a camera and image processing techniques to compute the position and orientation. The sensor fusion algorithm proposed in this work uses the complementary characteristics of these two independent systems to compute an accurate tracking solution and minimizes the error due to sensor noise, drift and the different update rates of the camera and the IMU. The algorithm is computationally efficient, is implemented on low-cost hardware and is capable of an update rate up to 100 Hz. The position and orientation accuracy of the sensor fusion is within 6 mm and 1.5◦. By using fuzzy rule sets and adaptive filtering of the data, we reduce the computational requirements below those of conventional methods (such as Kalman filtering). We have compared the accuracy of this sensor fusion algorithm with a commercial infrared tracking system; the accuracy of this COTS IMU and camera sensor fusion approach is as good as that of the commercial tracking system at a fraction of the cost.
1
Introduction
The goal of tracking is to maintain a continuous estimate of the 3D pose and position of the object/user of interest. The user's AR/VR experience depends on the accurate positioning of objects in 3D. For tracking, we can use a wide range of sensors, for example vision-based camera/infrared sensors, lasers, inertial sensors, ultra-wideband technology, RFID, radio frequency tagging, etc. Each sensor system has its own limitations which constrain it to a specific application. For instance, vision-based sensors have very good accuracy but a very low update frequency. Thus, they cannot be used for highly dynamic tracking applications, or outdoors due to lighting conditions, but they serve
very well in controlled environments. Welch and Foxlin [1] researched this and listed the pros, constraints and best possible accuracy of several systems. In general, most augmented reality applications demand a high update frequency and accuracy with minimal constraints. This cannot be achieved by a single sensor, but it can by using a combination of sensors that are complementary in nature. For example, a camera and an Inertial Measurement Unit (IMU) form a complementary sensor pair. A low-cost vision sensor has a low update frequency due to computation demands and the line-of-sight constraint. Vision-based sensors also suffer from artifacts introduced in the images by lighting conditions and dynamic motion of the camera. On the other hand, a low-cost Micro-Electro-Mechanical System (MEMS) IMU operates at a very high update frequency (100 Hz–1 kHz) and has very high measurement precision (0.2 mm and 0.36◦). All the same, the accuracy of the system decays with time due to drift and noise. This error is common to all zero-referencing systems. This paper addresses the sensor fusion of an IMU and a camera to achieve sub-centimeter position and sub-degree orientation accuracy with an update rate up to 100 Hz. Using the inherent complementary nature and error characteristics of the sensors, we try to minimize errors. This sensor fusion methodology is validated via experiments.
2 Previous Work
2.1 Tracking Markers
Tracking systems can use retro-reflective markers, natural features or pre-defined markers for tracking. To meet our design goals of a rapid rate of detection and low cost, we cannot use infrared-based systems or natural markers. Owen et al. [2] researched the question of what the best fiducial marker would be. They listed the requirements of a good fiducial marker as being a large recognizable marker set with simple and fast detection, pose estimation and identification. The use of fiducial markers attained widespread popularity with the introduction of an open source library called ARToolkit. ARToolkit was developed by Hirokazu Kato and Mark Billinghurst as part of an augmented reality video conferencing system [3]. The software is still actively updated by virtual reality researchers and hobbyists. ARToolkitPlus [4] is an extension of the ARToolkit software library developed as part of the Handheld Augmented Reality Project [5]. It added more accurate tracking and more environment-resilient algorithms to the original ARToolkit software library. The new version includes adjustment of the illumination threshold which is used to filter the marker. This improves the detection rate in indoor environments, which can exhibit more specular and bloom effects due to artificial lighting.
2.2 Sensor Fusion Techniques
Kalman filtering is a widely used method for eliminating noisy measurements from sensor data during sensor fusion. Kalman filtering can be considered as a
subset of statistical methods because of the use of statistical models for noise. Two Kalman filters are used by Paul and Wan [6] to ensure accurate state estimation and terrain mapping for navigating a vehicle through unknown environments. The state estimation process fused the information from three onboard sensors to estimate the vehicle location. Simulated results showed the feasibility of the method. Han et al. [7] used a Kalman filter to filter DGPS (Differential Global Positioning System) data for improving positioning accuracy in parallel tracking applications. The Kalman filter smoothed the data and reduced the cross-tracking error. Based on the good results obtained in this previous research on fusion and noise reduction, a Kalman filter was initially chosen for our research as the method to perform the fusion and filter the noise in the sensor measurements. The use of a Kalman filter with fixed parameters has drawbacks. Divergence of the estimates is common with fixed parameters, wherein the filter continually tries to fit a wrong process (Fitzgerald [8]). To overcome the errors due to divergence, several extensions to the Kalman filter were proposed [9]. Fuzzy logic is used by Abdelnour et al. [10] for detecting and correcting the divergence. Sasiadek and Wang [11] used an extended Kalman filter to reduce the divergence for an autonomous ground vehicle. The extended Kalman filter reduced the position and velocity error when the filter diverged. The use of fuzzy logic also allowed a lower-order state model to be used. For our research, fuzzy logic is used in addition to Kalman filtering to overcome divergence and to update the reliability parameter in the filter.
3
Quaternion Fuzzy Logic Based Adaptive Filtering
Linear and nonlinear Kalman filters try to solve the system by modeling the error. In fuzzy logic systems, the predictable outcome is used to determine the rule table using an objective analysis [12]. By using the varying or dependent parameters as input variables, the system tries to find the best outcome for new inputs. Objective tests with varying rule bases are carried out to determine the best outcomes. Applying fuzzy logic to any problem involves three major steps:
– Fuzzification process.
– Inference using a rule base.
– Defuzzification process.
The fuzzification process assigns a degree of membership (represented as μ) to the inputs over the universe of discourse. The possible range of input values is divided into a set of classes (e_0, ...) and their boundaries (E_0, E_1, ...) are used to classify the input. Given a value, the degrees of membership (values between 0 and 1) for the appropriate fuzzy classes are calculated. The degree of membership defines the state of the input in terms of fuzzy rules.

μ_e0 = min(μ_E0, μ_E1)    (1)
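As an illustration of the fuzzification step, the sketch below assigns degrees of membership using triangular membership functions over a set of error classes; the class centers, widths and function names are illustrative assumptions, not values from the paper.

```python
def triangular(x, left, center, right):
    """Degree of membership (0..1) of x in a triangular fuzzy class."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

# Five error classes, e.g. negative-large ... positive-large, centered on the error bounds.
CENTERS = [-2.0, -1.0, 0.0, 1.0, 2.0]

def fuzzify(value):
    """Return the degree of membership of `value` in each class."""
    return [triangular(value, c - 1.0, c, c + 1.0) for c in CENTERS]

print(fuzzify(0.3))   # mostly the 'zero' class, partly 'positive-small'
```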
The second step is inference using a rule base. A rule base is a set of rules that are used to determine the outcome given a degree of membership in a set of fuzzy classes. For example, a rule for the IMU can be stated as "If the gyro error is high and the total acceleration is low, then use the accelerometer data to compute attitude and update the gyroscope output accordingly". The fuzzy classes determine which inference rules apply for a given input. An inference table is built that enumerates all possible fuzzy class combinations, depending on the type of application [13]. The precedent parts of a rule are ANDed before the inference rules are executed. The truth value for the ith rule is obtained by

μ_ij = μ_i[e_0] · μ_j[δt]    (2)
In the defuzzification step, we finally extract the output by using the inference rules. For tracking applications, the defuzzified output is the gain of the fusion system. The precedent and the inference rule undergo a binary 'AND' to form the defuzzified output for each rule base. The final value is the aggregation of the outcomes of all rules:

Defuzzified output = Σ_{i,j} μ_ij · Rulebase[i][j]    (3)

The block diagram for the IMU fusion algorithm is shown in figure 1. The fuzzy estimator block tunes the gain (K in the diagram) of the error correction loop. The error in the angular values and the error rate of the system are used as inputs to the fuzzy estimator. The fuzzy estimator then determines the certainty of the degree of membership for each class. The defuzzified output is the tuned gain, which is determined by considering the change of error and the dynamics of the system [14]. The fuzzy classes and rule table for a sample pitch angle are shown in figure 3.1. If the error (e) in Euler angle k is zero, the degree of certainty, μ0 (center membership function), is one and all others are zero. As the error changes, the degree of certainty changes and other μ have non-zero values. Thus, the errors are encoded by the degree of certainty of their error bounds.
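Building on the fuzzification sketch above (it reuses the `fuzzify` helper), the following snippet evaluates Eq. (2) and Eq. (3) for a small 5 × 5 rule table; the gain values in the table are placeholders, not the paper's tuned rule base.

```python
# Placeholder 5x5 rule table: rows index the error class, columns the change-in-error
# class, and each entry is the correction gain proposed by that rule.
RULEBASE = [
    [0.9, 0.8, 0.7, 0.6, 0.5],
    [0.8, 0.6, 0.5, 0.4, 0.3],
    [0.7, 0.5, 0.1, 0.5, 0.7],
    [0.3, 0.4, 0.5, 0.6, 0.8],
    [0.5, 0.6, 0.7, 0.8, 0.9],
]

def fused_gain(error, delta_error):
    """Defuzzified gain: aggregate the rule outcomes weighted by Eq. (2) truth values."""
    mu_e, mu_de = fuzzify(error), fuzzify(delta_error)
    gain = 0.0
    for i, mi in enumerate(mu_e):
        for j, mj in enumerate(mu_de):
            mu_ij = mi * mj                   # Eq. (2): ANDed premise truth value
            gain += mu_ij * RULEBASE[i][j]    # Eq. (3): aggregation over all rules
    # With the triangular classes above, the weights mu_ij already sum to one.
    return gain

print(fused_gain(0.3, -0.1))
```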
(a) Intra IMU Processing
(b) Complete system
Fig. 1. Block diagram of the intra-IMU processing and of the complete system using adaptive error correction
(a) Within IMU
(b) Between IMU and Camera
Fig. 2. Rule tables for adaptive filtering
The values for the error bounds (E_1, E_2) can be determined using center clustering techniques. Likewise, input membership functions are determined for the change in error. For five error input membership functions and five change-in-error input membership functions, twenty-five rules result, as seen in figure 3.1. Any membership function with a non-zero degree of certainty is said to be 'on' and the corresponding rule is also active. If both the error and the change in error are small enough to fall within the smallest error bounds (−E_1 to +E_2), i.e., if e is zero and the change in e is zero, then the correction is zero. The certainty of the premise, i, is given by:
1. μ_i = min(μ_e0, μ_Δe0)
In general, the rules are given as:
2. If μ̃_ei is A^j_ei and μ̃_Δei is A^k_Δel then ε_i = g_i(•) and ε̇_i = h_i(•)
where the symbol "•" simply indicates the AND argument.
3.1 Sensor Fusion of Camera and IMU
To extend this fuzzy logic sensor fusion idea to the IMU-camera sensor fusion, we need to study the error characteristics of both devices. Camera data has better accuracy in distance measurement than in attitude computation. On the other hand, accelerometers cannot be used to find the position over longer periods, as the drift incurred by the measurement is integrated twice to compute position. The camera input can be used along with the accelerometer equations to minimize the drift in the gyroscope. Under near-static conditions, angles computed from the camera measurements are used to correct the gyroscope output directly. The block diagram of the camera-IMU sensor fusion system is shown in figure 1. The corresponding rule table for the IMU-camera fusion is presented in figure 3.1. The tracking solution leans towards the IMU if the camera updates are not valid or are missed. When the camera updates are available and within the error bounds for a quasi-static object, the solution leans towards the camera output.
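A minimal sketch of how such a fused attitude update might look: the gyroscope is integrated at a high rate, and whenever a valid camera measurement arrives its angle is blended in with a weight taken from a fuzzy gain such as the one sketched earlier. The blending scheme and parameter names are illustrative assumptions, not the paper's exact update equations.

```python
class FusedAttitude:
    """One-axis toy example: gyro integration corrected by occasional camera angles."""

    def __init__(self, angle=0.0):
        self.angle = angle

    def gyro_update(self, rate, dt):
        # High-rate prediction (e.g. 100 Hz): integrate the angular rate.
        self.angle += rate * dt

    def camera_update(self, camera_angle, gain):
        # Low-rate correction (e.g. 30 Hz): pull the estimate towards the camera angle.
        # `gain` in [0, 1] would come from the fuzzy rule base (lean towards the camera
        # when it is valid and the platform is quasi-static, towards the IMU otherwise).
        self.angle += gain * (camera_angle - self.angle)

est = FusedAttitude()
est.gyro_update(rate=0.02, dt=0.01)
est.camera_update(camera_angle=0.0, gain=0.3)
```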
4 Verification and Results
4.1 Setup
For the prototype development, we chose a Sparkfun[15] six degrees of freedom IMU V4 with Bluetooth capability. This IMU V4 provides 3 axes of acceleration
Fig. 3. Sensor fusion hardware setup
data, 3 axes of gyroscopic data, and magnetic data. Each independent channel and the sampling frequency are user selectable. The device can transmit in ASCII or binary format, and can work over a wireless Bluetooth link or a wired TTL serial connection. Control is provided through an LPC2138 ARM7 processor with additional memory for custom code development. For the prototype development, we use Bluetooth to transfer the entire data stream to an external computer where the data is processed. The working frequency is set to 100 Hz. This is an ideal frequency to meet the accuracy requirements of the application and at the same time remain within the computational constraints of the system. A 3.8 V battery pack is used as the power source for the device. For the vision sensor we use Point Grey's [16] Chameleon camera. It is a USB 2.0 digital video camera that can record up to 30 frames per second at 1024 × 768 resolution. The camera uses a 1/3" Sony sensor for image capture. We chose a wide-angle lens from Computar [17] for this camera. The camera is mounted along with the IMU on a rigid body. The camera and the IMU transmit data to the laptop using a USB interface and Bluetooth, respectively, as shown in Figure 3. The data acquired from the camera is sent to ARToolkitPlus [4], which extracts the marker position in the captured image. The extracted 3D markers are inverse transformed to get the camera coordinates. The relative motion of the camera can be determined by tracking the changes in the camera coordinates over time. The IMU data is filtered using FIR filters on the chip to remove noise and bias from the readings. The software for the sensor fusion is written in C and C++ for compatibility with the existing ARToolkit and ARToolkitPlus libraries.
4.2 Results
The camera as a standalone device is tested to measure its accuracy and precision. The test environment had stable lighting conditions. Markers of varying size, arranged as matrices, are printed and used. The multiple marker sizes allowed us to test the accuracy over longer ranges such as 3–5 m. The test runs involved a static case to observe the stability of the camera data; the remaining cases were dynamic and involved simple rotation or translation maneuvers.
(a) Camera observations for translation along the x axis under static conditions (b) Camera observations of translation along the x axis under dynamic motion
Fig. 4. Camera output plots showing the high noise in angle as compared to distance. The error is very stable giving consistent accuracy.
As we observe from figure 4(a), the camera data has significant noise. The standard deviation of the static measurement is around 0.31 mm in distance and 0.42◦ in angle. The noise is worse in the angular measurement due to the limitations of floating point computations. After filtering the data, the standard deviation in the static case is around 0.28 mm in distance and 0.34◦ in angle. Under motion, the camera data was fairly accurate as long as the velocity was less than 2 cm in 1/30 second (per camera cycle), i.e., 60 cm/second. At higher velocities, noise grows exponentially due to blur in the image frames, which often leads to false positives. The accuracy is around 0.8 mm in distance and 1.5◦ in angle under controlled dynamic conditions. With motion rates beyond 60 cm/second, the algorithm was unable to identify markers. One of the problems with the IMU sensor fusion is that only two angles, roll and pitch, can be updated from the accelerometer. The gyroscope along the z axis is parallel to the gravity vector and hence the yaw is never compensated by the accelerometer observations. The yaw accuracy degrades and, due to the interdependence, indirectly affects the accuracy of other values such as pitch, roll and the translation values. While the errors in pitch and roll are corrected from the accelerometer over time, the errors in velocity and acceleration induced by the yaw error can never be corrected by the system, so a permanent error is introduced which keeps increasing. Yaw data over time is shown in figure 5.
4.3 Verification
To verify the algorithm's performance and precision for complex dynamic motions in 3D, we used an Infrared (IR) based tracking system from Vicon [18]. The system uses seven IR cameras to cover a 5 m × 5 m × 5 m volume. It tracks markers that are 2.5 mm in diameter with 0.2 mm accuracy in distance and 0.35◦ in 3D angular measurements (see figure 5 for the setup).
(a) Yaw cannot be corrected from the accelerometer (b) Infrared system setup for validation
Fig. 5. The Vicon Infrared setup has seven cameras covering a 5 m × 5 m × 5 m volume
In the setup, the cameras and markers are calibrated accurately. The camera calibration involves the Matlab calibration toolbox. The camera parameter matrix is used by ARToolkitPlus in the camera-based transformations. The markers must be accurately calibrated for the camera coordinate system to be defined accurately. The height and width of the markers need to be very accurate to compute the position of the camera coordinates. We used a ruler capable of 0.5 mm precision to measure the marker height and width. The infrared tracking system from Vicon is provided with software modules for tracking; their output directly gives the distance and angular observations. The outcome of the test cases is presented below. Figure 6 presents the test run of the Infrared verification setup. This test case involved two dynamic zones separated by a fairly static region. This static zone is intended to reveal the time delay with which the algorithm recovers its accuracy. The results have an accuracy of 6 mm in distance and 1.3◦ in angle. However, all test cases had at least one stable camera observation once every two minutes. Hence, the precision is guaranteed, given that the system gets a camera update once every 2 minutes. The fuzzy logic algorithm tunes very quickly compared to Kalman filters, based on the results presented in [6], [7]. The average observation time needed with a Kalman filter was around 3–4 seconds. This is due to the fact that a number of parameters have to be adjusted and recomputed during the model prediction and estimation stages. With fuzzy logic, however, the algorithm tunes 4 times faster given a good rule base. The algorithm can be made to tune faster simply by adjusting the rule base data based on the objective tests. It can be observed from figure 6(a) that the IMU solution and the IR system solution match fairly well when the system is dynamic. The IMU solution drifts from the IR solution when the IMU is quasi-static due to noise, random walk and other sensor errors. It can be seen from figure 6(b) that the camera tracking solution matches the IR system solution when the system is stationary or slowly moving. However, when we combine the IMU and camera solutions using the fuzzy logic algorithm presented in this paper, the resulting solution matches the IR system tracking solution very closely, as shown in figure 6(c). In practice, we observed that the camera needs to update the IMU at least once every 2 minutes.
(a) Infrared and IMU data
(b) Infrared and camera observations
(c) Infrared and post sensor fusion data
Fig. 6. Experimental data from the Infrared verification
5
Conclusion
In this paper, we presented two contributions related to the topic of sensor fusion. We proposed a fuzzy logic based adaptive algorithm that takes into account the factors determining the fusion and the sources of error to extract an accurate position and pose. The algorithm scales well as more sensors are added, compared to other techniques such as the Kalman filter or Dempster-Shafer theory, which would introduce additional variables and higher-order terms in the computation process. While the inference tables and the fuzzification process involve larger tables with an increasing number of variables, by using suitable data structures such as hash tables and sets one can keep the increase in computation overhead constant. Thus, the algorithm is highly scalable and practical. The second contribution is the development of a prototype that demonstrated a practical implementation of the algorithm. The prototype achieved (i) a sensor fusion rate of 100 Hz using a vision sensor at 30 Hz and an IMU at 100 Hz, (ii) the best accuracy among the sensors by acting as an intelligent switch, verified to be 6 mm for distance and 1.3◦ for angle, and (iii) an equipment cost less than 1/10th of the cost of the infrared-based equipment on the market.
References 1. Welch, G., Foxlin, E.: Motion tracking: No silver bullet, but a respectable arsenal. IEEE Computer Graphics and Applications 22(6), 24–38 (2002) 2. Owen, C., Xiao, F., Middlin, P.: What is the best fiducial. In: The First IEEE International Augmented Reality Toolkit Workshop, pp. 98–105 (2002) 3. Kato, H., Billinghurst, M.: Marker tracking and hmd calibration for a video-based augmented reality conferencing system. International Workshop on Augmented Reality, 85 (1999) 4. ARToolkitPlus, open source optical tracking software, http://studierstube.icg.tu-graz.ac.at/handheldar/artoolkitplus.php 5. Wagner, D., Schmalstieg, D.: Artoolkitplus for pose tracking on mobile devices. In: Proceedings of 12th Computer Vision Winter Workshop (CVWW 2007), Citeseer, pp. 139–146 (2007) 6. Paul, A., Wan, E.: Dual Kalman filters for autonomous terrain aided navigation in unknown environments. In: Proceedings of IEEE International Joint Conference on Neural Networks, IJCNN 2005, vol. 5 (2005) 7. Han, S., Zhang, Q., Noh, H.: Kalman filtering of DGPS positions for a parallel tracking application. Transactions of the ASAE 45, 553–559 (2002) 8. Fitzgerald, R.: Divergence of the Kalman filter. IEEE Transactions on Automatic Control 16, 736–747 (1971) 9. Subramanian, V., Burks, T., Dixon, W.: Sensor Fusion Using Fuzzy Logic Enhanced Kalman Filter for Autonomous Vehicle Guidance in Citrus Groves. Transactions of the ASAE 52, 1411–1422 (2009) 10. Abdelnour, G., Chand, S., Chiu, S.: Applying fuzzy logic to the Kalman filter divergence problem. In: Proc IEEE Int. Conf. Syst., Man, Cybern, IEEE, NJ(USA), vol. 1, pp. 630–634 (1993) 11. Sasiadek, J., Wang, Q.: Sensor fusion based on fuzzy Kalman filtering for autonomous robotvehicle. In: Proceedings of IEEE International Conference on Robotics and Automation, vol. 4 (1999) 12. Ling, Y., Xu, X., Shen, L., Liu, J.: Multi sensor data fusion method based on fuzzy neural network. In: IEEE 6th IEEE International Conference on Industrial Informatics, INDIN 2008, pp. 153–158 (2008) 13. Narayanan, K.: Performance Analysis of Attitude Determination Algorithms for Low Cost Attitude Heading Reference Systems. PhD thesis, Auburn University (2010) 14. Narayanan, K., Greene, M.: A Unit Quaternion and Fuzzy Logic Approach to Attitude Estimation. In: Proceedings of the 2007 National Technical Meeting of The Institute of Navigation, pp. 731–735 (2007) 15. Sparkfun, Inertial Measurement Unit(IMU) manufacturer, http://www.sparkfun.com/commerce/categories.php 16. Point Grey, CCD and CMOS cameras for research, http://www.ptgrey.com/ 17. Computar, optical lens manufacturer, http://computarganz.com/ 18. Vicon, Infra Red(IR) motion capture systems, http://www.vicon-cctv.com/
A Flight Tested Wake Turbulence Aware Altimeter Scott Nykl, Chad Mourning, Nikhil Ghandi, and David Chelberg School of Electrical Engineering and Computer Science Ohio University, Stocker Center Athens, Ohio, USA, 45701
Abstract. A flying aircraft disturbs the local atmosphere through which it flies creating a turbulent vortex at each wing tip known as a wake vortex. These vortices can persist for several minutes and endanger other aircraft traversing that turbulent airspace; large vortices are essentially invisible horizontal tornadoes and are a grave threat to smaller aircraft, especially during landing and take off. Accidents related to wake turbulence have resulted in both loss of life and aircraft destruction in the United States and around the world. Currently no cockpit instrumentation exists that tracks wake vortices and enables a pilot to sense and avoid wake turbulence in real-time. This paper presents a prototype of a novel, flight tested instrument that tracks wake vortices and presents this information to a pilot in real time using a synthetic virtual world augmented with wake turbulence information.
1 Motivation
A flying aircraft disturbs the local atmosphere through which it flies, creating a turbulent vortex at each wing tip known as a wake vortex. These vortices can persist for several minutes and endanger other aircraft traversing that turbulent airspace [1]. In the United States alone, over the decade spanning 1983 to 1993, at least 51 accidents and incidents resulted from wake vortices, killing 27 people and destroying 40 aircraft [2]. In Europe, vortex incidents are also not uncommon; in recent years, the London-Heathrow International Airport has reported about 80 incidents per year [1]. Currently, no standardized instrumentation exists for pilots or air traffic controllers to precisely convey current wake vortex information to the cockpit. Instead, pilots in the United States use the policies and procedures published by the Federal Aviation Administration (FAA) in [3]. When using visual flight rules (VFR), where the pilot has good visibility, the pilot is responsible for tracking leading aircraft and mentally extrapolating their flight paths back to the pilot's current position; subsequently, the pilot must fly above those extrapolated paths and touch down further along the runway than any leading aircraft [2, 3]. When using instrument flight rules (IFR), air traffic control is responsible for warning each aircraft about potential wake turbulence and providing each affected aircraft a modified approach path. The capacity of any given airport is limited by the separation distances of the aircraft in the local traffic; the closer the aircraft fly to each other, the higher the airport's capacity and the greater the probability of wake turbulence. In VFR conditions, separation distances are relaxed as the pilot has visual confirmation of leading aircraft; this
increases airport capacity, and also increases the likelihood of wake turbulence. In IFR conditions, separation distances are increased, thus reducing the airport capacity as well as reducing the likelihood of wake turbulence; in fact, as of 2002, all wake turbulence incidents in Europe occurred while using VFR rules, and none occurred under IFR rules [1]. This has led to postulations that IFR rules are too conservative and could be relaxed to increase airport capacity under IFR rules. A major goal of wake turbulence research is to realize the minimal required aircraft spacing for given atmospheric conditions at a given airport; systems that attempt to compute this reduction limit are called Reduced Separation Systems (RSS). The severity of wake turbulence experienced by an aircraft depends on four main factors. First is the size/shape of the leading aircraft that created the vortices; the larger the aircraft, the larger the generated turbulence. The wake characteristics of a given vortex can be inferred from the type of aircraft responsible for creating it. Second is the size/shape of the trailing aircraft and its flight path/orientation throughout the turbulent airspace; the smaller the aircraft, the more susceptible it is to severe turbulence. The trailing craft's control power and wing configuration at the time of intersection affect the resulting behavior. Third is the age of the vortices and the corresponding weather conditions in which they are embedded; vortices dissipate over time, typically within 3–5 minutes. Local weather conditions, such as heavy winds, tend to increase the dissipation rate; however, light winds (~5 knots) cause the vortices to drift downwind without an increased dissipation rate, which causes problems when two runways/flight paths are parallel and one is downwind from the other [3]. Fourth, the response of the pilot or auto-pilot is critical; inexperienced pilots and/or auto-pilot properties may overcompensate as a result of the unexpected turbulence and modify the aircraft's configuration in such a way as to further the negative impact of the vortex [1]. Much work has been done on vortex observation and on mathematical models which describe vortex behavior in both the near field and the far field [4–15]. Such models, combined with increased demands on airport capacity, have given rise to several specific RSS systems. These systems include the "Wirbelschleppen-Warnsystem" (WSWS) system installed at Frankfurt Airport [16, 17], the "Systeme Anticipatif de Gestion des Espacements" (SYAGE) system installed at the Toulouse-Blagnac Airport [18], the Aircraft Vortex Spacing System (AVSS) developed by NASA and installed at Dallas-Fort Worth Airport [19, 20], the "High Approach Landing System / Dual Threshold Operation" (HALS/DTOP) system installed at Frankfurt Airport [21], and the "Simultaneous Offset Instrumented Approach" (SOIA) system developed by the FAA and installed at San Francisco Airport [22]. These systems were all specifically built to help air traffic control (ATC) reduce spacing and increase capacity; however, for various reasons, none of these specialized systems are currently in use [1]. The failure of these specialized systems exemplifies the need to create an interoperable mechanism by which wake turbulence can be avoided and spatial separation can be minimized.
Our system is based on the standardized and widely accepted ADS-B protocol [23]; based on the ADS-B information, the prototype provides conservative wake turbulence tracking and conveys wake turbulence information to the pilot concisely at a minimum update frequency of 1Hz. Since this system uses ADS-B, it is intrinsically an air-to-air system and does not require specialized ground mechanisms to be installed at each airport.
2 Wake Turbulence Aware Altimeter
Given a leading and a trailing aircraft, H and T, respectively, our prototype provides new instrumentation enabling T to proactively avoid the wake turbulence generated by H in real time. This visual instrumentation of wake turbulence eases demand on air traffic controllers while improving the safety of aircraft flying within a close spatial and temporal proximity. Data for this system can be acquired through the Automatic Dependent Surveillance-Broadcast (ADS-B) protocol [23]; this protocol is widely accepted and is mandated by the FAA to be operational in all general aviation aircraft by 2020, and it is already widely adopted in Europe. Our prototype augments an existing altitude strip with additional information about the leading aircraft. Altitude information computed from the leading aircraft, H, is displayed as a semi-transparent bar augmenting T's existing altitude strip, see Fig. 1; as shown, T's altitude is currently at 1025 ft and the wake turbulence generated by H is currently at 960 ft. Thus, T simply needs to hold an altitude above 960 ft at its current position to avoid wake turbulence. The graphical representation displays this difference of 65 ft using the existing altitude strip's scale/location, giving the pilot the ability to quickly glance at the altimeter and perceive both values with ease. The pilot of T simply needs to hold the aircraft above the altitude indicated on the wake bar to avoid potential wake turbulence; this holds true for landing, take off, and cruising. We chose to augment the existing altitude strip to show the leading aircraft's altitude information based on input from several pilots from Ohio University's Avionics Engineering Center as well as research engineers familiar with rapid prototyping of new cockpit instrumentation. The strip was placed on the right side of the display to mirror common Heads-Up Displays (HUDs), such as those found in Honeywell's SmartView flight display [24] and Microsoft Flight Simulator 2011. Strictly speaking, this prototype does not augment the pilot's reality via a HUD; instead, the pilot is required to view a display visualizing the augmented reality.
2.1 Implementation via the STEAMiE Visualization Engine
This prototype focuses on the simple case involving only one leading aircraft, H, and one trailing aircraft, T. The more general case involving many aircraft flying independent courses requires a more sophisticated tracking algorithm and is discussed in section 4. This prototype uses the STEAMiE Visualization Engine [25] to create a synthetic world which mirrors the real world in real time; this engine has previously been used to visualize scientific data as well as interactive virtual worlds [26, 27]. In this experiment, STEAMiE consumes data from several data sources simultaneously; these include satellite imagery from the Microsoft Bing Maps Platform © [28], elevation data from the U.S. Geological Survey [29], and real-time flight data from on-board altimeters, on-board GPS receivers, and on-board inertial units. STEAMiE fuses these data into a real-time virtual world accurately portraying the local environment and the current flight path of both aircraft, as shown in Figs. 1, 2; see Fig. 3 for a block diagram showing the inputs and outputs. In essence, this virtual world accurately represents reality within the time and error bounds of the aforementioned input data sources/devices. Using the
Fig. 1. (Left) Leading aircraft H's flight path is indicated by the red ribbon; trailing aircraft T's flight path is indicated by the blue ribbon. The altitude strip shows the current altitude of T, and the semi-transparent red bar overlaying the altitude strip shows the altitude of H's wake turbulence. (Right) Leading aircraft H's flight path is indicated by the red ribbon; trailing aircraft T's flight path is indicated by the blue ribbon. Ohio University's UNI Airport is visible in the lower right.
Fig. 2. On the right side is leading aircraft H, its red ribbon visualizes its flight path up to the current time. On the left side is trailing aircraft T , its blue ribbon is just out of view in the forefront, but visible in the background as T performed a 180° maneuver.
data within this virtual world, an augmented reality including the enhanced altimeter is generated and displayed at 60 frames per second on a monitor mounted in T's cockpit. Furthermore, colored ribbons of H's and T's flight paths can be shown within this world; see the red and blue ribbons, respectively, in Figs. 1, 2. The red ribbon generated by H is the conservative wake turbulence surface. This is the raw data set consumed by the tracking algorithm described in section 2.2, which populates the wake value displayed in the augmented altimeter. In this prototype, the satellite imagery of the surrounding landscape and the elevation data were displayed to the pilot along with H's altitude information. In some cases, this may provide unnecessary information to the pilot, resulting in visual overload or visual "noise"; Honeywell's system, for example, simply shades the terrain, removing
Fig. 3. A block diagram showing the inputs and outputs of this prototype. Two STEAMiE instances are created; one in the leading and one in the trailing aircraft. The receiving STEAMiE instance uses information from the transmitting STEAMiE instance in combination with its local information to create an augmented virtual world including the enhanced altitude strip.
the potentially distracting detail of high resolution satellite imagery [24]. In our case, the main focus for this rapid prototype was the wake information, and our flight team was satisfied using local satellite imagery. Another design decision was to draw vertical plumb lines from the ribbon down to the ground surface; although the wake turbulence may be avoided by flying far enough below H, it would be impossible for T to maneuver above H's wake vortices without changing heading or intersecting the vortices. In the case of a final approach, T must approach above H; should T find itself below H on a final, T would be forced to abort the approach, circle around, and attempt the final again from a higher altitude.
2.2 Tracking Algorithm for Conservative Wake Turbulence Surfaces
This tracking algorithm operates on the position data of H and provides conservative wake turbulence information for the pilot/user. The term conservative in this context comes from the wake turbulence model used in this visualization. For this visualization, it is assumed that wake turbulence is generated at every sampled position of H and remains stationary for a fixed amount of time before dissipating. Except in cases of severe updrafts, wake turbulence will only descend; lateral movement is possible, but even in these cases, reporting the highest nearby point gives a conservative prediction. This is opposed to higher fidelity models, like those explored in [30, 31].
Fig. 4. Two implemented intersection volumes; a semicylinder on the left and a cone on the right (not to scale).
"Nearby" can be abstracted to mean "within some intersection volume, within some recent time interval." For this prototype, two different intersection volumes have been tested. Visualizations of the two intersection volumes are given in Fig. 4. The first intersection volume tested was a semi-cylinder with its lengthwise axis aligned with the gravity direction of the simulation and centered along the heading vector of T. A radius of one nautical mile was chosen as a safe spatial distance for wake turbulence avoidance, and three minutes was chosen as a safe timeout interval for the sampled data points of H. The semi-cylindrical intersection volume was sufficient in many of the test cases, but one particular case was found where the results were too conservative to be useful. In the case where H crosses through the intersection volume twice, flying in opposite directions but at two different altitudes, the higher altitude will always be taken. This scenario occurs frequently when one plane is already on its final approach while the second plane is flying its base leg (the opposite direction) in preparation for turning onto its final approach. Because of lateral movement considerations, this heightened reading is correct under the given constraints, but it was determined that an alternate conical volume would be superior in cases involving both planes on their final approach, which is the situation where this visualization is most valuable. The conical intersection volume was successful at restricting the search space for wake turbulence in the case outlined above. Because the conical intersection volume is narrower near T and expands with distance, it does a better job of predicting likely turbulence instead of merely possible turbulence. Because of its shape, the conical intersection volume has the same 100% detection rate on a final approach as the semi-cylindrical volume, but produces fewer cautions in situations of possible, but unlikely, wake turbulence like the one above.
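The following Python sketch illustrates the conservative tracking idea described in this section: keep the recent position samples of H, discard those older than the timeout, keep those that fall inside a forward-looking intersection volume around T (a horizontal cone/sector is used here), and report the highest remaining altitude as the wake bar value. The half-angle, data layout, and function names are illustrative assumptions; only the 1 NM radius and 3 minute timeout are taken from the text above.

```python
import math, time

NM_IN_METERS = 1852.0
TIMEOUT_S = 180.0               # 3-minute lifetime for sampled wake points
MAX_RANGE = 1.0 * NM_IN_METERS  # 1 NM search radius

def in_cone(rel_x, rel_y, heading_rad, half_angle_rad=math.radians(20.0)):
    """True if the horizontal offset (rel_x, rel_y) lies inside a forward cone of T."""
    dist = math.hypot(rel_x, rel_y)
    if dist == 0.0 or dist > MAX_RANGE:
        return False
    bearing = math.atan2(rel_y, rel_x)
    diff = (bearing - heading_rad + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= half_angle_rad

def wake_bar_altitude(samples, t_pos, t_heading_rad, now=None):
    """samples: list of (timestamp, x, y, altitude) for H. Returns the conservative
    wake altitude to display, or None if no recent sample lies inside the volume."""
    now = time.time() if now is None else now
    hits = [alt for (ts, x, y, alt) in samples
            if now - ts <= TIMEOUT_S
            and in_cone(x - t_pos[0], y - t_pos[1], t_heading_rad)]
    return max(hits) if hits else None
```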
3 Experimental Results
The prototype was tested in May 2011 using two aircraft, H and T, based at Ohio University's Snyder Field (UNI). The configurations of H and T are described in sections
Fig. 5. Prototype pilot’s cockpit display visualizing the real time virtual world
Fig. 6. Screen shots showing T converging on the runway while visualizing H's conservative wake turbulence surface and the corresponding wake strip on the altimeter
2 and 2.1 and are summarized in Fig. 3. The flight configuration involved H approaching the runway from 10 NM while T circled at a distance of 5–7 NM from the runway. When H approached the ~7 NM mark, T performed a standard "base leg" maneuver to position itself about 1 NM behind H, matching H's flight direction. At the point where T's semicircle-based inclusion region began intersecting H's conservative wake turbulence surface, T's wake strip began displaying the corresponding altitude of H's wake vortices; see Fig. 1 and Sec. 2.2. The cockpit-mounted display shown in Fig. 5 is the prototype's output and the pilot's input. The specific altimeters used gave a 10 ft resolution at 1 Hz, H's GPS update frequency was 1 Hz, and H's 329 MHz/1200 baud transmitter sent the pseudo ADS-B message to T at 5 Hz; this higher transmit frequency meant that only 1 out of 5 related transmits needed to be received properly by T to update at the maximum frequency of H's altimeter and GPS. This was useful when the system was operating at the maximum tested range of about ~13 NM; the system never lost communication for any extended period of time. T's GPS was integrated with a 100 Hz inertial unit and used a Kalman filter to interpolate its GPS position at 100 Hz; therefore, T moved with a smooth position and
orientation, including smooth updates to the pitch ladder as shown in Fig.6. The actual virtual world was displayed at 60Hz to the pilot; as a result, the display shown in Figs.5 and 6 was virtually indistinguishable from commercial flight simulation software, with the exception that this represented the current state of the real world. For a brief YouTube video of the actual test flight and one of the synthetic world output, see [32, 33], respectively.
4 Conclusion and Future Work Our flight tested prototype successfully provided the pilot with a real-time instrument to more precisely avoid wake turbulence from a leading aircraft; as was the original intent of this research. The necessary data for this prototype is available via the FAA mandated ADS-B protocol and therefore requires no new communications equipment to be installed[23]. Ohio University’s Avionics Engineering Center’s acting Chief Pilot Jamie Edwards, described his experience with the prototype, “The displayed wake strip information gives a pilot actual guidance to stay above the flight path of the aircraft ahead of him. It can give the pilot of the light aircraft in this situation a peace-of-mind never known before.” He further described the prototype’s usage during flight, “The red wake strip concept is intuitive to fly. It overlays the altitude strip and gave me a direct indication as to my vertical position relative to the aircraft in front of me when he was at my same distance from the airport. It easily allowed me to set a descent rate that kept the red wake strip below my current altitude and effectively clear of his wake turbulence.” Future work will focus on extending the tracking algorithm to multiple aircraft and clearly conveying this information to the pilot in an intuitive manner without causing “information overload.” Furthermore, we would like to enhance the synthetic world to quickly give the pilot wake turbulence information for all local air traffic which will decrease necessary aircraft spacing, increase airport capacity, minimize airport congestion, and relieve wake turbulence related air traffic control responsibilities while improving overall safety. Finally, we intend to incorporate more advanced wake turbulence models, such as those found in [30, 31], and compare trade offs between the additional aircraft density allowed by these models and safety.
References
1. Gerz, T., Holzäpfel, F., Darracq, D.: Commercial aircraft wake vortices. Progress in Aerospace Sciences 38, 181–208 (2002)
2. Hay, G.C., Passman, R.H.: Wake Turbulence Training Aid. Federal Aviation Administration, 800 Independence Ave., S.W., Washington, DC 20591 (1995)
3. Boeing: Wake turbulence avoidance: A pilot and air traffic controller briefing. VHS video (1995)
4. Greene, G.C.: Approximate model of vortex decay in the atmosphere. Journal of Aircraft 23, 566–573 (1986)
5. Corjon, A.: Vortex model to define safe aircraft separation distances. Journal of Aircraft 33 (1996)
6. Kantha, L.H.: Empirical model of transport and decay of aircraft wake vortices. Journal of Aircraft 35 (1998)
7. Robins, R., Delisi, D.: Wake vortex algorithm scoring results. NASA Langley Research Center, Northwest Research Associates (2002)
8. Robins, R., Delisi, D.: Further development of a wake vortex predictor algorithm and comparisons to data. AIAA Paper 99-0757 (1999)
9. Soudakov, G.: Engineering model of the wake behind an aircraft. Trudy TsAGI 2641 (1999)
10. Sarpkaya, T.: New model for vortex decay in the atmosphere. Journal of Aircraft 37, 53–61 (2000)
11. Zheng, Z.C., Lim, S.H.: Validation and operation of a wake vortex/shear interaction model. Journal of Aircraft 37, 1073–1078 (2000)
12. Moet, H., Darracq, D., Corjon, A.: Development of a decay model for vortices interacting with turbulence. In: 39th AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV (2001)
13. Mokry, M.: Numerical simulation of aircraft trailing vortices interacting with ambient shear or ground. Journal of Aircraft 38, 636–643 (2001)
14. Jackson, W., Yaras, M., Harvey, J., Winckelmans, G., Fournier, G., Belotserkovsky, A.: Wake vortex prediction: An overview. Transport Canada Transportation Development Centre (2001)
15. Holzäpfel, F.: Probabilistic two-phase wake vortex decay and transport model. Journal of Aircraft 40, 323–331 (2003)
16. Gurke, T., Lafferton, H.: The development of the wake vortices warning system for Frankfurt airport: Theory and implementation. Air Traffic Control Quarterly 5, 3–29 (1997)
17. Frech, M., Holzäpfel, F., Gerz, T., Konopka, J.: Short-term prediction of the horizontal wind vector within a wake vortex warning system. Meteorological Applications 9, 9–20 (2002)
18. Le Roux, C., Corjon, A.: Wake vortex advisory system implementation at Orly airport for departing aircraft. Air Traffic Control Quarterly 5, 31–48 (1997)
19. Hinton, D., Charnock, J., Bagwell, D., Grigsby, D.: NASA aircraft vortex spacing system development status. In: 37th Aerospace Sciences Meeting & Exhibit, Reno, NV, AIAA 99-0753 (1999)
20. Hinton, D., Charnock, J., Bagwell, D.: Design of an aircraft vortex spacing system for airport capacity improvement. AIAA 622, 1–18 (2000)
21. Frech, M.: VORTEX-TDM, a parameterized wake vortex transport and decay model and its meteorological input data base. Deutsche Flugsicherung, DFS, Langen (2001)
22. Greene, G., Rudis, R., Burnham, D.: Wake turbulence monitoring at San Francisco. In: 5th WakeNet Workshop, DFS Academy, Langen, vol. 2 (2001)
23. Barhydt, R., Warren, A.W.: Development of Intent Information Changes to Revised Minimum Aviation System Performance Standards for Automatic Dependent Surveillance Broadcast (RTCA/DO-242A). NASA Langley Research Center, Hampton, VA 23681-2199 (2002)
24. Honeywell Aerospace: SmartView synthetic vision system (2011)
25. Nykl, S., Mourning, C., Leitch, M., Chelberg, D., Franklin, T., Liu, C.: An overview of the STEAMiE educational game engine. In: IEEE 38th Annual Conference on Frontiers in Education, FIE 2008, pp. F3B-21 (2008)
26. Mourning, C., Nykl, S., Xu, H., Chelberg, D., Liu, J.: GPU acceleration of robust point matching. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Chung, R., Hammound, R., Hussain, M., Kar-Han, T., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6455, pp. 417–426. Springer, Heidelberg (2010)
27. Mourning, C., Nykl, S., Chelberg, D., Franklin, T., Liu, C.: An overview of first generation STEAMiE learning objects. In: Siemens, G., Fulford, C. (eds.) Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications 2009, Honolulu, HI, USA, AACE, pp. 3748–3757 (2009)
28. Microsoft: Microsoft Bing Maps platform (2011)
29. USGS: United States Geological Survey (2011)
30. Holforty, W., Powell, J.: Flight deck display of airborne traffic wake vortices. In: The 20th IEEE Conference on Digital Avionics Systems, DASC 2001, vol. 1, pp. 2A3-1 (2001)
31. Holforty, W.: Flight-deck display of neighboring aircraft wake vortices (2003)
32. Nykl, S.: STEAMiE engine wake turbulence test flight - Ohio University Avionics Engineering Center (2011), http://www.youtube.com/watch?v=jZdQGOwTe2k
33. Nykl, S.: STEAMiE engine wake turbulence aware altimeter - Ohio University Avionics Engineering Center (2011), http://www.youtube.com/watch?v=IBf5xqzB5m0
A Virtual Excavation: Combining 3D Immersive Virtual Reality and Geophysical Surveying
Albert Yu-Min Lin1, Alexandre Novo2, Philip P. Weber1, Gianfranco Morelli2, Dean Goodman3, and Jürgen P. Schulze1
1 California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, CA 92093, USA
2 Geostudi Astier, via A. Nicolodi 48, 57121 Livorno, Italy
3 Geophysical Archaeometry Laboratory, 20014 Gypsy Ln, Woodland Hills, CA 91364, USA
Abstract. The projection of multi-layered remote sensing and geophysical survey data into a 3D immersive virtual reality environment for noninvasive archaeological exploration is described. Topography, ultra-high resolution satellite imagery, magnetic, electromagnetic, and ground penetrating radar surveys of an archaeological site are visualized as a single data set within the six-sided (including floor) virtual reality (VR) room known as the StarCAVE. These independent data sets are combined in 3D space through their geospatial orientation to facilitate the detection of physical anomalies from signatures observed across various forms of surface and subsurface surveys. The data types are highly variant in nature and scale, ranging from 2D imagery to massive-scale point clouds. As a reference base layer, a site elevation map was produced and used to normalize and correlate the various forms of collected data within a single volume. Projecting this volume within the StarCAVE facilitates immersive and collaborative exploration of the virtual site at the actual scale of the physical site.
1 Introduction
Non-invasive investigations of subsurface anomalies through geophysical surveys can provide archaeologists with valuable information prior to, or in place of, the non-reversible process of excavation. This can be extremely useful, especially in cases where excavation is not an option or is restricted. Furthermore, these tools can be used to monitor the state of preservation of sites or monuments through nondestructive analysis [1]. Geophysical methods, such as magnetic [2, 3], electromagnetic (EM) [4–6], and ground penetrating radar (GPR) [7, 8], detect features by observing variations of physical properties of materials within a matrix. Each of these methods exploits different physical properties to generate maps of the variations. Magnetic survey is a passive detection of contrasts in the magnetic properties of differing materials, whereas EM surveys measure the conductivity and magnetic susceptibility of soil by inducing eddy currents through a generated electromagnetic
field. GPR transmits an electromagnetic pulse and measures a reflected signal that is dependent upon the dielectric properties of subsurface material [9]. With GPR, it is possible to reconstruct high-resolution 3D data visualizations of the composition of the subsurface [10–12]. While there have been many impressive advances in data processing techniques to enable this, less focus has been applied to the potential of non-standard visualization environments to further the ability to generate virtual representations of the subsurface. For example, the "StarCAVE" is a virtual reality (VR) environment operating at a combined resolution of 68 million pixels, 34 million pixels per eye, distributed over 15 rear-projected wall screens and 2 down-projected floor screens [13]. The goal of this paper is to explore the use of the StarCAVE to enable non-invasive "virtual excavation" through the 3D VR reconstruction of geophysical survey data of an archaeological site that was investigated in July 2010 as a component of the Valley of the Khans Project, a non-invasive remote sensing survey for burial sites in Northern Mongolia. Due to local customs that prohibit the destruction of any burial grounds, this case study serves as an example where geophysics and virtual reality representations of archaeological sites provide an alternative to destructive surveys. The rest of the paper is organized as follows. First we look at related work. In Section 3, we describe the data collection, processing and visualization methods used in this study. In Section 5, we discuss our results and observations of 3D virtual reality visualization of the data. Finally, Section 6 summarizes our main conclusions.
2 Related Work
There are many fully featured software tools for the visualization of ground penetrating radar data sets, among others Mala GeoScience [14], AEGIS Easy 3D [15] and Halliburton GeoProbe [16]. None of them, however, supports immersive 3D environments, and they thus cannot take advantage of the high-resolution and to-scale display capabilities of CAVE-like systems. Some prior work uses direct volume rendering for the display of the data [17–21], which would require resampling our data since the GPR data needs to be displayed in a way that follows the shape of the terrain. Billen et al. [22] created an immersive application for CAVE environments, but it does not allow visualizing the data as points, which permit very precise analysis of the data on the basis of individual data values and allow the data to follow the terrain beneath it.
3 Data Collection Methods
An 85 x 80 meter archaeological site was identified for survey by observing surface artifacts in and around the roots of fallen trees. A site grid comprising 5 x 5 meter cells oriented along geographic north was marked in the field in order to acquire data in a regular pattern, since neither GPS nor a total station was used.
Fig. 1. GPR survey of this study’s field site in Northern Mongolia with the IDS dual frequency antenna detector
Each grid was positioned based on local coordinates and data were collected following parallel lines spaced by 25 cm. Sub-meter resolution GPS was used to record the UTM coordinates of the corners of the grid. An Overhauser gradiometer was used in this study. During the survey, the distance between sensors was set at 1.5 m and the distance between the lower of the two sensors and the ground was maintained at 0.2 m. Data was collected in "fast walking" mode at a 0.5 second cycling rate following parallel north-south transects approximately 1 m apart. The internal sub-meter GPS of the gradiometer was employed for data positioning. The EM-38 electromagnetometer measures ground conductivity (quad-phase) in milliSiemens per meter (mS/m) and magnetic susceptibility (in-phase) in parts per million. The maximum effective depth range (1.5 m) was achieved by collecting data in the vertical dipole mode. Data collection was performed in walking mode at a cycling rate of 2 readings per second following parallel transects approximately 1 m apart. An internal sub-meter GPS recorded geospatial positions of scans and an external data logger allowed the operator to view position and raw data in real time. This study used an IDS GPR system with a dual frequency antenna at 250 MHz and 700 MHz for simultaneous investigation of deep and shallow targets, respectively, as seen in Figure 1. Parallel profiles 0.25 meters apart were followed using string as a guideline, in order to assist the operator in pushing the GPR antenna across a generated surface grid [7]. This method, along with 3D visualization techniques, has been widely applied in GPR surveys for archaeology [23, 24]. Time slices represent maps of the amplitudes of the recorded reflections across the survey area at a specified time. The processed reflection traces in all reflection profiles were then used to generate three-dimensional horizontal slices by spatially averaging the squared wave amplitudes of the recorded radar reflections over the time window. The interpolation process creates interpolated time-slices,
which are normalized to 8 bits so that the colors follow the changes between levels rather than absolute reflection values. The number of slices depends on the length of the time window selected, the slice thickness and the overlap between slices. The thickness of horizontal slices is often set to at least one or two wavelengths of the radar pulse that is sent into the ground. The raw data set size is 153 MB. The resolution of the recorded data was preserved in the visualization. The radargrams were resampled to a constant number of scans per marker. We set a marker about every meter, and 32 new scans were generated between meter markers. This step creates an equidistant scan distribution along the radargrams. The uneven terrain where data were collected causes the survey wheel to slip, which disturbs the constant scan distribution along the profiles.
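To make the time-slice computation described above concrete, the sketch below averages the squared amplitudes of each reflection trace over a sample window and writes the result onto the slice grid. This is not the authors' processing code; the trace layout, the names, and the half-open window convention are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// One processed reflection trace at a known position on the survey grid.
struct Trace {
    int row, col;                 // grid position of the trace on the survey surface
    std::vector<float> samples;   // reflection amplitudes over time
};

// Build one time slice: mean of squared amplitudes in the window [t0, t1),
// stored row-major on a rows x cols grid. Assumes t1 > t0.
std::vector<float> timeSlice(const std::vector<Trace>& traces,
                             int rows, int cols,
                             std::size_t t0, std::size_t t1)
{
    std::vector<float> slice(rows * cols, 0.0f);
    for (const Trace& tr : traces) {
        float sum = 0.0f;
        for (std::size_t t = t0; t < t1 && t < tr.samples.size(); ++t)
            sum += tr.samples[t] * tr.samples[t];   // squared wave amplitude
        slice[tr.row * cols + tr.col] = sum / float(t1 - t0);
    }
    return slice;
}
```

The per-slice values would then be normalized to 8 bits and color mapped, as described in the text.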
3.1 Data Preprocessing
A digital model terrain (DMT) map of the grid was generated from measurements made at each cell corner to reference small changes in the topography of the site using the software Google SketchUp. Two-dimensional EM and MAG images were warped onto the surface topography map within Google SketchUp by matching georeferenced 2D geophysical data with the UTM coordinates of each grid corner. Three-dimensional GPR sub-cubes of each sampling area were generated from processed radargrams. A complete 3D cube of the entire site was generated by merging the point clouds of each sub-cube. Finally, this cube is corrected for topography from the overlapping DMT model of the entire site. A diagram of the various data layers (with a single depth-slice representation of GPR data is seen in Figure 2. To model the site and data correctly in a virtual environment, the terrain was first constructed and then the subsurface radar data was mapped to the terrain model. The terrain was created with the localized height data as a height field. The data consisted of a grid of 5×5 meter squares where at each corner a vertical difference in meters was collected relative to a local origin. The subsurface radar data that was collected consisted of a local position relative to the local origin, a depth and intensity value. The depth value that was collected was relative to the surface, therefore the data depth was preprocessed by bi-linear interpolating the values from the height field. This resulted in the visualized subsurface data correctly following the contours of the generated terrain model.
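The depth correction could look like the following minimal sketch: the corner-sampled height grid is interpolated bilinearly at a point's x/y position and the surface-relative depth is converted to an absolute height. The struct names, the clamping behavior, and the sign convention (depth measured downward from the surface) are assumptions, not the authors' implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Corner-sampled height grid over 5 m cells, as described in the text.
struct HeightGrid {
    std::vector<float> h;   // corner heights, row-major: ny rows of nx corners
    int nx = 0, ny = 0;     // number of corners in x and y
    float cell = 5.0f;      // cell size in meters

    float at(int i, int j) const { return h[j * nx + i]; }

    // Bilinear blend of the four corner heights surrounding (x, y).
    float sample(float x, float y) const {
        float fx = x / cell, fy = y / cell;
        int i = std::clamp((int)std::floor(fx), 0, nx - 2);   // C++17
        int j = std::clamp((int)std::floor(fy), 0, ny - 2);
        float u = fx - i, v = fy - j;
        return (1 - u) * (1 - v) * at(i, j)     + u * (1 - v) * at(i + 1, j) +
               (1 - u) * v       * at(i, j + 1) + u * v       * at(i + 1, j + 1);
    }
};

// Absolute height of a radar sample: interpolated terrain height minus its
// depth below the surface (sign convention assumed).
float absoluteZ(const HeightGrid& g, float x, float y, float depthBelowSurface) {
    return g.sample(x, y) - depthBelowSurface;
}
```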
4 Software Implementation
The software application was written as a C++ plug-in for CalVR; see Figure 3. CalVR is a virtual reality middleware framework developed at UCSD for the purpose of providing a modern VR software environment which can drive any kind of VR display system and use all typical tracking systems. CalVR is free and open source and was developed for the VR community as a research platform. Plug-ins are separately compiled dynamic libraries which integrate into CalVR in a modular fashion, i.e., they can be active or inactive in any given VR session. CalVR internally uses the OpenSceneGraph API [25] for its graphical objects. Most plug-ins create their visual elements in OpenSceneGraph (OSG) as well, but it is possible to use OpenGL directly, encapsulated in a simple OSG node, so that OpenGL-based graphics can co-exist with OSG-based graphics. The application at hand uses a combination of OpenGL and OSG-based graphical elements.

Fig. 2. Three layers of geophysical data warped over a topographical site map generated in Google SketchUp
4.1 Surface Textures
For spatial context, the user can select one of three different surface textures: it can either be just the 5 x 5 meter grid, which also contains some textual information and landmarks, or the grid along with magnetic surface information, or the grid with the electromagnetic data set superimposed on it. Figure 4 illustrates these options. The user can switch among these three options by moving a small joystick on the VR wand up or down. OSG's Terrain class manages surface structure and textures. The coordinate system is such that +x is east, +y is north, and +z is depth from the surface. We render the surface texture translucent at an alpha level of 50% so that it does not occlude the subsurface radar data. The surface textures are user configurable: additional textures can easily be added, and they can be in any image format OSG supports; we currently use PNG.

Fig. 3. Site topography map and GPR data displayed as a stereo projection within the StarCAVE virtual reality environment (images in the paper show only one stereo channel)

Fig. 4. Three map modes: just topography, magnetic, electromagnetic
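Returning to the translucent ground texture described above, the following is a hedged OpenSceneGraph sketch of one way such a 50%-alpha surface texture could be set up. The file name and function name are placeholders, and this is not taken from the CalVR plug-in itself.

```cpp
#include <string>
#include <osg/GL>
#include <osg/Material>
#include <osg/Node>
#include <osg/StateSet>
#include <osg/Texture2D>
#include <osgDB/ReadFile>

// Bind a ground texture to the terrain node and render it at 50% alpha so the
// subsurface point cloud below remains visible.
void applyTranslucentSurfaceTexture(osg::Node* terrain, const std::string& file)
{
    osg::StateSet* ss = terrain->getOrCreateStateSet();

    // Any image format OSG can read (PNG in the paper's case).
    osg::Texture2D* tex = new osg::Texture2D(osgDB::readImageFile(file));
    ss->setTextureAttributeAndModes(0, tex, osg::StateAttribute::ON);

    // 50% diffuse alpha plus blending; sort into the transparent bin.
    osg::Material* mat = new osg::Material;
    mat->setDiffuse(osg::Material::FRONT_AND_BACK, osg::Vec4(1.0f, 1.0f, 1.0f, 0.5f));
    ss->setAttributeAndModes(mat, osg::StateAttribute::ON);
    ss->setMode(GL_BLEND, osg::StateAttribute::ON);
    ss->setRenderingHint(osg::StateSet::TRANSPARENT_BIN);
}
```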
4.2 Subsurface Radar Data
We display the subsurface radar data as a collection of points; see Figure 5. Each point represents a sample of the radar, covering a volume of about one cubic centimeter. GPR values are available up to a depth of about 2-3 meters. The subsurface radar data set lists the points with x/y coordinates on the surface, but the z value is defined relative to the surface. Hence, in order to display the data in their correct positions in 3D, we calculate the height of the terrain at the respective x/y position and then offset the point by that amount. The x/y coordinates lie on a regular grid, but not all grid cells actually contain data; this is why we store x/y coordinates with every point, rather than storing a list of heights with implicit x/y coordinates as an array. The points are spaced about a diameter apart, so that they create a continuous layer. We color code the points based on density and use a color gradient from blue through green and yellow to red to indicate different levels of density.

The entire area of interest contains more than 13 million sample points. However, our rendering system is not capable of rendering this many points at once. Therefore, we only ever render about one million points at a time, in order to achieve an interactive rendering frame rate of about 30 frames per second in the StarCAVE. The samples are sorted by height, so that by rendering sets of one million points we display points of one or at most two layers at a time. The user can switch between the point subsets by moving the small joystick on the wand left or right. Three different settings for height are shown in Figure 6. There is a short delay of less than a second whenever this switch happens, caused by loading the new set of points.

Rendering one million points at interactive frame rates is not trivial. Plain OpenGL points always project to the same number of pixels on the screen, as opposed to an area which depends on how close the point is to the viewer. Therefore, we decided to use GLSL point sprites instead, which require the implementation of vertex and fragment shaders. We use the shaders from OSG's Compute example for CUDA programming [26]. This shader uses the OpenGL lighting parameters to achieve shading effects matching the shading of the rest of the scene. The transfer of the point data to the graphics card happens through a vertex buffer object (VBO). Whenever the user switches to another subset of points, this VBO gets filled with the new points. The colors for the points are determined by a pre-defined one-dimensional, 256-element look-up table, which is pre-set with the aforementioned color gradient.

Fig. 5. Moving through GPR point cloud data in 3D virtual reality

Fig. 6. Three different layers of GPR data
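The VBO refill and the 256-entry blue-green-yellow-red look-up table described above might look roughly as follows. This sketch uses raw OpenGL (via GLEW) rather than the paper's OSG/CalVR wrappers and point-sprite shaders, and the record and function names are invented for illustration; only the 4-float point layout and the gradient are taken from the text.

```cpp
#include <GL/glew.h>   // assumed for the buffer-object entry points
#include <cstddef>
#include <vector>

struct GprPoint { float x, y, z, intensity; };   // matches the 4-float binary record

// Build a 256-entry RGB gradient: blue -> green -> yellow -> red.
std::vector<float> buildColorLut()
{
    std::vector<float> lut(256 * 3);
    for (int i = 0; i < 256; ++i) {
        float t = i / 255.0f, r, g, b;
        if      (t < 1.0f / 3.0f) { float u = 3 * t;     r = 0; g = u;     b = 1 - u; } // blue -> green
        else if (t < 2.0f / 3.0f) { float u = 3 * t - 1; r = u; g = 1;     b = 0;     } // green -> yellow
        else                      { float u = 3 * t - 2; r = 1; g = 1 - u; b = 0;     } // yellow -> red
        lut[3 * i + 0] = r; lut[3 * i + 1] = g; lut[3 * i + 2] = b;
    }
    return lut;
}

// Refill the VBO with the currently selected one-million-point subset.
void uploadSubset(GLuint vbo, const std::vector<GprPoint>& subset)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 subset.size() * sizeof(GprPoint),
                 subset.data(), GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Point size itself is set per vertex in the point-sprite vertex shader.
    glEnable(GL_PROGRAM_POINT_SIZE);
}
```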
4.3 Usage Information
A complete data set for this application consists of the following files: a configuration file, one or more surface texture files, and one GPR sample data file. The configuration file contains information about the three texture files providing the ground textures, the grid size the textures are on (using 5 x 4 meter squares), the number of binary files referenced for the point data, and the names of those binary files. Each point in the binary file consists of 4 floats (the x, y, z position and an intensity value). The points are sorted in x, y and z (height). At the end of the configuration file is a list of number triples giving the height for select grid points, given as x/y position within the 5 x 5 meter grid system. The osgTerrain library will interpolate missing height data, so it is not critical that this list strictly follow the data grid. Once the GPR plugin has been enabled in CalVR's configuration file, it can be run by passing the name of the configuration file to the executable: CalVR <configuration file>.
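A hypothetical loader for the binary point files described above is sketched here: each record is four 32-bit floats (x, y, z and intensity), already sorted by x, y and height. The struct and function names are invented, and the code assumes the struct is packed with no padding (true for four floats on common compilers).

```cpp
#include <cstdio>
#include <vector>

struct GprSample { float x, y, z, intensity; };   // 16-byte record, as described

std::vector<GprSample> loadPointFile(const char* path)
{
    std::vector<GprSample> samples;
    if (std::FILE* f = std::fopen(path, "rb")) {
        GprSample s;
        while (std::fread(&s, sizeof(GprSample), 1, f) == 1)
            samples.push_back(s);   // records are kept in file order (sorted)
        std::fclose(f);
    }
    return samples;
}
```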
5 Discussion
Representing the data at its original scale is one of the most important benefits of the visualization in virtual reality. Another benefit is that more data is visible at a time thanks to the high pixel count in the StarCAVE. A further benefit of the application is that switching through the various layers of GPR data happens almost immediately. This takes significantly longer with the desktop-based software the researchers use, presumably because the virtual reality application was specifically designed for GPR data display. The choice of displaying the data as points proved to be a good one because it makes it easy to render the data just below the terrain, following the terrain surface. Since each point has its own position, it is easy to modify this position to always be a certain amount below the surface.
6 Conclusions
The representation of data in virtual reality space allows an immersive projection of data at its original scale. The reconstruction of geophysical data in virtual space is an especially relevant application of 3D visualization where physical exploration is not possible and virtual exploration is limited by methods of collection and visualization. The presented software application for the StarCAVE allows quicker insight into the data than desktop-based methods can, and it can show more data at a time.
References
1. Watters, M.S.: GPR: a tool for archaeological management. In: Proceedings of the Tenth International Conference on Ground Penetrating Radar, GPR 2004, pp. 811–815 (2004)
2. Becker, H.: From nanotesla to picotesla - a new window for magnetic prospecting in archaeology. Archaeological Prospection 2, 217–228 (1995)
3. Aitken, M.J.: Magnetic prospecting. I. The Water Newton survey. Archaeometry 1, 24–26 (1958)
4. Frohlich, B., Lancaster, W.: Electromagnetic surveying in current Middle Eastern archaeology: Application and evaluation. Geophysics 51, 1414–1425 (1986)
5. Tabbagh, A.: Applications and advantages of the Slingram electromagnetic method for archaeological prospecting. Geophysics 51, 576–584 (1986)
6. Abu Zeid, N., Balkov, E., Chemyakina, M., Manstein, A., Manstein, Y., Morelli, G., Santarato, G.: Multi-frequency electromagnetic sounding tool EMS. Archaeological discoveries. Case stories. In: EGS-AGU-EUG Joint Assembly, Nice, France, vol. 5 (2003)
7. Novo, A., Grasmueck, M., Viggiano, D., Lorenzo, H.: 3D GPR in archaeology: What can be gained from dense data acquisition and processing. In: Twelfth International Conference on Ground Penetrating Radar (2008)
8. Goodman, D., Nishimura, Y., Rogers, J.: GPR time slices in archaeological prospection. Archaeological Prospection 2, 85–89 (1995)
9. Davis, J., Annan, A.: Ground penetrating radar for high-resolution mapping of soil and rock stratigraphy. Geophysical Prospecting 37, 531–551 (1989)
10. Watters, M.S.: Geovisualization: an example from the Catholme ceremonial complex. Archaeological Prospection 13, 282–290 (2006)
11. Nuzzo, L., Leucci, G., Negri, S., Carrozzo, M., Quarta, T.: Application of 3D visualization techniques in the analysis of GPR data for archaeology. Annals of Geophysics 45, 321–337 (2009)
12. Grasmueck, M., Weger, R., Horstmeyer, H.: Full-resolution 3D GPR imaging. Geophysics 70, K12–K19 (2005)
13. DeFanti, T.A., Dawe, G., Sandin, D.J., Schulze, J.P., Otto, P., Girado, J., Kuester, F., Smarr, L., Rao, R.: The StarCAVE, a third-generation CAVE and virtual reality OptIPortal. Future Generation Computer Systems 25, 169–178 (2009)
14. Mala GeoScience: Windows based acquisition and visualization software (2010), http://www.idswater.com/water/us/mala_geoscience/data_acquisition_software/85_0/g_supplier_5.html
15. Aegis Instruments: Easy 3D - GPR Visualization Software (2010), http://www.aegis-instruments.com/products/brochures/easy-3d-gpr.html
16. Halliburton: GeoProbe volume interpretation software
17. Ropinski, T., Steinicke, F., Hinrichs, K.: Visual exploration of seismic volume datasets. In: Journal Proceedings of the 14th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG 2006), vol. 14 (2006)
18. Chopra, P., Meyer, J., Fernandez, A.: Immersive volume visualization of seismic simulations: A case study of techniques invented and lessons learned. IEEE Visualization (2002)
19. Winkler, C., Bosquet, F., Cavin, X., Paul, J.: Design and implementation of an immersive geoscience toolkit. IEEE Visualization (1999)
20. Froehlich, B., Barrass, S., Zehner, B., Plate, J., Goebel, M.: Exploring geo-scientific data in virtual environments. In: Proceedings of the Conference on Visualization 1999. IEEE Computer Society Press, Los Alamitos (1999)
21. LaFayette, C., Parke, F., Pierce, C., Nakamura, T., Simpson, L.: Atta texana leafcutting ant colony: a view underground. In: ACM SIGGRAPH 2008 Talks. ACM, New York (2008)
22. Billen, M., Kreylos, O., Hamann, B., Jadamec, M., Kellogg, L., Staadt, O.D.: A geoscience perspective on immersive 3D gridded data visualization. Computers & Geosciences 34
23. Leckebusch, J.: Ground-penetrating radar: a modern three-dimensional prospection method. Archaeological Prospection 10, 213–240 (2003)
24. Linford, N.: From hypocaust to hyperbola: ground-penetrating radar surveys over mainly Roman remains in the UK. Archaeological Prospection 11, 237–246 (2004)
25. OpenSceneGraph: Scene graph based graphics library (2004), http://www.openscenegraph.org
26. Orthmann, J., Keller, M., Kolb, A.: Integrating GPGPU functionality into scene graphs. In: Vision, Modeling, and Visualization (2009)
Experiences in Disseminating Educational Visualizations
Nathan Andrysco1,2, Paul Rosen3, Voicu Popescu2, Bedřich Beneš2, and Kevin Robert Gurney4
1 Intel Corporation, 2 Purdue University, 3 University of Utah, 4 Arizona State University
Abstract. Most visualizations produced in academia or industry have a specific niche audience that is well versed in either the often complicated visualization methods or the scientific domain of the data. Sometimes it is useful to produce visualizations that can communicate results to a broad audience that does not have the domain-specific knowledge often needed to understand the results. In this work, we present our experiences in disseminating the results of two studies to a national audience. The resulting visualizations and press releases allowed the studies' researchers to educate a national, if not global, audience.
1 Introduction
For centuries, scientists have been formulating experiments, recording data, and sharing results with others, all in the hope of advancing human understanding of the physical world. For much of that time, the sharing of data and results from the experiments consisted of producing equations and sets of charts, tables, and graphs. These methods are typically geared toward experts in a particular scientific field, which makes it very difficult for non-expert individuals to understand the concepts presented and results achieved. This limited ability to communicate important results with a broader community can lead to slowed social progress and misconceptions of scientific fact. Visualizations used for public consumption have some extra challenges compared to those visualizations meant for experts with years of training in a specific domain. Scientific experts will often work with those creating the visualizations, which means these domain specialists will have some insight into the resulting images and trust that the results are faithful to the underlying data. Conversely, those in a broad audience might be skeptical of both the scientific computations and the visualization method used to produce images. The general public also requires intuitive visualization methods placed into a self-explanatory context, both of which might not be necessary if only communicating the data to experts. Details that might be of great value to domain experts may only serve to confuse those without the underlying scientific knowledge.
In this article we will describe two visualizations that have been released publicly with the hope of educating a broad class of people. The first is a study of the atmospheric concentration of fossil fuel CO2 emissions across the continental United States [1,2]. The second displays results of a study of the damage done by the aircraft during the September 11 Attack on the World Trade Center North Tower (WTC-1) [3,4]. We will discuss the public’s response to the results, as well as what went right and wrong during the press releases.
2 Studies
2.1 CO2 Concentrations over the United States
Global warming and its causes have become a very popular topic in recent years. Over the past 20 years it has been confirmed that rising greenhouse gas levels, particularly carbon dioxide (CO2), contribute significantly to the climate change problem. Without proper estimates of CO2 emissions at fine enough scales, atmospheric experts are unable to make meaningful progress on better understanding carbon cycling through the land, ocean, and atmosphere. High-resolution fossil fuel CO2 emissions estimation also contributes to better decision making on emissions mitigation and policy projections. The lack of high-resolution fossil fuel CO2 emissions data led Purdue University researchers to create the Vulcan Project [1]. The Vulcan Project is a National Aeronautics and Space Administration and U.S. Department of Energy funded project with the purpose of quantifying fossil-fuel CO2 emissions down to levels as detailed as neighborhoods and roadways. Emissions data was estimated based on a large variety of datasets such as air-quality reporting, energy/fuel statistics, traffic and census data. The data is then combined and sampled to a grid with a resolution of 10 km2/hr, in all totaling 13 GB. The native data before regularized gridding is even more extensive. Emissions data gives a good understanding of where fossil-fuel CO2 emissions originate at the surface. But that, of course, is not the entire picture. It is very important to understand how CO2 is propagated through the atmosphere due to mixing and atmospheric transport. The atmospheric CO2 concentrations were simulated by inputting the emissions data into the Regional Atmospheric Modeling System (RAMS) [5]. To simulate four contiguous months, the Vulcan-RAMS analysis required about a week of computation on a 50-node Linux cluster.
2.2 World Trade Center North Tower
The attacks on September 11, 2001 began a broad debate about who was responsible and revealed a deep-seated mistrust of the government. The simulation of the attack on the North Tower of the World Trade Center (WTC-1) began as a larger-scale follow-up to the previously released study on the attack on the Pentagon [6]. The goal of both was to help explain the underlying physics of the attacks.
The simulation of the attack on WTC-1 first required modeling the aircraft, a 767-200ER, and the structure of the WTC-1 tower. The aircraft was modeled using a cross-sectional drawing of the aircraft design and images of components, such as the engines and landing gear. The structure of the WTC-1 tower was modeled from top to bottom, using architectural drawings and first-hand expert knowledge. Irfanoglu and Hoffmann [3] further detail the modeling and verification procedures that were used. The work focused only on the damage done by the plane colliding with the building. Therefore, in the end, the authors limited the simulation to the top 30% of the building, the region most directly affected by the initial attack. The impact simulations were then run using the nonlinear finite-element analysis software LS-DYNA on the 16-processor IBM nano-regatta computer system at Purdue University. The researchers typically simulated the first 0.5 second after impact, which required approximately 50 hours of computation.
3 Visualizations
3.1 CO2 Concentrations over the United States
One of the many goals of the Vulcan Project was to effectively communicate the results of the data to a broad audience. Not only is it important for atmospheric and environmental scientists to understand the data, but it is also important for policy makers and the general public. The visualization researchers worked with domain experts to create a custom program which handles both volumetric and Geographical Information Systems (GIS) data. The use of spatial landmarks was important in the study. It helps convey the information to the viewing public by providing a geographical context. Like those in other fields, atmospheric scientists were most comfortable with visualizations that were relatable to their own studies. In this case, 2D slices of data corresponding to various atmospheric heights were a familiar method. The first 2D visualization technique uses a color map and blending/shading capabilities to create a composited image of CO2 concentration values (Figure 1, top). The second technique is Marching Squares and allows for multiple iso-contour values. The latter is important for showing areas with higher than critical values of CO2 concentrations and their evolution up into the atmosphere. 3D visualization was performed using isosurfaces generated with marching cubes (Figure 1, bottom). The use of marching cubes allows CO2 researchers to easily see phenomena such as CO2 transport and weather fronts, which had previously been difficult to extract using their prior visualization methods. Visualizing CO2 concentrations purely at different atmospheric layers, without any regard to latitude/longitude position, brings important insight into CO2 transport. To eliminate latitude/longitude, each geographic point is projected to a single line using its CO2 concentration. Vertical CO2 columns are connected together with a colored line, with the color indicating the density of points with similar CO2 concentrations. Drawing all of these lines together results in a graph that looks similar to a histogram (Figure 1, right).

The visualizations revealed a number of features. 2D visualizations are able to easily show the daily recurring atmospheric processes and energy consumption patterns of the United States (e.g., rush hour). As a result, the images display greater concentrations during the day and smaller concentrations at night. These images also reveal population centers and locations with heavy industry. 3D visualizations allow the user to easily see the transport of CO2 concentrations. In the video, the user can see CO2 concentrations moving from California into Mexico and from the Eastern seaboard out across the Atlantic Ocean. One of the more interesting features revealed in the work is the frontal systems in the northern portion of the country. The histogram visualizations also reveal the day-night cycle and properties of the various atmospheric layers. Further analysis of all the visualizations was done by Andrysco et al. [2].

Fig. 1. Some of the visualizations used in the CO2 press release [2]. The general public related best to the simpler visualizations (top row and right), but CO2 experts felt the more complicated volume rendering (bottom) was the most useful.
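As a rough illustration of the histogram-style projection described above (not the authors' code), the concentrations of one atmospheric layer can be binned along a 1D concentration axis, with the bin counts driving the line colors. The bin count, names, and clamping are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <vector>

// Count how many grid points of one atmospheric layer fall into each
// concentration bin; denser bins are drawn in warmer colors.
std::vector<int> concentrationHistogram(const std::vector<float>& layerCO2,
                                        float minC, float maxC, int bins)
{
    std::vector<int> counts(bins, 0);
    for (float c : layerCO2) {
        int b = int((c - minC) / (maxC - minC) * bins);
        b = std::clamp(b, 0, bins - 1);   // clamp out-of-range values (C++17)
        ++counts[b];
    }
    return counts;
}
```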
3.2 World Trade Center North Tower
In order to create a science-driven animation, the researchers filtered and converted the complicated simulation data into something more salient. Domain experts would typically use existing post-processors to read and visualize the output from the simulation. These tools allow for the calculation and visualization of many parameters, such as stress, strain, and pressure. However, these parameters are of limited use when presenting the findings to the general public. For all of their powerful features, which are designed for expert users, the post-processors completely lack the ability to produce a high-quality rendering of the simulation. After all, this is not the focus of these software packages. The goal of the project was instead to transform the FEA database into realistic-looking geometry and place that geometry into the context of the real world. For high-quality rendering of the scene data, Autodesk 3D Studio Max was used. In order to import the data into 3D Studio Max, a custom plug-in was developed which took as input an FEA database and output geometry. This geometry could then be rendered with complex materials, lighting, and effects. The 3D Studio Max plug-in generated three distinct types of geometry: shells and beams, fluid, and erosion. Shells are imported directly as a triangle mesh. Beams, which are stored as 3-node elements (two nodes represent the line segment end points of the beam, one the beam orientation), are converted from their three-node representation into real geometry which matches the shape of the beam, such as I-beams, square beams, T-beams, or L-beams. The next type of geometry imported is the fluid, jet fuel. The simulation used smoothed particle hydrodynamics (SPH) for the fluid calculation. In SPH calculations, the fluid is discretized to a set of points with parameters attached, such as volume, velocity, and acceleration. The fluid was imported as a set of nodes and a BlobMesh modifier (a built-in 3D Studio Max tool) was then applied to the node set in order to generate the fluid mesh. The BlobMesh modifier uses implicit surfaces to combine nearby nodes to generate objects resembling fluid (Figure 2, bottom). Although the effects of fire were not considered in the simulation, fire visualization was added to improve visual quality. The SPH fluid was used to seed the fire effects (Figure 2, middle).

Fig. 2. Visual results of the WTC simulation [4]. Images show the outside impact (top), the effect of the plane on internal structures (middle), and one step of post-processing going from simulation output to renderable surfaces (bottom).
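The 3-node beam conversion described above boils down to deriving a local frame from the two end nodes and the orientation node, from which a cross-section profile (I, T, L, or square) can be extruded along the beam axis. The following is only a sketch of that frame construction with invented names; it assumes non-degenerate input (distinct end points, orientation node off the beam axis), and it is not the plug-in's actual code.

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
};

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};
}

struct BeamFrame { Vec3 axis, side, up; };

// end0/end1 are the segment end points; orient fixes the roll of the profile.
BeamFrame beamFrame(const Vec3& end0, const Vec3& end1, const Vec3& orient)
{
    BeamFrame f;
    f.axis = normalize(end1 - end0);                    // along the beam
    f.side = normalize(cross(orient - end0, f.axis));   // perpendicular to axis and orientation direction
    f.up   = cross(f.axis, f.side);                     // completes the right-handed frame
    return f;
}
```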
During simulation, when objects undergo certain amounts of stress they are considered eroded and excluded from future calculation by the simulation software. A special proxy mesh was imported using the erosion data, which is used to seed special effects such as dust (for eroded concrete) or broken glass shards (Figure 2, middle). To give the simulation context, the visualization was placed into Google Earth. The impact on the outside of the structure was also used to provide the viewer with a greater sense of context (Figure 2, top). A more detailed analysis of the visualizations was done by Rosen et al. [4].
4 Response
4.1 Traditional Media
The video "'Revolutionary' CO2 maps zoom in on greenhouse gas sources" was released on YouTube on March 26, 2008 in anticipation of the official university press release on April 7, 2008. News agencies were alerted of the release of the data and visualizations, and many local papers picked up the story along with some major news publications, most notably the New York Times, Scientific American, and Wired. The video "Scientists simulate jet colliding with World Trade Center" was released on YouTube on June 1, 2007, 12 days before the official university press release on June 12, 2007. Similar to the CO2 press release, the WTC story was picked up by many local papers and written about across the internet. Its greatest success was being shown on a national news program.
4.2 YouTube
For both studies, the main distribution method for the visualizations was YouTube. YouTube provides invaluable statistical tracking features via its Insights tool, which we used to view the number of hits per date and which parts of the video were considered interesting or boring.

Viewership Graphs. Figure 3 (top) shows a graph of the daily views of the CO2 visualization. On April 6, 2008, the initial news reports, which generally included a link to the YouTube video, generated about 120,000 views in a single day. The view count received a small spike due to another unrelated video posted under the same YouTube account on July 1, 2008. With the release of Vulcan 1.1 (data mapped into Google Earth) on February 19, 2009, the project again made it into the news. This helped generate an additional 9,000 hits for the video. In total, the video had received over 260,000 views by late May 2011. The WTC daily views are shown in Figure 3 (middle). On the first day of the initial press release, June 13, 2007, the visualization received approximately 13,000 views, a number which gradually fell off over the following days. That is, until June 21, 2007, when the Associated Press picked up the story and the video received over 550,000 views within a single day. Since that time, the video has continued to receive 4,000-6,000 views per day, with the exception of September 11, 2007 and September 11, 2008. On each anniversary of the initial attacks, the video received over 75,000 views. The video had received over 10 million total views by late May 2011.

Hot Spot Graph. The YouTube "Hot Spot" graph shows the viewers' interest over the duration of the video. This information is particularly useful in this paper's context as it provides concrete data on which parts kept the viewers' interest and which parts they ignored. For the CO2 video (Figure 3, bottom left), the most interesting part was the 2D surface slice animation over a two-month period. It seems that most people fast-forwarded straight through the introduction and the static images at the start. Though this part of the video had very informative audio, it was too long and not visually appealing enough. The other visualizations shown were in the "cold" zone, most likely because they were too long and either not visually appealing enough or the viewer had already gotten what they needed from the 2D surface slice animations. The video was probably too long, as it was originally meant to be 2 minutes but ballooned to nearly 5 minutes to include the numerous visualizations and the educational audio that the atmospheric scientists wanted.

Fig. 3. Top: Daily views (left and middle) and popularity relative to all other YouTube videos (right) for the CO2 video. (A) Initial press release. (B) Spike caused by release of an unrelated video. (C) Vulcan 1.1 released. Middle: YouTube graphs for WTC showing daily views (left and middle) and popularity relative to all other YouTube videos (right). (A) Initial press releases. (B) Associated Press picks up press release. (C) Anniversary of the attacks. Bottom: Hot/cold viewing map for the CO2 (left) and WTC (right) videos.
The "Hot Spot" graph for the WTC video (Figure 3, bottom right) shows that viewer interest dropped gradually over the course of the video. The dips in the various parts of the graph correlate with the transitions to different sections in the video. These natural break points present a good opportunity for people to move on to other videos. Viewers most likely skimmed through to the interesting visual effects of these sections. There was a small spike of interest toward the end, where viewers could see the end results of the study. From these YouTube statistics, we believe that the visualizations need to be self-explanatory when dealing with the general public. The audio, no matter how informative, does not seem to hold their interest. The CO2 video showed the most interest when using 2D visualization techniques, which is understandable since the general public does not have the knowledge or insight to make sense of the volume rendering. This is in line with the articles written about the press release, which highlighted the 2D portions and neglected the more visually complicated 3D renderings. It should be noted that the CO2 experts found the 3D portion to be the most useful for discovering new patterns in the data, whereas the 2D only validated their models. For the WTC video, the viewers seemed to want to just browse the videos and were mostly interested in being told the result, instead of watching the full video and coming to the conclusion themselves.
4.3 Individuals' Comments
Due to the sensitive and somewhat controversial nature of the studies, many people felt the need to express their thoughts and feelings. The viewer feedback came in two varieties: e-mails and comments left on websites. The majority of e-mails received for the CO2 project were positive. Many of them were from people who wanted to thank the researchers for doing work that they thought was very important for the environment. Others wanted to inform the researchers that they were going to use the work to help teach their classes. Another common e-mail was a request from people who wanted to learn how to limit their CO2 contribution. News agencies and researchers wanted more detailed data of the U.S. and images for the rest of the world, which is the goal of the next phase of the project. A request by one news agency to see the CO2 mapped to population led to new images being generated and another series of articles on the net. A few researchers and businesses wanted to use the data for their own purposes. The comments left on websites were not nearly as positive. Though global warming was never mentioned in the press release or video, many readers attacked the study because they believed that global warming was a hoax. These people believed the study was a waste of money and that researchers around the globe were exploiting people for profit. They posted incorrect facts to back up their beliefs and to tarnish the work presented. Those who took the opposite view (they believe in global warming or that CO2 pollution is a serious issue) had heated debates with the negative posters. Other people pointed out the limits of the study, namely that it is United States-centric. The e-mails received regarding the WTC project were likewise both positive and negative. The positive e-mails praised the work for the effort of making the FEA simulation accessible to the public at large through general-purpose visualization, and for documenting the tragic events. The e-mails' authors ranged from civil engineers and simulation experts to relatives of the victims, the latter of whom thanked the researchers for confirming the findings put out by the U.S. government. The negative e-mails ranged from disputing the scientific merit of the visualization to accusations of intentional misrepresentation of the events and involvement in some kind of government conspiracy. Comments on YouTube and other websites were similar to the e-mails received. The WTC visualization has been requested for inclusion in the narrative of the National September 11 Memorial and Museum at the World Trade Center.
5 Conclusions
To communicate to a broad audience, who may not necessarily have a visualization background, it is important to make the images as intuitive as possible. Using a spatial context and other realistic features (e.g., the fire added to the WTC video) will make the visualizations more relatable and help to keep the viewer engaged. Similarly, displaying easily perceived events (e.g., CO2 traveling across the country) helps viewers connect to what they are seeing. We also recommend limiting audio and having short, to-the-point animations in order to maintain user interest. We found that creating a press release with the intention of massive viewership of the visualizations has both positive and negative aspects associated with it. Perhaps the most positive contribution of doing this work is that it increases public awareness of a scientific study or issue and helps to stimulate a dialog between individuals of opposing viewpoints. These visualizations help non-domain experts understand complex physical scientific events by delivering information which would otherwise be difficult to understand. But be prepared for negative, and sometimes harsh, comments. The general public tends to respond positively, but only if the ideas presented reinforce their existing views toward the subject matter. For example, those who accept the theory of global climate change or the generally agreed upon account of the events of September 11th tend to find the visualizations interesting and informative. Those who disagree attack the quality of the work. Researchers, and their associated institutions, have a lot to gain and lose as well. Administrators tend to favor any activity which will help their institution gain public attention, particularly when the visualization garners positive public attention. However, the researchers and institutions are putting their reputation on the line. Both the scientific experiment and the visualization need to have high fidelity. Even minor factual slip-ups by those in the press who are passing word of the study will lead to questions about the credibility of the researchers. These types of visualizations intended for the public often require hours upon hours of additional work beyond that needed for the initial scientific study. Scientists tend to be interested only in raising their profile within their community and find the additional work to be a nuisance, lacking in value, and unimportant to the real science. To that end, using website hits and video views is perhaps not an accurate way to measure what kind of impact the presented images and videos have had on people, as the subject matter and pretty pictures may have been what generated the statistics. There is little methodology for studying the impact of visualizations on large populations, and a more formal approach is future work. In the end, we believe that a well done scientific study combined with interesting visuals can have a profound impact on all involved. Though we have received many negative comments, we believe those have come from one extreme viewpoint and constitute a vocal minority. The telling of a factual scientific story has most likely educated countless people, which is what really matters. On a personal note, being involved in a work that garners national attention is a unique and rewarding experience which we recommend to all those willing to put in the extra effort.

Acknowledgments. The project was supported by NASA (grants Carbon/04-0325-0167 and NNX11AH86G), the DOE (VACET and grant DE-AC02-05CH11231), and the NIH/NCRR Center for Integrative Biomedical Computing (grant P41-RR12553-10). Computational support was provided by Purdue's Rosen Center for Advanced Computing (Broc Seib and William Ansley) and the Envision Center.
References
1. Gurney, K.R., Mendoza, D.L., Zhou, Y., Fischer, M.L., Miller, C.C., Geethakumar, S., de la Rue du Can, S.: High resolution fossil fuel combustion CO2 emission fluxes for the United States. Environmental Science & Technology 43, 5535–5541 (2009)
2. Andrysco, N., Gurney, K.R., Beneš, B., Corbin, K.: Visual exploration of the Vulcan CO2 data. IEEE Comput. Graph. Appl. 29, 6–11 (2009)
3. Irfanoglu, A., Hoffmann, C.M.: Engineering perspective of the collapse of WTC-I. Journal of Performance of Constructed Facilities 22, 62–67 (2008)
4. Rosen, P., Popescu, V., Hoffmann, C., Irfanoglu, A.: A high-quality high-fidelity visualization of the attack on the World Trade Center. IEEE Transactions on Visualization and Computer Graphics 14, 937–947 (2008)
5. Cotton, W.R., Pielke Sr., R.A., Walko, R.L., Liston, G.E., Tremback, C.J., Jiang, H., McAnelly, R.L., Harrington, J.Y., Nicholls, M.E., Carrio, G.G., et al.: RAMS 2001: Current status and future directions. Meteorology and Atmospheric Physics 82, 5–29 (2003)
6. Hoffmann, C., Popescu, V., Kilic, S., Sozen, M.: Modeling, simulation, and visualization: The Pentagon on September 11th. Computing in Science and Engg. 6, 52–60 (2004)
Branches and Roots: Project Selection in Graphics Courses for Fourth Year Computer Science Undergraduates
M.D. Jones
Brigham Young University, Computer Science
[email protected]
Abstract. Computer graphics courses for computer science undergraduates typically involve a series of programming projects. It is a difficult problem to design a set of projects which balance establishment of roots in foundational graphics with exploration of current branches. We posit projects involving rasterizing a triangle with interpolated vertex colors, ray tracing, and inverse kinematics solvers as best practices in project design for graphics courses for fourth-year computer science undergraduates. We also discuss projects involving tool usage (rather than tool creation) and implementing a full viewing pipeline as worst practices. These best and worst practices are based on three years of project design in one course and a survey of projects in similar courses at other universities.
1 Introduction
Project selection for a computer graphics (CG) class designed for undergraduate seniors majoring in Computer Science (CS) as part of a four-year Bachelor of Science degree program is a difficult problem. Topics should be selected so that the resulting course conveys the fun and excitement of modern branches of CG while being grounded in the CG roots that sustain the branches. To carry the analogy a little further, a well-designed graphics course might be compared to a healthy tree with deep roots in foundational graphics concepts that anchor the tree and lofty branches into recent CG topics that make the tree interesting and useful. In such a course, students should be asked to do difficult things, feel a sense of accomplishment at having done those things, and connect what they learned to what they see in modern CG, even if the student never touches CG again as part of their profession. At the same time, students going on to graduate research in CG should have a foundation from which they can complete more advanced courses in CG. We believe that accomplishing these objectives requires a grounding in foundational topics together with a deliberate push toward modern topics. Achieving this balance has proven difficult. In this paper, we posit a more precise characterization of "foundational" and "modern" CG topics and use that characterization to clearly state the CG project
selection problem. We then review projects from three similar courses at other universities and discuss several of our attempts to balance foundational and modern topics at our university. We close with a subjective discussion of best practices drawn from these experiences.
2 Course Objectives
Upon successful completion of the fourth-year CG course, students should:
– have had some fun and felt the excitement of computer graphics,
– have seen how solid mathematical foundations lead to elegant code in computer graphics,
– have built their ability to write difficult programs, and
– be able to connect foundational graphics topics with what they see in games and movies.
3 Brief Survey of Undergraduate Graphics Courses
In this section we describe projects assigned in similar courses at other universities. We focus our search on courses that use Shirley’s textbook [1] either as a required or recommended text. Table 1 summarizes the projects in these courses and includes our own for comparison. The Spring 2009 offering of Fundamentals of Computer Graphics at Penn State University taught by Liu included four projects and OpenGL as a common platform for many of the assignments [2]. Students implemented a heightmap shader which took a grayscale image as input and rendered a 3D interactive view of that terrain. In the second project, students implemented Catmull-Rom splines on which they built a physics simulation of a roller coaster along those curves. The students also implemented a ray tracer which supported spheres, triangles and textures. In a more open-ended project, students read papers on texture synthesis and implemented a method from a paper or invented their own.

Table 1. Summary of projects in three undergraduate computer graphics courses which use [1] as a text. The Spring 2011 version of our course is listed for comparison.

             | PSU                      | U. Vic.                | CalPoly SLO         | BYU
Platform     | OpenGL                   | OpenGL                 | OpenGL              | XNA
Project 1    | Heightmap                | Lerping triangle color | Drawing objects     | Lerping triangle color
Project 2    | Roller coaster (splines) | L-systems (transforms) | 3D model transforms | 3D viewing pipeline
Project 3    | Ray tracer               | 3D world               | Lighting            | IK solver
Project 4    | Texture synth.           | Ray tracer             | 3D world            | Ray tracer
Project 5    |                          | Particle system        |                     | Content pipeline
Open Project | Texture synth.           |                        | Pick a topic        | A game
Introduction to Computer Graphics in Summer 2010 at the University of Victoria taught by A. Gooch [3] included five projects and OpenGL as a common platform. The first project involves linear interpolation of color across the face of a triangle, much the same as the class we teach. The second project uses L-systems as a vehicle for teaching transformation matrices. In the third project, students construct an OpenGL program which uses OpenGL commands to create a “glass-ball” user interface. Project four is a ray tracer and project five is a particle system which supports fireworks, waterfalls and other effects. The Fall 2010 offering of Introduction to Computer Graphics at Cal Poly, San Luis Obispo taught by Wood [4] included four well-defined projects and a self-directed final project. The first project involves implementing an object-based drawing program for rectangles, circles and triangles. The second project involves displaying and transforming 3D objects using rotate, scale and translate with OpenGL commands. Project three adds lighting, shading and perspective transforms to the transforms from project two. Project four asks students to implement an interesting 3D world in which the user can move through the world. The final project is an open-ended project in which students select a topic and implement it. Recent SIGGRAPH papers, posters and sketches are mentioned as good sources for ideas for the final project.
4 The Problem
The central problem is project selection. The goal is to select a set of projects which give students a grounded foundation (i.e., the roots) in the mathematics and algorithms that make graphics work while preparing them to appreciate and perhaps build new algorithms (i.e., the branches) on top of existing implementations of foundational graphics algorithms. In this section, we define the problem more precisely by clarifying what we mean by “foundational CG” and “modern CG.” After making both terms precise, we define the problem as: how do we select projects that provide students with a grounding in foundational topics to support further exploration of modern CG topics?
4.1 Definitions
We split CG topics into two groups: modern and foundational CG. The foundational group is characterized by topics that have been implemented in multiple, widely available forms, and those implementations are more or less standardized. The modern CG group is characterized by topics that depend on the existence of implementations of topics from the foundational group. Some foundational topics, such as patterned fill algorithms or vector drawing, do not have strong dependencies with modern CG topics. The modern topics do not include topics that are the subjects of ongoing research. Modern topics, in this setting, mean topics for which algorithms are widely available, well understood and generally agreed upon.
We classify the following topics as foundational CG topics:
– Scan conversion, by which we mean converting a shape into pixels illuminated on a raster device.
– Culling, by which we mean removing shapes or pixels from the rendering pipeline because they are not visible.
– Viewing pipeline, by which we mean converting a point in world space coordinates to a point in raster, or screen space, coordinates (a minimal single-point example is sketched after this list).
– Illumination, by which we mean computing color intensities across the face of a shape based on material properties of the shape and a lighting model.
– Texture mapping, by which we mean mapping an image onto a surface.
The following are representative of topics in modern CG:
– Keyframe animation, by which we mean setting position or other attributes of an object at certain points in time and interpolating between those positions over time. This could include smooth-in and smooth-out functions or just linear interpolation.
– Inverse kinematics, by which we mean setting up a series of rigid objects connected by joints, defining a target and allowing the resulting object to move toward the target. This can be done in 2D or 3D using a variety of joints.
– Collisions, by which we mean determining if two pieces of rigid polygonal geometry actually collide. The XNA API includes approximate collisions based on bounding spheres but we mean actual collisions between actual pieces of geometry rather than approximations. Exact collisions can be important in simulations of colliding non-rigid geometry like cloth or skin.
– Stereo 3D, by which we mean achieving the illusion of depth on a screen using stereographic projection, which might be expanded to include perception.
We have omitted GPU and GPGPU programming and architectures from the list of modern topics because we feel, subjectively, that this topic belongs in a more general course on concurrent programming architectures or in a stand-alone course on SIMD or SIMD-like programming and architectures. It would be reasonable to add GPGPU programming for graphics to the list of modern topics. Some foundational and modern topics have been left out of each list; the lists given here focus on topics covered in the CG courses discussed in this paper.
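To make the viewing pipeline definition above concrete, the sketch below pushes a single world-space point through a view matrix, a projection matrix, the perspective divide, and a viewport transform. The OpenGL-style projection matrix, the flipped y axis for raster coordinates, and all names are illustrative assumptions of ours rather than material from any of the courses discussed here.

```python
import numpy as np

def world_to_raster(point_world, view, proj, width, height):
    """Send one 3D world-space point to integer pixel coordinates."""
    p = np.append(point_world, 1.0)            # homogeneous coordinates
    clip = proj @ (view @ p)                   # world -> eye -> clip space
    ndc = clip[:3] / clip[3]                   # perspective divide -> [-1, 1]^3
    x = (ndc[0] * 0.5 + 0.5) * (width - 1)     # viewport transform
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * (height - 1)
    return int(round(x)), int(round(y)), ndc[2]   # keep depth for a z-buffer

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    return np.array([[f / aspect, 0, 0, 0],
                     [0, f, 0, 0],
                     [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
                     [0, 0, -1, 0]])

view = np.eye(4)
view[2, 3] = -5.0                              # camera 5 units back from the origin
print(world_to_raster(np.array([1.0, 0.5, 0.0]), view,
                      perspective(60, 4 / 3, 0.1, 100), 640, 480))
```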
5 Project Selection at Brigham Young University (BYU)
BYU offers a four year Bachelor of Science degree in CS. BYU is a private university with an enrollment of about 32,000 students of which about 29,000 are undergraduates [5]. The Computer Science Department had a total of 507 declared majors in 2010 with 466 men and 41 women [5]. The graphics class in Computer Science at BYU is a project-based class in which lectures and grading are based primarily on projects. Our discussion of each variant of the course focuses on the projects assigned in the course.
Table 2. Projects used in the Spring Term offering of the CG class at BYU over 2009-2011

                          | 2009              | 2010              | 2011
triangle scan conversion  | omitted           | part of viewing   | with vertex color interpolation
viewing                   | with OpenGL       | complete pipeline | complete, but for points only
ray tracing               | omitted           | diffuse spheres   | spheres, triangles, highlights
IK solver                 | 2D with 1D joints | 2D with 1D joints | 2D with 1D joints
game                      | in OpenGL         | in XNA            | in XNA
In each variant, the course consists of four to six projects in which students implement programs that solve well-defined problems with well-defined specifications. The course also includes a semester project in which the students propose and implement a game (or other 3D interactive graphics application). The author has taught the class during Spring Term (May through mid-June) each year from 2007 to the present year. At BYU, terms have half the duration of a semester and courses worth c credit hours meet 2c hours per week when taught over a term rather than c hours per week when taught during a semester. The reduced duration of a course taught over a term may skew both the students’ and author’s observations about project selection because there is less time between class sessions to complete projects and assignments than there is over a semester despite covering roughly the same material. During the span from 2007 to Spring Term 2011, 79 students took the course. Of those 79 students, 77 were students with declared majors in Computer Science (the other two were both Mechanical Engineering majors). The prerequisites for the class are linear algebra, taught by the math department, and a software design course which is the final course in a sequence of 4 programming classes. For most students, this course is a first experience with visual computing in general and with CG. The following subsections discuss coverage of specific topics in the class during the 2009, 2010 and 2011 offerings. We review coverage in each class of triangle rasterization, the viewing pipeline, ray tracing, inverse kinematics and a term-long game project. In the next section we discuss lessons learned from these offerings.
5.1 2009: Shallow Roots
The 2009 offering of the class included four “usage projects” in which students were required to use various implementations of foundational concepts to make an image or short video clip. The usage projects included using tools like Maya or Vue to do Phong shading, displacement maps, bump maps and key framing, using POVRAY for ray tracing, and making a stereoscopic 3D image pair using any tool. Students were not required to implement a triangle rasterizer and only implemented part of a complete viewing pipeline. Students implemented part of a
viewing pipeline by using the OpenGL GL_SMOOTH shading model to shade triangles and the GL_DEPTH_TEST command to accomplish depth buffering. This left perspective viewing, camera transforms and model transforms for students to implement. Ray tracing was omitted from this version of the course but students did implement a 2D inverse kinematics solver. The term project consisted of implementing an interactive application which included a 3D world, multiple points of view and some form of control. Discussion. The usage projects did not meaningfully contribute to students’ understanding of the mathematics and algorithms behind Phong shading, displacement maps, bump maps and key framing. The usage projects helped students understand how to set parameters in each of these models but did not lead to understanding of those models much beyond what might be covered in a lecture. Other projects were minimized in order to allocate more time to usage projects. Students did not implement their own triangle rasterization routine which meant that the details of this part of the viewing pipeline remained opaque to them. Students did not implement a depth buffer either. The loss of triangle rasterization as a project seemed most unfortunate because converting a single triangle into illuminated pixels on a screen can be used to understand GPU architectures. The IK solver was a focused experience in understanding one aspect of animation. The project seemed to have the right scope and difficulty and provided a good context for in-class discussions of other approaches to animation including key framing and motion capture.
5.2 2010: Unbalanced Roots
In 2010 an effort was made to deepen students’ experiences with foundational topics. Usage projects were dropped (both for this and future offerings) and students were allowed to use fewer constructs provided by a graphics API, like OpenGL, in their viewing pipeline projects. At the same time, we switched the student project development platform from OpenGL to XNA. This was done to give students easier access to infrastructure needed to include sound, video game controllers and textured models in their term game projects. The intention was not to make the class a class about XNA but rather to use XNA as a platform. Students implemented triangle rasterization as part of a complete viewing pipeline. Triangle rasterization was done on a triangle-by-triangle basis rather than with active edge tables in order to maintain a closer connection to GPU architectures. The first version of the viewing pipeline included flat-shaded triangles with z-buffering and an orthographic projection. Shading and z-buffering were added as extensions to the inner loop of the triangle rasterizer. The second version of the viewing pipeline added Phong shading (per-pixel illumination based on interpolating vertex normals) and translation. Students completed a third version
of the viewing pipeline which included perspective projection and camera transformations, which were also added as extensions to the triangle rasterizer. At the end of this series of projects, students had implemented a complete viewing pipeline from parsing the file all the way to rotating a shaded model in 3D. After the viewing pipeline projects, students could implement either an IK solver or a ray tracer but were not required to do both. Most students implemented a simple 2D IK solver rather than a ray tracer. This was most likely because the IK solver was discussed in class first. A new content pipeline project required students to create and import a textured 3D model into an interactive application which took input from something other than a keyboard or mouse. The purpose of this project was to ease the transition to the term game project. The term game project included new requirements to include sound, a textured 3D object created using some 3rd party tool (like Maya) and to take input from a controller other than a keyboard or mouse. Discussion. Implementing a full viewing pipeline from triangles described in a text file to perspective viewing of Phong shaded objects led to overly deep coverage of the viewing pipeline at the expense of covering ray tracing and animation topics. Switching to XNA allowed us to branch out into modern games architectures by simply allowing students to use sound, controllers and collisions with minimal effort. All of these can, of course, be done in OpenGL by including the right libraries but that process is somewhat tedious and time consuming compared to their use in XNA. Cross-platform deployment was lost in this decision, but cross-platform compatibility is not a significant issue in student course projects. The IK solver worked well as a project and was supplemented outside of class by adding a motion capture day in which students captured their motion using a passive optical motion capture system. The relationship between IK and motion capture is that IK, keyframing and motion capture are different approaches to the problem of setting joint positions to position an end effector at a target position. IK solvers compute the joint positions, in keyframing the joint positions are set by the user, and in motion capture the positions are recorded from live action. This led to a good discussion of motion capture in films such as “Polar Express” and “Avatar”.
5.3 2011: Roots and Branches
The 2011 offering of the course attempted to rebalance the topic distribution compared to the 2010 offering by simplifying the viewing pipeline projects. Ray tracing was reintroduced and extended compared to 2009. In this class we also used XNA as a common implementation platform. Students were first asked to implement a simple triangle rasterizer, as in 2010, which interpolated color across the face of a triangle. Students implemented a complete viewing pipeline with perspective and camera motion but only for points in space rather than shaded triangles. We approached the 2D IK solver project as a chance to understand model transforms in addition to an exploration
of IK concepts. The ray tracing project was required but extended, compared to the 2009 requirement, to include specular highlights and ray-triangle intersections. Finally, the content pipeline project from 2010 was reused in 2011. Discussion. Projects involving rasterizing a single triangle with interpolated vertex colors along with a viewing pipeline for points in space provide good coverage of both rasterization and viewing. Understanding scan conversion and viewing provides a good starting point for understanding GPU architecture, vertex shaders and fragment shaders. These projects omit implementation of a shading model but this will be done in the context of a ray tracer rather than rasterization. Ray tracing provides a foundation for discussing global illumination in movie and game production. Foundational roots in ray tracing enabled in-class discussion of directional occlusion caching in the film “Avatar” based on [6]. Directional occlusion caching involves sending out rays from points in the scene to determine occlusion in various directions. This is similar to a path tracer in which multiple shadow rays are sent for each collision point. Caching directional occlusion information using spherical harmonics, as in [6], was also discussed in class.
6 Best Practices
The following best practices are drawn from the projects listed for CG courses at other universities as well as our own experiences.
6.1 Rasterizing a Triangle
In our course, students are asked to write a program which draws a single triangle to the screen one pixel at a time. Color is linearly interpolated across the face of the triangle. In this project, students do not implement active edge tables. Instead, their program consumes a single triangle at a time in a manner similar to the algorithm given in section 8.1.2 of [1]. We have found that triangle rasterization with linearly interpolated vertex colors makes a good first project in a CG course. It is also the first project in Gooch’s CG course at the University of Victoria. It is a good first project because it is a simple context in which students can have their first exposure to corner cases (like vertical or horizontal edges), floating point precision issues and linear transforms. Triangle rasterization and associated processes lie at the foundation of rendering pipelines like GPUs. Once students understand this process, they are in a better position to understand the apparent oddities of GPU architectures. For example, it is not difficult for students to realize that the rendering of each triangle is independent (except for the depth buffer) and that parallelizing the process would not be difficult.
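A minimal sketch of the kind of rasterizer this project asks for is given below, using barycentric coordinates to interpolate vertex colors over the pixels inside one triangle. The function and variable names and the brute-force bounding-box loop are illustrative choices of ours rather than the course's specification, and the sketch ignores fill rules for shared edges.

```python
import numpy as np

def rasterize_triangle(image, verts, colors):
    """Fill one triangle into image (H x W x 3), interpolating vertex colors."""
    (x0, y0), (x1, y1), (x2, y2) = verts
    # Twice the signed area of the triangle; zero means a degenerate triangle.
    area = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if area == 0:
        return
    # Bounding box of the triangle, clipped to the image.
    xmin = max(int(np.floor(min(x0, x1, x2))), 0)
    xmax = min(int(np.ceil(max(x0, x1, x2))), image.shape[1] - 1)
    ymin = max(int(np.floor(min(y0, y1, y2))), 0)
    ymax = min(int(np.ceil(max(y0, y1, y2))), image.shape[0] - 1)
    for y in range(ymin, ymax + 1):
        for x in range(xmin, xmax + 1):
            # Barycentric coordinates of the pixel (x, y).
            w0 = ((x1 - x) * (y2 - y) - (x2 - x) * (y1 - y)) / area
            w1 = ((x2 - x) * (y0 - y) - (x0 - x) * (y2 - y)) / area
            w2 = 1.0 - w0 - w1
            if w0 >= 0 and w1 >= 0 and w2 >= 0:        # inside or on an edge
                image[y, x] = w0 * colors[0] + w1 * colors[1] + w2 * colors[2]

canvas = np.zeros((200, 200, 3))
rasterize_triangle(canvas,
                   np.array([[20.0, 30.0], [180.0, 60.0], [90.0, 170.0]]),
                   np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]))
```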
6.2 Ray Tracing
The ray tracing project involves casting rays to draw spheres and triangles using Phong illumination. Unlike projects in similar CG courses at the University of Victoria and Penn State, we do not map textures onto the triangles. This reduces the visual appeal of the final image. Ray tracing is a good project for a CG course because it allows students to see mathematics applied in a visual way to make images. Ray tracing can push students’ understanding of viewing transforms. Ray tracing can be a fun experience in implementing the mathematics of intersections and lighting to create a picture from nothing. Students occasionally extend the project simply because they were interested in improving the quality of their results. Ray tracing makes a good foundation for discussing issues in global illumination and approaches to resolving those issues in the context of game or film production. Ray tracing is a good foundation because students are then in a position to understand the limitations of ray tracing, such as the lack of color bleeding, and this motivates discussions of other global illumination techniques, such as radiosity.
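The core of such a ray tracer fits in a few lines; the sketch below shows a ray-sphere intersection and a simple diffuse-plus-highlight shade for a single hard-coded light. The scene, the specular exponent, and all names are illustrative assumptions of ours, not the actual project code.

```python
import numpy as np

def intersect_sphere(origin, direction, center, radius):
    """Smallest positive ray parameter t, or None if the ray misses the sphere."""
    oc = origin - center
    b = 2.0 * np.dot(direction, oc)          # direction is assumed unit length
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0:
        return None
    t = (-b - np.sqrt(disc)) / 2.0            # nearer root first
    if t > 1e-6:
        return t
    t = (-b + np.sqrt(disc)) / 2.0
    return t if t > 1e-6 else None

def shade(point, normal, view_dir, light_pos, color):
    """Diffuse term plus a Phong specular highlight for one point light."""
    to_light = light_pos - point
    to_light = to_light / np.linalg.norm(to_light)
    diffuse = max(np.dot(normal, to_light), 0.0)
    reflect = 2.0 * np.dot(normal, to_light) * normal - to_light
    specular = max(np.dot(reflect, -view_dir), 0.0) ** 32
    return np.clip(color * diffuse + specular, 0.0, 1.0)

# One ray through a tiny scene: a unit sphere at the origin, light above and to the right.
origin = np.array([0.0, 0.0, 3.0])
direction = np.array([0.0, 0.0, -1.0])
t = intersect_sphere(origin, direction, np.zeros(3), 1.0)
if t is not None:
    hit = origin + t * direction
    n = hit / np.linalg.norm(hit)
    print(shade(hit, n, direction, np.array([2.0, 2.0, 2.0]), np.array([1.0, 0.2, 0.2])))
```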
6.3 IK Solver
Among the three university courses surveyed, the IK solver project is unique to our version of a CG course. In this project, students implement an IK solver with 1D revolute joints which rotate about the Z-axis so that the IK arm remains in the XY plane. We use the transpose of the Jacobian to weight rotation values. This project can be an experience in creating a program which appears to autonomously grasp a target. Like L-systems in Gooch’s University of Victoria course, an IK solver can be a chance to learn model transforms and the importance of performing model transforms in the right order. The IK solver can help students connect mathematics to programming and to solving an interesting problem. Students appear to pick up the intuitive meaning of the Jacobian fairly well despite not having a course in vector calculus. Restricting rotation to the Z-axis simplifies the project. Understanding IK solvers can be used as a foundation for discussing rigging and IK joints in 3D animation packages.
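A compact sketch of the Jacobian-transpose update for such a planar arm is shown below. The link lengths, step size, and iteration count are illustrative assumptions; only the update rule Δθ = γ Jᵀ(target − end effector) is the technique named above.

```python
import numpy as np

def forward_kinematics(angles, lengths):
    """Joint positions (including the end effector) of a planar revolute chain."""
    positions = [np.zeros(2)]
    total = 0.0
    for theta, ell in zip(angles, lengths):
        total += theta
        positions.append(positions[-1] + ell * np.array([np.cos(total), np.sin(total)]))
    return np.array(positions)

def jacobian_transpose_step(angles, lengths, target, gain=0.05):
    """One IK update: dtheta = gain * J^T * (target - end_effector)."""
    joints = forward_kinematics(angles, lengths)
    end = joints[-1]
    error = target - end
    # Column i of J is the derivative of the end effector w.r.t. joint i;
    # for a revolute joint about Z it is perpendicular to (end - joint_i).
    J = np.zeros((2, len(angles)))
    for i in range(len(angles)):
        r = end - joints[i]
        J[:, i] = np.array([-r[1], r[0]])
    return angles + gain * (J.T @ error)

angles = np.array([0.3, -0.2, 0.4])
lengths = np.array([1.0, 0.8, 0.5])
target = np.array([1.2, 1.0])
for _ in range(200):                       # iterate until the arm reaches the target
    angles = jacobian_transpose_step(angles, lengths, target)
print(forward_kinematics(angles, lengths)[-1], "vs target", target)
```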
7 Worst Practices
We have tried a few projects which did not contribute to the course objectives. Two are discussed here. A complete viewing pipeline with camera and model transforms, Phong illumination and Gouraud, or per-primitive, shading using interpolated vertex normals seemed to require excessive time in and out of the classroom while not significantly contributing to course objectives. Students struggled to grasp interpolating a normal, depth and color across a triangle while simultaneously having their first experience with depth buffering and illumination models. Rather than split the project into many parts which span about half of the course duration, we dropped many of the topics but kept triangle rasterization.
Students reported that usage projects, in which students use 3D modeling tools to explore implementations of foundational ideas, did not contribute to course objectives much beyond what was accomplished in class lectures. These include projects in which students Phong shade a sphere, bump map a sphere or key frame a simple animation. We have dropped these projects and have no plans to reintroduce them.
8 Summary
Based on subjective feedback from students as well as a survey of similar classes, we believe that the following course projects contribute to building students’ roots in computer graphics while allowing them to branch into modern topics: triangle rasterization with interpolated vertex colors, a viewing pipeline for vertices which allows 3D model and camera transforms, ray tracing triangles and spheres (without texture mapping) using a simple lighting model, and an IK solver based on the transpose of the Jacobian with 1D revolute joints. An open ended concluding project based on writing a game is also useful. Our conclusions in this paper have been necessarily tentative and subjective. It would be interesting to define and measure a set of metrics for determining how well CG course projects contribute to learning outcomes. Such metrics might lay a foundation for more principled design of CG course projects.
References
1. Shirley, P., Ashikhmin, M., Gleicher, M., Marschner, S., Reinhard, E., Sung, K., Thompson, W., Willemsen, P.: Fundamentals of Computer Graphics, 2nd edn. A. K. Peters, Wellesley (2005)
2. Liu, Y.: CMPSC 458: Fundamentals of computer graphics at the Pennsylvania State University (2010), http://vision.cse.psu.edu/courses/CMPSC458/cmpsc458.shtml (accessed, May 2011)
3. Gooch, A.: CSC305: Introduction to 3d computer graphics at the University of Victoria (2010), http://webhome.csc.uvic.ca/~agooch/teaching/CSC305/ (accessed, May 2011)
4. Wood, Z.: CSC-CPE 471: Introduction to computer graphics at California Polytechnic State University, San Luis Obispo (2010), http://users.csc.calpoly.edu/~zwood/teaching/csc471/csc471.html (accessed, May 2011)
5. Brigham Young University: Y facts: BYU demographics (2011), http://yfacts.byu.edu/viewarticle.aspx?id=135 (accessed, May 2011)
6. Pantaleoni, J., Fascione, L., Hill, M., Aila, T.: PantaRay: fast ray-traced occlusion caching of massive scenes. ACM Trans. Graph. 29, 37:1–37:10 (2010)
Raydiance: A Tangible Interface for Teaching Computer Vision Paul Reimer, Alexandra Branzan Albu, and George Tzanetakis University of Victoria Victoria, BC, Canada [email protected], [email protected], [email protected]
Abstract. This paper presents a novel paradigm for prototyping Computer Vision algorithms; this paradigm is suitable for students with very limited programming experience. Raydiance includes a tangible user interface controlled by a spatial arrangement of physical tokens which are detected using computer vision techniques. Constructing an algorithm is accomplished by creating a directed graph of token connections. Data is processed, then propagated from one token to another by using a novel Light Ray metaphor. Our case study shows how Raydiance can be used to construct a computer vision algorithm for a particular task.
Imagine you are an undergraduate student registered in a Computer Vision class. You need to prototype a multi-step computer vision process for your class project. You have limited experience with programming environments such as MATLAB and C++. For each processing step, many algorithms are available through the MATLAB Image Processing Toolbox and OpenCV [2]. You need to test all these algorithms in order to make an informed choice. You also need to write the software that integrates all selected algorithms into a computer vision system. Each algorithm typically works with several parameters, thus when the complexity of the computer vision task increases, the combinatorial difficulty of selecting the best algorithms and optimizing their parameters may easily grow out of control. The scenario described above represents a typical bottleneck in project-based undergraduate and even Masters-level Computer Vision classes. This raises the following questions: Can we teach Computer Vision with less emphasis on the low-level programming tasks? Can we teach Computer Vision to students with limited experience in programming? During the last two decades, significant progress has been made in major areas of computer vision, with numerous robust algorithms being developed for image enhancement, segmentation, motion tracking and object recognition. Implementations of such algorithms are available through the MATLAB Image Processing Toolbox and the OpenCV library [2]. However, the task of integrating existing algorithms into a functional system is not trivial, since one needs to program the glue code to link these algorithms.
This paper proposes a new paradigm called Raydiance to assist novice programmers in the design, testing, and visualization of Computer Vision algorithms. Raydiance includes a tangible user interface controlled by a spatial arrangement of physical tokens which are detected using computer vision techniques. Constructing an algorithm is accomplished by creating a directed graph of token connections. Data is processed, then propagated from one token to another by using a novel Light Ray metaphor. We show how Raydiance can be used to construct a computer vision algorithm for a particular task. Raydiance makes use of image processing techniques in OpenCV[2], and libCVD[1]. The remainder of our paper is structured as follows. Section 1 discusses similar approaches and implementations of visual programming interfaces used for rapid software prototyping, and the foundations of tangible computing interfaces using fiducial markers. Section 2 describes the proposed approach for the design of Raydiance. Section 3 presents a case study which consists of a detection task implemented in Raydiance. Section 4 draws conclusions and outlines future work directions.
1 Related Work
Raydiance is based on a dataflow programming paradigm. Unlike other visual programming environments, Raydiance uses fiducial markers to create a tangible interface which avoids the use of the keyboard and mouse. Concepts of dataflow programming are reviewed in section 1.1. Visual programming environments are discussed in section 1.2. Section 1.3 explains how fiducial markers can be used to implement a tangible computing interface.
1.1 Dataflow Programming
The structuring of computer programs as a sequence of interconnected modules is known as dataflow programming. This approach was proposed by Morrison [12] in the early 1970s. This concept was first used to design, implement and visualize processes involved in processing banking transactions. In addition to the ability to visualize algorithms that have a complex dependency graph, dataflow programming also presents an efficient model for processing data. Kernels operate on blocks of data, and are combined to form a directed graph of data dependencies, often using a visual programming environment. The resulting network can be scheduled to process the data in parallel where there are no data dependencies, or to dynamically allocate processing resources to prioritized tasks. The flow-based programming paradigm has seen several variants and many different implementations. Johnston, Hanna and Millar [9] give a history of the transition from fine-grained hardware-focused dataflow programming to more coarse-grained, modular designs. One of the most significant advances in dataflow programming is the emergence of visual programming environments tailored towards building dataflow networks.
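As a toy illustration of the coarse-grained model described above, the sketch below wraps kernels as named nodes, wires them into a directed acyclic graph, and executes them in dependency order. All names are ours and the example has no connection to the systems cited in this section.

```python
from collections import deque

class Node:
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, list(inputs)

def run_dataflow(nodes, source_value):
    """Execute a directed acyclic graph of kernels in topological order."""
    indegree = {n.name: len(n.inputs) for n in nodes}
    by_name = {n.name: n for n in nodes}
    results = {}
    ready = deque(n.name for n in nodes if not n.inputs)
    while ready:
        name = ready.popleft()
        node = by_name[name]
        args = [results[i] for i in node.inputs] if node.inputs else [source_value]
        results[name] = node.func(*args)
        for other in nodes:                       # hand results to dependent kernels
            if name in other.inputs:
                indegree[other.name] -= 1
                if indegree[other.name] == 0:
                    ready.append(other.name)
    return results

graph = [
    Node("source", lambda x: x),
    Node("double", lambda x: 2 * x, inputs=["source"]),
    Node("offset", lambda x: x + 1, inputs=["source"]),
    Node("sum",    lambda a, b: a + b, inputs=["double", "offset"]),
]
print(run_dataflow(graph, 10))   # {'source': 10, 'double': 20, 'offset': 11, 'sum': 31}
```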
1.2 Visual Programming
Visual programming environments present a number of benefits to users: intuitive visualization of control flow, no requirement for mastering a computer language grammar/syntax, and the potential for interactive control of parameters and variations of control flow without the need for making changes in source code. For rapid software prototyping, Zhang, Song and Kong describe the benefits of visual programming environments in [14], while Lomker et al. [11] present a visual programming environment (with elements of dataflow programming) for designing a computer vision algorithm.
1.3 Tangible, Fiducial-Based Interfaces
A tangible interface for controlling a computer describes a setup where affordances are provided by physical components of the interface. This is in contrast to the use of keyboard/mouse driven interfaces which employ the same hardware to control a variety of software. A tangible interface embodies a direct manipulation paradigm. This allows users to physically manipulate a hardware setup, which in turn affects the behaviour of a software application. Tangible interfaces are an emerging trend in computing, and are especially common in interactive, multimedia installations. Recently, tangible computing interfaces using tokens detected by computer vision techniques–such as the reacTable proposed by Kaltenbrunner, Jorda, and Geiger [10]–have been tailored specifically for controlling multimedia processing algorithms. The shape, translation, and rotation of tokens placed on a planar desktop surface control some aspect of a multimedia processing pipeline. Early versions of these interfaces had an audio focus, to complement the visual process of designing an audio processing interface (e.g. a musical instrument). Tokens designed specifically for detection, classification, and spatial location/orientation are known as fiducial markers. Fiducial marker detectors and trackers operate by identifying known objects with distinct visual properties. A common choice is a hierarchy of shapes contained within the fiducial design, represented as a region adjacency graph (RAG), described by Costanza et al. in [6,7]. Bencina et al. [5] improve on the topological fiducial detector. We translate the concept of a tangible, fiducial marker-based interface typically used in artistic, multimedia applications to an educational environment used for prototyping computer vision algorithms using dataflow programming. We use a light ray metaphor to automatically establish connections between stages in a computer vision algorithm. The next section details our proposed approach.
2 Proposed Approach
Raydiance represents kernels of computer vision code via tokens. One might think of these tokens as symbolic visual representations of their associated code.
Fig. 1. Apparatus side-view; inset: top-view
Each token represents a distinct processing task, such as thresholding, background subtraction, etc. The tokens are embodied by fiducial markers which are placed on a planar surface within the field of view of a camera. Physical controls for parametric inputs, and visualizations of the output produced by each kernel, are rendered to the display surface located just underneath the token. The connection between kernels of code is performed geometrically, using a novel light ray metaphor (see Section 2.1). One should note an interesting duality: computer vision controls the functioning of Raydiance, which in turn is used for prototyping computer vision systems. The current version of Raydiance uses a planar arrangement of tokens, which are placed on a horizontal surface and filmed with a top-mounted camera as seen in Figure 1. The image plane of the camera is parallel to the planar surface used as the desktop. In the setup shown, the desktop surface and the visualization surface are the same: the desktop surface extends to the corners of a computer screen placed horizontally on a physical desktop, and the camera is aligned to capture all corners of the screen. Figure 1 shows a laptop with the screen fully opened, and fiducial tokens placed directly on the laptop screen. The user interface is designed so that controls for a particular token are drawn directly below the token, and move consistently with the token if the token is displaced.
The horizontal configuration of the display enables the user to view the desktop from any angle and opens the possibility of collaborative interaction among multiple users. The remainder of this section is structured as follows. Subsection 2.1 discusses the proposed light ray metaphor for token interconnections. Dataflow programming and visualization are discussed in subsection 2.2. Details on fiducial detection and tracking are given in subsection 2.3.
2.1 Light Ray Metaphor
A token-based software prototyping scheme has been proposed before in [14]; this scheme links tokens based on proximity criteria. This approach does not scale well for complex algorithms, since proximity-based connections are limited to one degree of freedom. Systems such as the reacTable [10] enable a slow, gradual building of audio processing systems, since the placement of each token has a global effect on the entire canvas; reconfiguring certain processing steps requires the repositioning of multiple tokens. For prototyping computer vision systems, more flexibility is desired. That is, one should be able to add/remove processing steps by displacing as few tokens as possible. This paper proposes therefore a new approach for linking tokens together and reconfiguring them with ease. We use a light ray metaphor for constructing directed graphs assembled from tokens located on a surface which represents a desktop. Tokens are either connected to, or disconnected from, a graph; a token may be a node in one or zero graphs. Each token that is connected to a graph searches for connections to tokens which will accept as input a similar data structure to that which the token produces. A connection is determined according to an intersection criterion, which for our application is represented by a light ray model. Each output port of the token emits a ray in the plane described by the desktop surface. The 2D spatial location of each token located on the desktop surface is used as the origin point for the ray, and the rotation of the token about the axis normal to the desktop surface, with respect to the coordinate system of the desktop surface, is used to determine the direction of the ray. Many input and output rays may be associated with the same token. Figure 2 shows a simple example usage of “prism” tokens for the decomposition of a colour image into three channels, followed by the recomposition of two of these channels. Therefore, Raydiance can be customized by choosing offsets for both translation and rotation of each output ray. The translation and rotation offsets are used to separate the outputs from each token; this permits a token to direct each output to multiple distinct tokens, by either varying the translation offset to form parallel rays, or varying the rotation offset to create a fan effect, or any arbitrary combination suitable to the application. A constant translation offset can add contextual meaning to the rays displayed on the visualization screen. For example, this can make it appear as if the rays emanate from image data below the fiducial tokens, rather than in the zero-offset case where rays are
Fig. 2. Multiple output, multiple input “prism” tokens
directly emanating from the token. Figure 2 shows a constant translation offset to the middle of the right-hand side of each token, and a 40-degree rotation offset applied incrementally to each output ray. Incident rays intersecting the bounds of another token denote a connection between the token that emitted the ray and the incident token. The connection is triggered by a positive result of a line-segment intersection test. The intersection test is illustrated in Figure 4. Let R = (R.a, R.b) be a ray emanating from a ’radiating’ token and m the number of sides (typically m = 4) of the token we are considering for intersection. For every side i, i = 1..m we compute the intersection point between the side and the ray R. The green circle indicates the intersection point with minimum distance, the orange circle denotes an alternate valid intersection at a greater distance, and the red circles represent invalid intersection points. The side that provides a valid intersection point located at the shortest distance from the ’radiating’ token is selected to establish a connection between the tokens. If no valid intersection points are found, then the two tokens are not connected.
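A sketch of this ray-side test is shown below for a token given by its four corner points; the parametric formulation, the parallel-ray tolerance, and all names are illustrative choices of ours rather than the implementation used in Raydiance.

```python
import numpy as np

def ray_segment_intersection(ray_origin, ray_dir, p, q):
    """Distance along the ray to segment pq, or None if there is no valid hit."""
    seg = q - p
    denom = ray_dir[0] * seg[1] - ray_dir[1] * seg[0]
    if abs(denom) < 1e-9:                                    # parallel: no unique hit
        return None
    diff = p - ray_origin
    t = (diff[0] * seg[1] - diff[1] * seg[0]) / denom        # parameter along the ray
    s = (diff[0] * ray_dir[1] - diff[1] * ray_dir[0]) / denom  # parameter along the side
    if t >= 0.0 and 0.0 <= s <= 1.0:
        return t
    return None

def nearest_token_hit(ray_origin, ray_dir, corners):
    """Smallest valid distance from the ray to any side of a (square) token."""
    hits = []
    for i in range(len(corners)):
        t = ray_segment_intersection(ray_origin, ray_dir,
                                     corners[i], corners[(i + 1) % len(corners)])
        if t is not None:
            hits.append(t)
    return min(hits) if hits else None

square = [np.array(c, dtype=float) for c in [(2, -1), (4, -1), (4, 1), (2, 1)]]
print(nearest_token_hit(np.array([0.0, 0.0]), np.array([1.0, 0.0]), square))  # 2.0
```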
2.2 Dataflow Programming
Interconnecting tokens results in a graph of computer vision kernels. The graph is used to represent an image/video processing algorithm, where each node of the graph represents a series of data transformations. Tokens represent instantiations of a particular type of transformation. Each token performs an image/video processing task, which can be completed in real-time for 640x480 pixel images at
Fig. 3. Using rotation to select from two similar tokens. Dashed lines and translucent images denote an (inactive) alternate processing path. The output of the alternate path is not shown.
Fig. 4. Using a line-segment intersection test to determine token interconnection
30 frames per second (fps). The input data is a video stream collected from one of multiple attached cameras, clocked at the specified framerate for that camera. Output data is collected from the final node(s) of the graph. To enable efficient prototyping of computer vision systems, several alternative implementations of common computer vision tasks (e.g. background subtraction, feature extraction) are included in Raydiance. This enables the direct comparison of two (or more) algorithms designed for the same task by comparing their visual output obtained for the same input data. An example of comparison of two thresholding algorithms is shown in Figure 3. Rotating the first token selects between two alternative processing paths. Data can be visualized at each stage of processing, in the spatial proximity of the token for that processing stage. Parameter values for the specific kernel represented by the token are also shown (see Figure 5).
2.3 Fiducial Detection and Tracking
A fiducial detector based on binary shape detection was chosen to avoid potential issues of colour imbalance due to ambient lighting variations. The graph building framework supports the use of several open-source detectors, and makes it simple to replace these detectors with alternate methods or an improved version of the same detector. The current version of Raydiance uses Libfidtrack[4], the same detector used in reacTable[10]. New fiducials can be generated automatically using the genetic algorithm proposed in [3], implemented in the open-source software Fid.Gen[13]. A tracker maintains a list of detected fiducials, switching among the ’found’, ’lost’, and ’updated’ states for each fiducial depending on the number of consecutive frames in which a fiducial has been detected. A similar tracker is also
Fig. 5. Rotated fiducial marker and associated visualization
Fig. 6. A Raydiance implementation of an algorithm for hand localization
maintained for the connections between fiducials. Only fiducials present in the fiducial tracker list serve as candidates for ray intersection tests, and these intersections are recomputed with each new video frame.
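The state switching described here can be illustrated with a few lines of code; the thresholds and names below are illustrative assumptions of ours, not the values used by Raydiance or libfidtrack.

```python
class FiducialTracker:
    """Minimal state tracker: 'found' after enough consecutive detections,
    'updated' while the marker keeps being seen, 'lost' after enough misses."""

    def __init__(self, found_after=2, lost_after=3):
        self.found_after, self.lost_after = found_after, lost_after
        self.hits, self.misses, self.state = {}, {}, {}

    def update(self, detected_ids):
        for fid in detected_ids:
            self.hits[fid] = self.hits.get(fid, 0) + 1
            self.misses[fid] = 0
            if self.hits[fid] >= self.found_after:
                already = self.state.get(fid) in ("found", "updated")
                self.state[fid] = "updated" if already else "found"
        for fid in list(self.state):
            if fid not in detected_ids:
                self.misses[fid] = self.misses.get(fid, 0) + 1
                if self.misses[fid] >= self.lost_after:
                    self.state[fid] = "lost"
        return dict(self.state)

tracker = FiducialTracker()
for frame_ids in ([7], [7], [7, 9], [], [], []):
    print(tracker.update(frame_ids))
```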
3 Case Study
This case study considers the task of detecting a human hand within each frame of a video stream from a webcam. This task is sufficiently simple to be suitable for a beginner-level computer vision course project, and it is interesting because of its applicability to real-world applications. Hand detection can be used, for example, to control a computer program using simple gestures made by waving a hand in front of a camera. To constrain the problem, it is assumed that only one hand is present in a camera frame, and that the bare skin is sufficiently lit to be visible in the video stream recorded by the camera. A multi-step hand detection algorithm is implemented in Raydiance as follows. In Step A (module 2 in Figure 6), a Gaussian blur is applied to the RGB image to remove small lighting artifacts and noise. Step B (modules 3-4 in Figure 6) represents a colour space transformation, which is a preprocessing step for colour-based skin detection. This colour space transformation is presented in [8]. Step C (modules 5.1, 5.2, and 5.3 in Figure 6) implements three tests from [8] in order to classify each pixel in the current frame as skin-colored or not. Each test produces a binary mask of pixels. In step D (module 6 in Figure 6) the results
of the three tests are compared and integrated, and the centroid of the hand is computed. The last step (module 7 in Figure 6) computes the bounding box and the contour of the hand. The hand detection algorithm is prototyped in Raydiance by selecting the appropriate modules and interconnecting them. No additional ’glue code’ is necessary. A student with little programming experience benefits from being able to understand how algorithms work by studying their behaviour to different inputs, and by comparing algorithms designed for the same task (i.e., the tests for skin detection).
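For readers who do want to see the same pipeline as plain code, the sketch below mirrors its structure (blur, colour transform, skin mask, centroid, bounding box) using OpenCV's Python bindings. The specific colour-space transform and the three skin tests of [8] are not reproduced; a single normalised-RGB threshold stands in for them, and the threshold values, kernel sizes, and the OpenCV 4 findContours signature are all assumptions of ours.

```python
import cv2
import numpy as np

def detect_hand(frame_bgr):
    """Rough skin-based hand localisation: returns (centroid, bounding box) or None."""
    # Step A: Gaussian blur to suppress noise and small lighting artifacts.
    blurred = cv2.GaussianBlur(frame_bgr, (7, 7), 0)
    # Step B: normalised-RGB chromaticity (a stand-in for the transform of [8]).
    rgb = blurred[:, :, ::-1].astype(np.float32) + 1e-6
    chroma = rgb / rgb.sum(axis=2, keepdims=True)
    r, g = chroma[:, :, 0], chroma[:, :, 1]
    # Step C: one illustrative skin test instead of the three tests of [8].
    mask = ((r > 0.36) & (r < 0.56) & (g > 0.26) & (g < 0.36)).astype(np.uint8) * 255
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    # Step D: centroid of the skin-coloured pixels.
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:
        return None
    centroid = (m["m10"] / m["m00"], m["m01"] / m["m00"])
    # Last step: bounding box of the largest skin-coloured contour.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    box = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return centroid, box

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    print(detect_hand(frame))
cap.release()
```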
4 Conclusion
This paper presents a novel paradigm for prototyping Computer Vision algorithms which is suitable for students with very limited programming experience. From an educational point of view, this enables decoupling the relatively steep learning curve in learning programming from learning how computer vision algorithms work and behave to different inputs. Therefore, we argue that this paradigm is well-suited for teaching computer vision to freshmen students in engineering and computer science as part of design courses. Moreover, the same paradigm can be used for teaching computer vision for non-technical audiences, such as students in visual arts etc. The technical contribution of the paper consists in a new strategy for interconnecting tokens in a tangible interface via a light ray metaphor. Future work will explore the scalability of this novel approach to more complex computer vision systems and large tabletop displays.
References 1. Cvd projects (2010), http://mi.eng.cam.ac.uk/~er258/cvd/index.html 2. Opencv wiki (2010), http://opencv.willowgarage.com/wiki 3. Bencina, R., Kaltenbrunner, M.: The design and evolution of fiducials for the reactivision system. In: Proceedings of the 3rd International Conference on Generative Systems in the Electronic Arts (3rd Iteration 2005), Melbourne, Australia (2005) 4. Bencina, R., Kaltenbrunner, M.: libfidtrack fiducial tracking library (2009), http://reactivision.sourceforge.net/files 5. Bencina, R., Kaltenbrunner, M., Jorda, S.: Improved topological fiducial tracking in the reactivision system. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) - Workshops. IEEE Computer Society, Washington, DC (2005) 6. Costanza, E., Robinson, J.: A region adjacency tree approach to the detection and design of fiducials. In: Video, Vision and Graphics, pp. 63–69 (2003) 7. Costanza, E., Shelley, S.B., Robinson, J.: Introducing audio d-touch: A tangible user interface for music composition. In: 6th Intl. Conference on Digital Audio Effects, (DAFX-03) (2003) 8. Gomez, G., Morales, E.F.: Automatic feature construction and a simple rule induction algorithm for skin detection. In: Proc. of the ICML Workshop on Machine Learning in Computer Vision, pp. 31–38 (2002)
9. Johnston, W.M., Hanna, J.R.P., Millar, R.J.: Advances in dataflow programming languages. ACM Computer Survey 36(1), 1–34 (2004) 10. Jordà, S., Geiger, G., Alonso, M., Kaltenbrunner, M.: The reactable: Exploring the synergy between live music performance and tabletop tangible interfaces. In: Proceedings Intl. Conf. Tangible and Embedded Interaction, TEI (2007) 11. Lomker, F., Wrede, S., Hanheide, M., Fritsch, J.: Building modular vision systems with a graphical plugin environment. In: International Conference on Computer Vision Systems, p. 2 (2006) 12. Morrison, J.P.: Data responsive modular, interleaved task programming system vol. 13(8) (January 1971) 13. toxmeister: Fid.gen reactivision fiducial generator (2009), http://code.google.com/p/fidgen 14. Zhang, K., Song, G.-L., Kong, J.: Rapid software prototyping using visual language techniques. In: IEEE International Workshop on Rapid System Prototyping, pp. 119–126 (2004)
Subvoxel Super-Resolution of Volumetric Motion Field Using General Order Prior
Koji Kashu¹, Atsushi Imiya², and Tomoya Sakai³
¹ School of Advanced Integration Science, Chiba University
² Institute of Media and Information Technology, Chiba University, Yayoicho 1-33, Inage-ku, Chiba 263-8522, Japan
³ Department of Computer and Information Sciences, Nagasaki University, Bunkyo-cho, Nagasaki, Japan
Abstract. Super-resolution is a technique to recover a high-resolution image from a low-resolution image. We develop a variational super-resolution method for subvoxel-accurate volumetric optical flow computation, combining variational super-resolution with variational optical flow computation. Furthermore, we use a prior with fractional order differentiation for the computation of the volumetric motion field to control the continuity order of the field. Our method computes the gradient and the spatial difference of high-resolution images from those of low-resolution images directly, without computing any high-resolution images as intermediate data for the computation of optical flow vectors of the high-resolution image.
1 Introduction
We develop an algorithm for super-resolution volumetric optical flow computation. Super-resolution of images is a technique to recover a high-resolution image from a low-resolution image and/or image sequence [5]. Volumetric optical flow is the motion field of a volumetric image. Therefore, super-resolution optical flow computation yields the motion field of each point on the high-resolution volumetric image from a sequence of low-resolution volumetric images. Our method computes the gradient and the spatial difference of high-resolution volumetric images from those of low-resolution images directly, without computing any high-resolution volumetric images as intermediate data for the computation of motion flow vectors of the high-resolution image. We assume that the resolution reduction system is described by the linear pyramid transform. The discrete pyramid transform reduces the size of images in the pyramid hierarchy. In the higher levels of the image-pyramid hierarchy, an image is transformed to very small images. Therefore, for the recovery of images and optical flow from images in the lower levels of the image pyramid hierarchy, we are required to recover the original images from an image of icon size. To solve the problem, we need to consider additional mathematical
constraints and priors to recover clear images and optical flow. The subvoxel-accurate optical motion field computation is required to compute the motion field vector of the inter-grid points. For multiresolution optical flow computation, the motion flow field computed in the coarse grid system is propagated to the field in the finer grid. This propagated field is used as the first estimate for the accurate optical flow computation in the finer grid. Interpolation [4] is a fundamental technique for estimation of subpixel values of images [5]. For this interpolation, the spline technique is a typical method. Furthermore, spline interpolation is a classical method for super-resolution of images and shapes. Spline-based interpolation for super-resolution [4] is derived as a model fitting problem with the least-square and energy-smoothness criteria for the model fitting term and priors in variational formulation [10,6]. The Horn-Schunck type optical flow computation adopted the least-square and energy-smoothness criteria for the model fitting term and priors, respectively [11,15]. Recently, to deal with sparsity of optical flow vectors and images, both for optical flow computation and super-resolution, the L1-constraint on the model fitting term and total variation (TV) [1] for priors, respectively, are widely used [7,8]. Efficient methods for solving L1-TV regularisation have been developed [14,9]. As the first step in direct computation of a high-resolution volumetric motion field from a low-resolution image sequence, we adopt the classical least-square and energy-smoothness criteria on the model fitting term and prior for variational super-resolution and variational motion field super-resolution, respectively. The pyramid transform reduces the size of the image if we use the same size of voxels for image representation [12,13]. If we use the same size of the image landscape for the results of the image pyramid transform, reduction by the pyramid transform acts as low-pass filtering. We accept the pyramid-transform-based image observation system.
2 Fractional Order Derivatives and Pyramid Transforms
Using the Fourier transform pair
$$F(\xi,\eta,\zeta) = \frac{1}{(2\pi)^{3/2}} \int_{\mathbf{R}^3} f(x,y,z)\, e^{-i(x\xi+y\eta+z\zeta)}\, dxdydz, \qquad (1)$$
$$f(x,y,z) = \frac{1}{(2\pi)^{3/2}} \int_{\mathbf{R}^3} F(\xi,\eta,\zeta)\, e^{i(x\xi+y\eta+z\zeta)}\, d\xi d\eta d\zeta, \qquad (2)$$
we define the operation $\Lambda$ as
$$\Lambda f(x,y,z) = \frac{1}{(2\pi)^{3/2}} \int_{\mathbf{R}^3} \sqrt{\xi^2+\eta^2+\zeta^2}\; F(\xi,\eta,\zeta)\, e^{i(x\xi+y\eta+z\zeta)}\, d\xi d\eta d\zeta. \qquad (3)$$
The operator $\Lambda$ satisfies the relation $\Lambda^{2\alpha} = (-\Delta)\Lambda^{2\varepsilon} = (-\Delta)(-\Delta)^{\varepsilon}$ for $\alpha = 1+\varepsilon$ where $0 < \varepsilon < 1$. Furthermore, we have the equality
$$\int_{\mathbf{R}^3} |\nabla f|^2\, dxdydz = \int_{\mathbf{R}^3} |\Lambda f|^2\, dxdydz, \qquad (4)$$
since
$$\int_{\mathbf{R}^3} |f|^2\, dxdydz = \int_{\mathbf{R}^3} |F|^2\, d\xi d\eta d\zeta. \qquad (5)$$
For a function $f(\mathbf{x})$, $\mathbf{x}=(x,y,z)^\top$, the pyramid transform $R$ of factor 2 and its dual transform $E$ are expressed as
$$Rf(\mathbf{x}) = \int_{\mathbf{R}^3} w(\mathbf{y})\, f(2\mathbf{x}-\mathbf{y})\, d\mathbf{y}, \qquad (6)$$
$$Eg(\mathbf{x}) = 2^3 \int_{\mathbf{R}^3} w(\mathbf{y})\, g\!\left(\frac{\mathbf{x}-\mathbf{y}}{2}\right) d\mathbf{y}, \qquad (7)$$
where $w(\mathbf{y}) = w(-\mathbf{y}) > 0$. These operations satisfy the relation
$$\int_{\mathbf{R}^3} Rf(\mathbf{x})\, g(\mathbf{x})\, d\mathbf{x} = \int_{\mathbf{R}^3} f(\mathbf{x})\, Eg(\mathbf{x})\, d\mathbf{x}. \qquad (8)$$

3 Optical Flow Computation
Setting the total derivative of a spatio-temporal volumetric image $f(x,y,z,t)$ to zero with respect to the time argument $t$, we have the equation
$$f_x u + f_y v + f_z w + f_t = 0, \qquad (9)$$
where $\mathbf{u} = (u,v,w)^\top = (\dot{x},\dot{y},\dot{z})^\top = \left(\frac{dx}{dt},\frac{dy}{dt},\frac{dz}{dt}\right)^\top$ is the motion of each point. Therefore, the motion $\mathbf{u} = (u,v,w)^\top$ of the point $\mathbf{x} = (x,y,z)^\top$ is the solution of eq. (9), which is singular. The mathematical properties of eq. (4) of the operator $\Lambda$ allow us to focus on variational optical flow computation in the form
$$J_\alpha(\mathbf{u}) = \int_{\mathbf{R}^3} \left\{ (\nabla f^\top \mathbf{u} + \partial_t f)^2 + \kappa\left(|\Lambda^\alpha u|^2 + |\Lambda^\alpha v|^2 + |\Lambda^\alpha w|^2\right) \right\} dxdydz, \qquad (10)$$
for $\kappa \ge 0$ and $\alpha = 1+\varepsilon$ where $0 \le \varepsilon < 1$, as a generalization of the energy functional of the Horn-Schunck method [11] such that
$$J(\mathbf{u}) = \int_{\mathbf{R}^3} \left\{ (\nabla f^\top \mathbf{u} + \partial_t f)^2 + \lambda\left(|\nabla u|^2 + |\nabla v|^2 + |\nabla w|^2\right) \right\} dxdydz. \qquad (11)$$
These energy functionals lead to the following definition.

Definition 1. We call the minimizer of eq. (10) the alpha optical flow.

Since $\Lambda = \Lambda^*$, the Euler-Lagrange equation of eq. (10) is
$$\Lambda^{2\alpha}\mathbf{u} + \frac{1}{\kappa}(\nabla f^\top \mathbf{u} + \partial_t f)\nabla f = 0. \qquad (12)$$
Specifically, for $\alpha = 1, \frac{3}{2}, 2$, the Euler-Lagrange equations are
$$\Delta\mathbf{u} - \frac{1}{\kappa}(\nabla f^\top \mathbf{u} + \partial_t f)\nabla f = 0, \qquad (13)$$
$$\Delta\Lambda\mathbf{u} - \frac{1}{\kappa}(\nabla f^\top \mathbf{u} + \partial_t f)\nabla f = 0, \qquad (14)$$
$$\Delta^2\mathbf{u} + \frac{1}{\kappa}(\nabla f^\top \mathbf{u} + \partial_t f)\nabla f = 0, \qquad (15)$$
since $\Lambda^2 = -\Delta$, $\Lambda^3 = -\Delta\Lambda$, and $\Lambda^4 = \Delta^2$.
4 Subvoxel Volumetric Motion Field Computation
Our purpose is to compute $\mathbf{u}$ which minimises the criterion
$$S(\mathbf{u}) = \int_{\mathbf{R}^3} \left\{ (Rf - g)^2 + \kappa|\Lambda^\alpha f|^2 + (\nabla f^\top \mathbf{u} + \partial_t f)^2 + \lambda\left(|\Lambda^\alpha u|^2 + |\Lambda^\alpha v|^2 + |\Lambda^\alpha w|^2\right) \right\} d\mathbf{x}. \qquad (16)$$
If $\lambda \gg 1$, these equations can be approximately separated to
$$\Lambda^{2\alpha} f_x - \frac{1}{\kappa\sigma} E(Rf_x - g_x) = 0, \qquad \Lambda^{2\alpha} f_y - \frac{1}{\kappa\sigma} E(Rf_y - g_y) = 0,$$
$$\Lambda^{2\alpha} f_z - \frac{1}{\kappa\sigma} E(Rf_z - g_z) = 0, \qquad \Lambda^{2\alpha} f_t - \frac{1}{\kappa\sigma} E(Rf_t - g_t) = 0, \qquad (17)$$
$$\Lambda^{2\alpha}\mathbf{u} + \frac{1}{\lambda}(\nabla f^\top \mathbf{u} + \partial_t f)\nabla f = 0,$$
since $\Lambda = \Lambda^*$.
5 Numerical Scheme
Using the semi-explicit discretisation of the diffusion equation
$$\frac{\partial f}{\partial \tau} = \Lambda^{2\alpha} f + \frac{1}{\kappa} E(g - Rf), \qquad (18)$$
we have the discretisation
$$f^{(l+1)}_{kmn} + \frac{\Delta\tau}{\kappa}\left(ERf^{(l+1)}\right)_{kmn} = f^{(l)}_{kmn} + \Delta\tau\left(\Lambda^{2\alpha} f^{(l)}\right)_{kmn} + \frac{\Delta\tau}{\kappa}(Eg)_{kmn}. \qquad (19)$$
This equation is decomposed into two steps:
$$h^{(l)}_{kmn} = f^{(l)}_{kmn} + \Delta\tau\left(\Lambda^{2\alpha} f^{(l)}\right)_{kmn} + \frac{\Delta\tau}{\kappa}(Eg)_{kmn}, \qquad (20)$$
$$h^{(l)}_{kmn} = f^{(l+1)}_{kmn} + \frac{\Delta\tau}{\kappa}\left(ERf^{(l+1)}\right)_{kmn}. \qquad (21)$$
Furthermore, eq. (21) is solved by the iteration
$$f^{(l+1,s+1)}_{kmn} = h^{(l)}_{kmn} - \frac{\Delta\tau}{\kappa}\left(ERf^{(l+1,s)}\right)_{kmn}. \qquad (22)$$
Applying this algorithm to $\frac{1}{2}g_x$, $\frac{1}{2}g_y$, $\frac{1}{2}g_z$ and $g_t$, we have $f_x$, $f_y$, $f_z$ and $f_t$, respectively. Then, using these solutions, we compute the dynamics for optical flow computation. The semi-explicit discretisation of the diffusion equation
$$\frac{\partial \mathbf{u}}{\partial \tau} = -\Lambda^{2\alpha}\mathbf{u} - \frac{1}{\lambda}(\nabla f^\top \mathbf{u} + f_t)\nabla f \qquad (23)$$
for optical flow computation derives the discretisation
$$\left(I + \frac{\Delta\tau}{\lambda} S_{kmn}\right)\mathbf{u}^{(l+1)}_{kmn} = \mathbf{u}^{(l)}_{kmn} - \Delta\tau\left(-\Lambda^{2\alpha}\mathbf{u}^{(l)}\right)_{kmn} - \frac{\Delta\tau}{\lambda}\mathbf{c}_{kmn}, \qquad (24)$$
for $l \ge 0$, where $S_{kmn} = (\nabla f)_{kmn}(\nabla f)^\top_{kmn}$ and $\mathbf{c}_{kmn} = (\partial_t f)_{kmn}(\nabla f)_{kmn}$. Since
$$\left(I + \frac{\Delta\tau}{\lambda} S_{kmn}\right)^{-1} = \frac{1}{1 + \frac{\Delta\tau}{\lambda}\,\mathrm{tr}\,S_{kmn}}\left(I + \frac{\Delta\tau}{\lambda} T_{kmn}\right), \qquad T_{kmn} = \mathrm{tr}\,S_{kmn}\cdot I - S_{kmn}, \qquad (25)$$
we have
$$\mathbf{u}^{(l+1)}_{kmn} = \frac{1}{1 + \frac{\Delta\tau}{\lambda}\,\mathrm{tr}\,S_{kmn}}\left(I + \frac{\Delta\tau}{\lambda} T_{kmn}\right)\mathbf{h}^{(l)}_{kmn}, \qquad (26)$$
$$\mathbf{h}^{(l)}_{kmn} = \mathbf{u}^{(l)}_{kmn} - \Delta\tau\left(P\mathbf{u}^{(l)}\right)_{kmn} - \frac{\Delta\tau}{\lambda}\mathbf{c}_{kmn}. \qquad (27)$$
The operator $\Lambda^\alpha$ is evaluated through the Fourier transform of differential operations, which is easily implemented using the Fast Fourier Transform (FFT) and filter theory [2,3]. We have the relation
$$f^{(\alpha)}(x) = \sum_{n=-\infty}^{\infty} (in)^\alpha a_n \exp(inx) \qquad (28)$$
for $f(x) = f(x+2\pi)$. Let $f_n$ and $F_n$ for $0 \le n \le N-1$ be the discrete Fourier transform pair such that
$$F_n = \frac{1}{\sqrt{N}}\sum_{m=0}^{N-1} f_m \exp\left(-2\pi i\,\frac{mn}{N}\right), \qquad f_n = \frac{1}{\sqrt{N}}\sum_{m=0}^{N-1} F_m \exp\left(2\pi i\,\frac{mn}{N}\right). \qquad (29)$$
Since
$$\frac{1}{2}\left(f_{n+\frac{1}{2}} - f_{n-\frac{1}{2}}\right) = \frac{1}{\sqrt{N}}\sum_{m=0}^{N-1} i\sin\left(\pi\frac{m}{N}\right) F_m \exp\left(2\pi i\,\frac{mn}{N}\right), \qquad (30)$$
(30)
we can compute (Λα f )kmn = √
N
Λ(k m n ) =
N −1
1 3
k ,m ,n =0,
kk + mm + nn Λ(k m n )α Fk m n exp 2πi , N
k m n sin2 π sin2 π sin2 π N N N
(31)
3D Super-Resolution
for Fkmn = √
N −1
1 N
3
f
k m n
k ,m ,n =0,
kk + mm + nn exp 2πi . N
275
(32)
The discrete version of the pyramid transform and its dual are 1
Rfkmn =
wk wm wn f2k−k , 2m−m ,2n−n ,
(33)
k ,m n =−1 2
Efkmn = 23
k ,m ,n =−2
where w±1 = 14 and w0 = (n − n ) are integers.
6
1 2
wk wm wn f k−k , m−m , n−n , 2
2
2
(34)
and the summation is achieved for (k −k ), (m−m ),
Examples
Our main objective is to recover a high-resolution optical flow field from a low-resolution image sequence. We compare the results of our method computed from $g = Rf$ with the optical flow field computed from $f$ using the same optical flow computing algorithm. In our case, we compute the optical flow field of $f$ using the Horn-Schunck method. Setting $\mathbf{u}_d(\mathbf{x},t)$ and $\mathbf{u}_m(\mathbf{x},t)$ to be the optical flow fields obtained as a result of super-resolution and those computed from the original image sequence, respectively, we define the value
$$\theta(\mathbf{x},t) = \cos^{-1}\frac{\mathbf{u}_d^\top \mathbf{u}_m}{|\mathbf{u}_d|\,|\mathbf{u}_m|},$$
which can be defined if the norms of both $\mathbf{u}_d$ and $\mathbf{u}_m$ are nonzero. In the results, we let the error be 0 if $\mathbf{u}_d$ or $\mathbf{u}_m$ is zero. Using $\theta(\mathbf{x},t)$, we evaluated
$$\mathrm{avr}\theta(t) = \frac{1}{|\Omega|}\int_\Omega \theta(\mathbf{x},t)\, dxdydz, \qquad (35)$$
where $|\Omega|$ is the volume of the domain $\Omega$ and $T_{\max}$ is the maximum number of frames of an image sequence. Since the optical flow field is a vector-valued image, the pointwise energy of optical flow is $e(\mathbf{x},t) = |\mathbf{u}(\mathbf{x},t)|^2$, and the energy function of the optical flow field is
$$E(t) = \int_{\mathbf{R}^3} |\mathbf{u}(\mathbf{x},t)|^2\, dxdydz. \qquad (36)$$
The gain of super-resolution optical flow is G(t) = −10 log10
{Energy of the result of super-resolution at time t} . {Energy of the original image at time t}
(37)
Furthermore, we define the norm error between the optical flow field of the ground truth or that computed from the original image sequence u and the ˆ as result of computation u n(x, t) = |ud (x, t) − um (x, t)|
(38)
Fig. 1. Coronal, transverse, and sagittal planes of beating heart images. (a) the original image. (b) the pyramid-transformed image. (c) Super-resolution from the pyramid-transformed image. Our algorithm computes volumetric optical flow of the original image sequence from the pyramid-transformed image sequence.

Table 1. Dimension of 3D image sequences
Sequence        width  height  depth  frames
Beating Heart   256    256     75     20
Lung            166    195     25     13

Table 2. Parameters for computation
K   λ    Δτ   κ
1   0.5  0.1  2^{-4}
and their average in each frame as
$$N(t) = \int_{\mathbf{R}^3} |u_d(x,t) - u_m(x,t)|\,dxdydz. \qquad (39)$$
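A sketch of the evaluation measures (35)–(39) for two discrete flow fields is given below; replacing the integrals by voxel-wise sums, the array layout and the degree conversion are assumptions of the sketch, not taken from the paper.

```python
import numpy as np

# u_d, u_m are assumed to be arrays of shape (X, Y, Z, 3).
def angular_error(u_d, u_m, eps=1e-12):
    nd = np.linalg.norm(u_d, axis=-1)
    nm = np.linalg.norm(u_m, axis=-1)
    cos = (u_d * u_m).sum(axis=-1) / np.maximum(nd * nm, eps)
    theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    theta[(nd == 0) | (nm == 0)] = 0.0        # error set to 0 for zero flow vectors
    return theta.mean()                        # avr theta, eq. (35)

def flow_energy(u):
    return (u ** 2).sum()                      # E(t), eq. (36)

def gain(u_sr, u_orig):
    return -10.0 * np.log10(flow_energy(u_sr) / flow_energy(u_orig))   # eq. (37)

def norm_error(u_d, u_m):
    return np.linalg.norm(u_d - u_m, axis=-1).sum()    # N(t), eqs. (38)-(39)
```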
Tables 1 and 2 list the dimensions of the image sequences and the parameters used for numerical computation. In Figs. 1 and 2, (a), (b), and (c) are the coronal, transverse, and sagittal slices of the original, pyramid-transformed, and super-resolution images of the beating heart MRI images and lung images. Super-resolution volumetric optical-flow computation estimates the spatiotemporal motion field of the images in (a) from the low-resolution images in (b). For comparison, we show the result of variational super-resolution in (c). The result of super-resolution is smoothed and blurred compared to the original image in (a).
Fig. 2. Coronal, transverse, and sagittal planes of the lung image obtained from http://www.vision.ee.ethz.ch/4dmri/. (a) the original image. (b) the pyramid-transformed image. (c) super-resolution from the pyramid-transformed image. Our algorithm computes volumetric optical flow of the original image sequence from the pyramid-transformed image sequence.
[Fig. 3 plots: (a) gain(u_m, u_d), (b) AAE(u_m, u_d), (c) ANE(u_m, u_d) for α = 1.5; (d) gain(u_m, u_d), (e) AAE(u_m, u_d), (f) ANE(u_m, u_d) for α = 2.5. Horizontal axes: frame number; vertical axes: gain, angle error [deg], norm error.]
Fig. 3. α = 1.5, 2.5 for beating heart MRI sequence
Figures 3 and 4 show the evaluation of the gain, the angle error (AAE), and the norm error (ANE) for the real image sequences. These results show that the method effectively computes volumetric optical flow from low-resolution image sequences.
[Fig. 4 plots: (a) gain(u_m, u_d), (b) AAE(u_m, u_d), (c) ANE(u_m, u_d), (d) gain(u_m, u_d), (e) AAE(u_m, u_d), (f) ANE(u_m, u_d). Horizontal axes: frame number; vertical axes: gain, angle error [deg], norm error.]
Fig. 4. α = 1.5, 2.5 for lung MRI sequence
7
Conclusions
We have developed an algorithm for super-resolution optical flow computation, which computes the optical flow vectors on a sequence of high-resolution images from a sequence of low-resolution images, and have shown the convergence property of the algorithm. Our algorithm directly computes the optical flow field of a high-resolution image from the spatial gradient and the temporal derivative of the low-resolution images by combining variational super-resolution and variational optical flow computation. This research was supported by "Computational anatomy for computer-aided diagnosis and therapy: Frontiers of medical image sciences" funded by a Grant-in-Aid for Scientific Research on Innovative Areas, MEXT, Japan, Grants-in-Aid for Scientific Research funded by the Japan Society for the Promotion of Science, and a Grant-in-Aid for Young Scientists (A), NEXT, Japan.
References 1. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. IJCV 67, 141–158 (2006) 2. Davis, J.A., Smith, D.A., McNamara, D.E., Cottrell, D.M., Campos, J.: Fractional derivatives-analysis and experimental implementation. Applied Optics 32, 5943– 5948 (2001)
3. Tseng, C.-C., Pei, S.-C., Hsia, S.-C.: Computation of fractional derivatives using Fourier transform and digital FIR differentiator. Signal Processing 80, 151–159 (2000) 4. Blu, T., Unser, M.: Image interpolation and resampling. In: Handbook of Medical Imaging, Processing and Analysis, pp. 393–420. Academic Press, London (2000) 5. Stark, H. (ed.): Image Recovery: Theory and Application. Academic Press, New York (1992) 6. Wahba, G., Wendelberger, J.: Some new mathematical methods for variational objective analysis using splines and cross-validation. Monthly Weather Review 108, 36–57 (1980) 7. Pock, T., Urschler, M., Zach, C., Beichel, R.R., Bischof, H.: A duality based algorithm for TV-L1 -optical-flow image registration. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 511–518. Springer, Heidelberg (2007) 8. Marquina, A., Osher, S.J.: Image super-resolution by TV-regularization and Bregman iteration. Journal of Scientific Computing 37, 367–382 (2008) 9. Chambolle, A.: An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision 20, 89–97 (2004) 10. Youla, D.: Generalized image restoration by the method of alternating orthogonal projections. IEEE Transactions on Circuits and Systems 25, 694–702 (1978) 11. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–204 (1981) 12. Burt, P.J., Andelson, E.H.: The Laplacian pyramid as a compact image coding. IEEE Trans. Communications 31, 532–540 (1983) 13. Hwan, S., Hwang, S.-H., Lee, U.K.: A hierarchical optical flow estimation algorithm based on the interlevel motion smoothness constraint. Pattern Recognition 26, 939– 952 (1993) 14. Shin, Y.-Y., Chang, O.-S., Xu, J.: Convergence of fixed point iteration for deblurring and denoising problem. Applied Mathematics and Computation 189, 1178– 1185 (2007) 15. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM Computer Surveys 26, 433–467 (1995)
Architectural Style Classification of Building Facade Windows
Gayane Shalunts1, Yll Haxhimusa2, and Robert Sablatnig1
1 Vienna University of Technology, Institute of Computer Aided Automation, Computer Vision Lab
{shal,sab}@caa.tuwien.ac.at
2 Vienna University of Technology, Institute of Computer Graphics and Algorithms, Pattern Recognition and Image Processing Lab
[email protected]
Abstract. Building facade classification by architectural styles allows categorization of large databases of building images into semantic categories belonging to certain historic periods, regions and cultural influences. Image databases sorted by architectural styles permit effective and fast image search for the purposes of content-based image retrieval, 3D reconstruction, 3D city-modeling, virtual tourism and indexing of cultural heritage buildings. Building facade classification is viewed as a task of classifying separate architectural structural elements, like windows, domes, towers, columns, etc, as every architectural style applies certain rules and characteristic forms for the design and construction of the structural parts mentioned. In the context of building facade architectural style classification the current paper objective is to classify the architectural style of facade windows. Typical windows belonging to Romanesque, Gothic and Renaissance/Baroque European main architectural periods are classified. The approach is based on clustering and learning of local features, applying intelligence that architects use to classify windows of the mentioned architectural styles in the training stage.
1
Introduction
Architectural styles are phases of development that classify architecture in the sense of historic periods, regions and cultural influences. Each architectural style defines certain forms, design rules, techniques and materials for building construction. As architectural styles developed from one another, they contain similar elements or modifications of the elements from the earlier periods. An automatic system for classification of building facade images by architectural styles will allow indexing of building databases into categories belonging to certain historic periods. This kind of a semantic categorization limits the search
Supported by the Doctoral College on Computational Perception.
Fig. 1. Different architectural styles in St. Charles’s Church in Vienna
of building image databases to certain category portions for the purposes of building recognition [1, 2], Content Based Image Retrieval (CBIR) [3], 3D reconstruction, 3D city-modeling [4] and virtual tourism [5]. An architectural style classification system may also find application in tourism if provided on smart phones. To the best knowledge of the authors there is no automatic system for classification of building facade images by architectural styles. Building facade images from online image databases either do not have any labels related to architectural styles or such labels are inaccurate. If observers do not have the knowledge to classify architectural styles themselves, they have to search for the name of the building in the image in order to find out its architectural style. If the building image does not have any annotations, it is impossible for the observer to find out to which architectural style the building belongs. An automatic system for classification of architectural styles will solve this task. Architectural style classification of the whole building is viewed as a voting mechanism of separate architectural elements, such as windows, domes, towers, columns, etc. This approach allows facade architectural style classification by a single structural element, for example a window, in the case of partly occluded facades. It is also appropriate for facades which are a mixture of architectural styles. In the case of voting for different architectural styles by different architectural elements, the more important architectural elements are given heavier weights while voting. A typical example of a building designed in different architectural styles is St. Charles's Church in Vienna (Fig. 1), which includes Roman columns, a Classic columned portico and a Baroque dome. In this case the dome should be given a heavier weight than the columns and portico, as St. Charles's Church is considered a Baroque church. In the scope of the facade architectural style classification task by a voting mechanism of structural elements, the current paper focuses on the classification of typical facade windows of the main successive European architectural styles:
Fig. 2. Romanesque windows: a) single arch, b) double arch, c) triple arch
– Romanesque (1050 A.D. - 1200 A.D.)
– Gothic (1150 A.D. - 1500 A.D.)
– Renaissance (1420 A.D. - 1550 A.D.)
– Baroque (1550 A.D. - 1750 A.D.)
As there are methods like [6–8] for detection of windows on building facades, the current paper operates on an image database of bounding boxes of windows. Our approach is based on the fact that each architectural style applies certain geometrical rules for style typical window construction. This means that certain gradient directions are dominating in each window class. The methodology is based on clustering and learning of the local features to find out the image dominant gradient directions and thus categorize the classes of different architectural styles. Our system yields a classification rate of 95.16% while categorizing 3 architectural styles and 8 intra-class types. The paper is organized as follows: Section 2 shows typical windows of Romanesque, Gothic, Renaissance/Baroque architectural styles which are classified. Section 3 explains the chosen method for the classification of the mentioned window types. The experiments and results of the classification are presented in Section 4. And finally Section 5 concludes the paper.
2
Typical Windows of the Classified Architectural Styles
For architectural style classification of windows, typical window examples of the Romanesque, Gothic and Renaissance/Baroque architectural periods are chosen. The characteristic feature of Romanesque windows is the single, double or triple round arch (Fig. 2a, b and c, respectively), while the Gothic style is very distinct with pointed arches (Fig. 3a) and rose windows (Fig. 3b). For the Baroque style, window decorations like triangular and segmental pediments (Fig. 4a and b, respectively) and balustrades (Fig. 4c) are characteristic. As Baroque evolved from Renaissance, windows with triangular and segmental pediments and balustrades are also present on Renaissance buildings. In the case of
Fig. 3. Gothic windows: a) Gothic pointed arch, b) Gothic rose
the mentioned window types, other features should be taken into account to differentiate between the Baroque and Renaissance styles. Such features may be depth information, as Renaissance is considered 'planar classicism' and Baroque 'sculpted classicism', or the analysis of the whole building facade structure. Our method overall categorizes 3 window classes:
– Romanesque
– Gothic
– Baroque
and 8 intra-class types - Romanesque single, double and triple round arch windows; Gothic pointed arch and rose windows; Baroque windows with triangular pediments, segmental pediments and balustrades. We classify between the 3 stated architectural classes, but not the 8 intra-class types, as our objective is architectural style classification. In the scope of the architectural style classification task, architectural revivalism should be mentioned, which is a phenomenon of imitation of past architectural styles. The singularity of 19th century revivalism, as compared with earlier revivals, was that it revived several kinds of architecture at the same time [9]. These revived styles are also referred to as neo-styles, e.g. Gothic Revival is also referred to as neo-Gothic. Our approach does not differentiate between original and revival architectural styles, as visual information alone is not enough for such a discrimination. Additional information related to building date, location and materials is needed to differentiate between original and revival architectural styles.
3
Bag of Words for Facade Window Classification
The task of classification of windows by architectural styles is highly complex, because of the high intra-class diversity as well as reflections present in window images. One can use different texture features [10, 11] as well as established shape descriptors [12, 13]. In this work we use a local feature-based approach, since it
Fig. 4. Baroque windows: a) triangular pediment, b) segmental pediment, c) balustrade
incorporates texture and gradients into an image descriptor. It is shown in [14] that shapes can be represented by local features (peaks and ridges). Since certain gradient directions dominate the window shapes of each class, we use local features to describe shapes. One can use different local features, like Harris-Laplacian corner detectors [15, 16], difference of Gaussians corner detectors [17] or detectors based on regions [18, 19] and local image descriptors [17–19]. The goal is to extract characteristic gradient directions, like those describing a pointed arch or a triangular pediment (Figs. 2, 3, 4), and to minimize the influence of non-relevant features, like those from reflections and curtains. Classifying windows will be preceded by a method to classify facades; thus we choose the standard bag of words approach presented by Csurka et al. [20] (Fig. 5). In the learning phase the Scale Invariant Feature Transform (SIFT) [17] is used to extract the information of gradient directions. After performing the difference of Gaussians on different octaves and finding minima/maxima, i.e. finding interest points, we only perform rejection of interest points with low contrast by setting a low threshold. All interest points that lie on window edges
[Fig. 5 diagram – learning: images (data set) → local image features and descriptors (e.g. SIFT) → clustering (e.g. k-means) → visual words (codebook) → representing images by histograms → category models & classifiers; classification: images (queries) → local image features and descriptors (e.g. SIFT) → representing images by histograms → category decision → architectural style.]
Fig. 5. Learning visual words and classification scheme
[Fig. 6 histograms: a) Romanesque (Fig. 2a), b) Gothic (Fig. 3a), c) Baroque (Fig. 4c); horizontal axes: visual word bins 0–90, vertical axes: responses 0–1.]
Fig. 6. Histograms of visual words for the images of different window styles
are kept. Note that we do not follow the original work [17] in this step, i.e. we do not suppress the response of the filter along the edges. After finding the interest points we proceed to finding local image descriptors (SIFT image descriptors) and normalizing them. The number of local features is large, thus we use clustering to learn a visual vocabulary (codebook). The codebook of separate classes is made by searching for the visual cluster centers using unsupervised k-means algorithm. The codebook is learnt on a training set. The classification of a query image follows similar steps (Fig. 5). After extracting local image features and descriptors, the histogram representation is built up by using the codebook learnt on the training stage (Fig. 6). Our category model is simple: it is the sum of all histogram responses for each class (integrated response). As our category model yields acceptable results (Sec. 4), we refrain from using a classifier for building a model. The image window class is determined by finding the maxima of integrated responses of the three classes. For example, for the histogram representation shown Fig. 6a, the sum of all responses of Romanesque class is 5.6038, Gothic class – 1.8868 and Baroque class – 2.3019. Thus the image is classified as Romanesque. The histograms shown in Fig. 6 are built using a
[Fig. 7 plot: classification accuracy (80–100%) on the vertical axis vs. SIFT peak threshold p = 0.01–0.05 on the horizontal axis, one curve per codebook size k = 25, 30, 35, 40, 45.]
Fig. 7. Classification accuracy. Finding the best size of codebook (k) and SIFT peak threshold (p – horizontal axes).

Table 1. Classification accuracy on the training set with different codebook sizes
Peak Threshold (p)  k = 25  k = 30  k = 35  k = 40  k = 45
0,01                85,56   91,11   94,44   92,22   90,00
0,02                88,89   93,33   93,33   95,56   97,78
0,03                92,22   96,67   96,67   97,78   98,89
0,04                87,78   88,89   96,67   93,33   94,44
0,05                84,44   92,22   93,33   93,33   91,11
codebook of 30 cluster centers for each class. Note that for the Romanesque class histogram the high responses are located on the bins from 61 to 90, for the Gothic class - from 1 to 30 and for the Baroque class - from 31 to 60. The category model based on the maxima of the integrated class responses proves to be effective, as it makes the vote for the right class strong by integration of the high responses and suppresses the false class peaks, which may occur due to irrelevant descriptors located on architectural details, reflections and curtains.
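The pipeline described above can be sketched as follows, assuming OpenCV's SIFT (with contrastThreshold standing in for the SIFT peak threshold) and scikit-learn's k-means are available; the helper names, the histogram normalisation and the slicing of the stacked codebook are assumptions of this sketch, not the authors' implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path, peak_threshold=0.03):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create(contrastThreshold=peak_threshold)
    _, desc = sift.detectAndCompute(img, None)
    return desc

def build_class_codebook(class_descriptors, k=30):
    # k-means over the descriptors of one class; call once per class and
    # stack the three sets of centres into the full codebook
    return KMeans(n_clusters=k).fit(np.vstack(class_descriptors)).cluster_centers_

def classify(image_path, codebook, class_slices):
    desc = sift_descriptors(image_path)
    # assign every descriptor to its nearest visual word
    dists = np.linalg.norm(desc[:, None, :] - codebook[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook)).astype(float)
    hist /= hist.max()
    # integrated response: sum of histogram bins belonging to each class
    scores = {name: hist[sl].sum() for name, sl in class_slices.items()}
    return max(scores, key=scores.get)

# class_slices maps each style to its rows in the stacked codebook, e.g.
# {"Gothic": slice(0, 30), "Baroque": slice(30, 60), "Romanesque": slice(60, 90)}
```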
4
Experiments of Window Classification and Discussion
To the best knowledge of the authors there is no image database labeled by architectural styles. For testing and evaluation of our methodology we created a database of 400 images, 351 of which belong to our own and the rest to the Flickr1 image datasets. 90 images of the database make up the training set (1/3 of each class). The resolution range of the images is from 138 × 93 to 4320 × 3240 pixels. To evaluate the issue of the codebook size (vocabulary size) we have performed an experiment with different codebook sizes (k) (Tab. 1 and Fig. 7). The value of
1
http://www.flickr.com
Table 2. Confusion matrix and the accuracy rate in parenthesis
            Gothic       Baroque      Romanesque  Sum
Gothic      100 (98.1%)  1            1           102
Baroque     3            111 (92.5%)  6           120
Romanesque  1            3            84 (95.4%)  88
Sum         104          115          91          310
[Fig. 8 panels: a) Baroque window image, b) histogram of visual word responses (bins 0–100, values 0–1).]
Fig. 8. False classification of Baroque into Gothic window
peak threshold for SIFT feature extraction and the value of k for the k-means clustering algorithm are searched so that the final classification rate is maximised on the training set. As is obvious from Fig. 7, SIFT peak threshold values larger than 0.03 decrease the classification rate. The reason for this is that the extraction of a bigger number of SIFT descriptors than that with a peak threshold value equal to 0.03 tends to extract descriptors located on window reflections and background construction material textures, i.e. we are overfitting. Whereas peak threshold values smaller than 0.03 decrease the number of extracted SIFT descriptors describing the dominating gradients characteristic for each window class. Fig. 7 also shows that the best choice for the k-means algorithm k parameter is in the range 25−45. We choose to take k = 30. The k parameter values smaller than 25 decrease the classification rate, as the number of cluster centers is not enough for the discrimination of visual words of different classes. Whereas values higher than 45 make the image histograms sparser, i.e. we get non-representative visual words. Our final codebook choice for testing the system is the one corresponding to k = 30 and peak threshold equal to 0.03. Running the classification with the mentioned codebook on a testing dataset of 310 images results in 15 falsely classified images, which yields an average classification rate of 95.16%. A confusion matrix, with true positives, is given in Tab. 2. Fig. 8 shows an example of a false classification of a Baroque window into Gothic. The sum of all responses of the Romanesque class is 8.4324,
Gothic class – 9.6757 and Baroque class – 8.7568. Therefore this image is classified as Gothic, since the maximum response is 9.6757. The reason for the false classification is the high complexity of architectural details and curtains. As our approach uses SIFT features for classification, it is rotation and scale invariant [17]. The experiments also prove that the approach is camera viewpoint invariant, as the classification of windows is accurate under high perspective distortions.
5
Conclusion
Virtual tourism, 3D building reconstruction, 3D city-modeling, CBIR and indexing of cultural heritage buildings operate on large image databases. The classification of such building databases into semantic categories belonging to certain architectural styles limits the image search of the whole databases to semantic portions. Also smart phones equipped with a building architectural style classification system may be applicable in the field of real tourism. A method for window classification of Romanesque, Gothic and Renaissance/ Baroque European main architectural styles was presented. In the scope of facade architectural style classification task by a voting mechanism of structural elements, like windows, domes, towers, columns, etc., the current paper purpose was to classify the architectural style taking into account only windows. Our approach is based on clustering and learning of local features. The experiments prove that the proposed approach yields a high classification rate. Future work in the context of architectural style classification of building facades includes analysis of the images, which had a false classification due to high complexity of architectural details and curtains in order to eliminate false classifications, classification of windows on complete facade images, classification of other building structural elements, raising the number of classified architectural styles, use of symmetry feature descriptors and realization of a voting mechanism of different structural elements. The proposed methodology can be used for architectural style classification of other structural parts, like domes, towers, columns, etc.
References 1. Zheng, Y.T., Zhao, M., Song, Y., Adam, H., Buddemeier, U., Bissacco, A., Brucher, F., Chua, T.S., Neven, H.: Tour the world: building a web-scale landmark recognition engine. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, pp. 1085–1092 (2009) 2. Zhang, W., Kosecka, J.: Hierarchical building recognition. Image and Vision Computing 25(5), 704–716 (2004) 3. Li, Y., Crandall, D., Huttenlocher, D.: Landmark classification in large-scale image collections. In: Proceedings of IEEE 12th International Conference on Computer Vision, pp. 1957–1964 (2009) 4. Cornelis, N., Leibe, B., Cornelis, K., Gool, L.V.: 3d urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision 78, 121–141 (2008)
5. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3d. ACM Transaction on Graphics 25, 835–846 (2006) 6. Ali, H., Seifert, C., Jindal, N., Paletta, L., Paar, G.: Window detection in facades. In: 14th International Conference on Image Analysis and Processing (ICIAP 2007). Springer, Heidelberg (2007) 7. Recky, M., Leberl, F.: Windows detection using k-means in cie-lab color space. In: ¨ Unay, D., C ¸ ataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 356–360. Springer, Heidelberg (2010) 8. Recky, M., Leberl, F.: Window detection in complex facades. In: European Workshop on Visual Information Processing (EUVIP 2010), pp. 220–225 (2010) 9. Collins, P.: Changing Ideals in Modern Architecture, pp. 1750–1950. McGillQueen’s University Press (1998) 10. Ojala, T., Pietikinen, M., M¨ aenp¨ aa ¨, T.: Multiresolution grayscale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 971–987 (2002) 11. Haralick, R.M.: Statistical and structural approaches to texture. Proc. IEEE 67, 786–804 (1979) 12. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37, 1–19 (2004) 13. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 509–522 (2002) 14. Crowley, J.L., Parker, A.C.: A representation for shape based on peaks and ridges in the difference of lowpass transform. IEEE Trans. on Pattern Analysis and Machine Intelligence 6(2), 156–170 (1984) 15. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of The Fourth Alvey Vision Conference, pp. 147–151 (1998) 16. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: Internationl Conference in Computer Vision, pp. 525–531 (2001) 17. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 18. Matas, J., Chum, O., Urban, M., Pajdla1, T.: Robust wide baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference, pp. 384–393 (2002) 19. Tuytelaars, T., Gool, L.V.: Wide baseline stereo matching based on local, affinely invariant regions. In: British Machine Vision Conference, pp. 412–425 (2000) 20. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: ECCV 2004, pp. 1–22 (2004)
Are Current Monocular Computer Vision Systems for Human Action Recognition Suitable for Visual Surveillance Applications? Jean-Christophe Nebel, Michał Lewandowski, Jérôme Thévenon, Francisco Martínez, and Sergio Velastin Digital Imaging Research Centre, Kingston University, London Kingston-Upon-Thames, KT1 2EE, UK {J.Nebel,M.Lewandowski,J.Thevenon,F.Martinez,S.Velastin} @kingston.ac.uk
Abstract. Since video recording devices have become ubiquitous, the automated analysis of human activity from a single uncalibrated video has become an essential area of research in visual surveillance. Despite variability in terms of human appearance and motion styles, in the last couple of years, a few computer vision systems have reported very encouraging results. Would these methods be already suitable for visual surveillance applications? Alas, few of them have been evaluated in the two most challenging scenarios for an action recognition system: view independence and human interactions. Here, first a review of monocular human action recognition methods that could be suitable for visual surveillance is presented. Then, the most promising frameworks, i.e. methods based on advanced dimensionality reduction, bag of words and random forest, are described and evaluated on IXMAS and UT-Interaction datasets. Finally, suitability of these systems for visual surveillance applications is discussed.
1 Introduction Nowadays, video surveillance systems have become ubiquitous. Those systems are deployed in various domains, ranging from perimeter intrusion detection, analysis of customers’ buying behaviour to surveillance of public places and transportation systems. Recently, the acquisition of activity information from video to describe actions and interactions between individuals has been of growing interest. This is motivated by the need for action recognition capabilities to detect, for example, fighting, falling or damaging property in public places since the ability to alert security personnel automatically would lead to a significant enhancement of security in public places. In this paper, we review human action recognition systems which have been evaluated against datasets relevant to video surveillance, i.e. approaches that are designed to operate with monocular vision and that would function regardless of the individual camera perspective the action is observed at. Further, we evaluate three of the most promising approaches on both view independent and human interaction scenarios. Finally, we conclude on their suitability for video surveillance applications (VSA).
2 Review The KTH [15] and Weizzman [36] databases have been used extensively for benchmarking action recognition algorithms. However, not only do they no longer constitute a challenge to the most recent approaches, but they do not possess the required properties to evaluate if a system is suitable for VSA. Ideally, such dataset should be able to test systems on view independent scenarios involving human interactions. Although no dataset combines such level of complexity with sufficient data to train machine learning algorithms, IXMAS [33] is view independent and UT-Interaction [27] offers a variety of interactions between two characters. A few approaches have been evaluated on view independent scenarios. Accurate recognition has been achieved using multi-view data with either 3D exemplar-based HMMs [34] or 4D action feature models [37]. But, in both cases performance dropped significantly in a monocular setup. This was addressed successfully by representing videos using self-similarity based descriptors [12]. However, this technique assumes a rough localisation of the individual of interest which is unrealistic in many VSA. Similarly, the good performance of a SOM based approach using motion history images is tempered by the requirement of segmenting characters individually [23]. Three approaches have produced accurate action recognition from simple extracted features and could be suitable in a VSA context: two of them rely on a classifier, either SVM [20] or Maximisation of Mutual Information [13], trained on bags of words and the other one is based on a nonlinear dimensionality reduction method designed for time series [19]. Unfortunately none of these techniques has been tested with interactions. Actually, only one approach, which relies on a classifier based on a random forest [32], has been reported to tackle the Ut-Interaction dataset. However, its ability to handle view independent scenarios is unknown. This review on human action recognition systems demonstrates the dynamism of the field. However, it also highlights that currently no approach has been evaluated on the two most relevant and challenging scenarios for a visual surveillance system: view independence and human interactions. In this study, the three action recognition approaches with the most potential to tackle successfully those scenarios, i.e. advanced dimensionality reduction, bag of words and random forest, are implemented and evaluated.
Fig. 1. Training frameworks of the three methods of interest
3 Promising Approaches 3.1 Temporal Extension of Laplacian Eigenmaps Action recognition usually relies on associating a high dimensional video descriptor with an action class. In order to make this classification task more manageable, frameworks based on dimensionality reduction techniques have been proposed [1, 3, 6, 10, 18, 26, 30, 31]. However, they cannot handle large variations within a dataset such as an action performed by different people and, therefore, fail to capture the intrinsic structure of an action. To deal with this fundamental issue, a Temporal extension of Laplacian Eigenmaps (TLE) has been recently proposed [19]. TLE is an unsupervised nonlinear method for dimensionality reduction designed for time series data. It aims not only to preserve the temporal structure of data describing a phenomenon, e.g. a specific action, but also to discard the ‘stylistic’ variation found in different instances of that phenomenon, e.g. different actors performing a given action. First, time series data representing a given phenomenon are locally aligned in the high dimensional space using dynamic time warping [25]. Then, two types of constraints are integrated in the standard Laplacian Eigenmaps framework [39]: preservation of temporal neighbours within each time series, and preservation of local neighbours between different time series as defined by their local alignment. Within the context of action recognition, TLE is used to produce a single generic model for each action seen from a given view [19]. As shown on the first row of Fig. 1, this is achieved by, first, extracting characters’ silhouettes from each frame of a video to produce a 3D silhouette. Then, video descriptors are produced for the 3D salient points detected using the solutions of the Poisson’s equation [8]. Finally, TLE is applied to all video descriptors associated to a given action in order to produce an action manifold of dimension 2. Once action manifolds have been produced for each action of interest, action recognition is achieved by projecting the video descriptors of the video to classify in each action manifold. Then, the dynamic time warping metric [25] is used to establish which action descriptor describes best the video of interest. In a view-independent action recognition scenario, this scheme needs to be extended. In principle, a different action manifold can be produced for every view of interest. However, if training data are available in the form of an action visual hull [33], a unique manifold of dimension 3 can be built to model an action independently from the view [18]. 3.2 Bag of Words Bag of Words (BoW) is a learning method which was used initially for text classification [11]. It relies on, first, extracting salient features from a training dataset of labelled data. Then, these features are quantised to generate a code book which provides the vocabulary in which data can be described. This approach has become a standard machine learning tool in computer vision and in the last few years, action recognition frameworks based on Bags of Words have become extremely popular [4, 7, 9, 14, 21, 22, 24, 28, 29]. Their evaluation on a variety of datasets including film-based ones [17] demonstrates the versatility of these approaches.
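Returning to the embedding step that underlies TLE (Sect. 3.1), a plain Laplacian Eigenmaps projection of per-frame descriptors can be sketched as below; TLE additionally forces temporal neighbours within a sequence and DTW-aligned frames across sequences to be adjacent in the graph, whereas the simple k-nearest-neighbour graph used here is an assumption of the sketch, not the authors' implementation.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

# X: per-frame descriptors, shape (n_frames, n_dims); returns coordinates
# of each frame in a low-dimensional manifold (dim = 2 or 3 as in the text).
def laplacian_eigenmaps(X, n_neighbors=10, dim=2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros_like(d2)
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]
    for i, js in enumerate(nn):
        W[i, js] = W[js, i] = 1.0          # symmetric adjacency graph
    L = laplacian(W, normed=True)
    # smallest non-trivial eigenvectors give the embedding coordinates
    vals, vecs = eigh(L)
    return vecs[:, 1:dim + 1]
```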
In this study, we based our implementation on that proposed by [5]. As shown on the second row of Fig. 1, first, an action bounding box is extracted from each video frame to produce a 3D action bounding box. Then salient feature points are detected by a spatio-temporal detector (Harris 3D) and described by a histogram of optical flow (STIP) [16]. Once feature points are extracted from all training videos, the k-means algorithm is employed to cluster them into k groups, where their centres are chosen as group representatives. These points define the codebook which is used to describe each video of the training set. Finally, those video descriptors are used to train an SVM classifier with a linear kernel. In order to recognise the action performed in a video, the associated STIP based descriptor is generated. Then it is fed into the SVM classifier, which labels the video. 3.3 Random Forest In 2001, Breiman introduced the concept of random forests which are defined as “a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest” [2]. This machine learning approach has the appealing property that random forests do not overfit when more trees are added, but converge towards a specific generalisation error. In the last couple of years, this new scheme has been exploited to classify human actions using a Hough transform voting framework [38] and [32]. First, densely-sampled feature patches based on gradients and optical flow are produced. Then, random trees are trained to learn a mapping between these patches and their corresponding values in a spatiotemporal-action Hough space. Finally, a voting process is used to classify actions. The third row of Fig. 1 summarises our implementation which follows [38]. First, 3D action bounding boxes are generated for all training videos. Secondly, 5000 random 3D patches of size 16x16x5 are extracted from each box to produce video descriptors. Patches are described by 8 low-level features, i.e. Lab colour space, absolute value of the gradients in x, y and time and optical flow in x and y, and their relative spatiotemporal position from the centre of the bounding box. Then, video descriptors and labels are used to generate a random forest comprised of 5 trees [38]. Each node of the binary decision trees is built by choosing randomly a binary test, minimising the average entropy of the patches arriving at the node and splitting the training patches according to the test results. A random binary test compares the values of two randomly selected pixels in a patch according to a randomly selected feature. The process of action recognition relies on producing an exhaustive set of patches from the video of interest and passing them through each tree of the forest. Decisions reached by each patch in each tree are then collected and used to vote for the label to attribute to the video.
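The random binary node test described above can be sketched as follows; the patch layout (16 × 16 × 5 voxels with 8 low-level feature channels) follows the text, while the helper names and the entropy bookkeeping are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

# e.g. rng = np.random.default_rng(0); test = sample_binary_test(rng)
def sample_binary_test(rng, patch_shape=(16, 16, 5, 8)):
    p1 = tuple(rng.integers(0, s) for s in patch_shape[:3])
    p2 = tuple(rng.integers(0, s) for s in patch_shape[:3])
    channel = rng.integers(0, patch_shape[3])
    return p1, p2, channel

def apply_binary_test(patch, test):
    # compare two randomly selected pixels in a randomly selected channel
    p1, p2, channel = test
    return patch[p1 + (channel,)] > patch[p2 + (channel,)]

def split_entropy(labels_left, labels_right):
    # average class entropy of the two children, used to pick the best test
    def entropy(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    n = len(labels_left) + len(labels_right)
    return (len(labels_left) * entropy(labels_left)
            + len(labels_right) * entropy(labels_right)) / n
```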
4 Performance on View Independent Scenario 4.1 Dataset and Experimental Setup The publicly available multi-view IXMAS dataset is considered as the benchmark for view independent action recognition methods [33]. It is comprised of 13 actions,
performed by 12 different actors. Each activity instance was recorded simultaneously by 5 calibrated cameras (4 side and 1 top views), and a reconstructed 3D visual hull is provided. Since no specific instruction was given to actors regarding their position and orientation, action viewpoints are arbitrary and unknown. Although this dataset has been used in the past in the context of action recognition from multiple cameras, i.e. several views were used to make a final decision regarding the action class [18, 20, 34, 37], here only 1 camera view is used in the testing stage to classify an action. Sequences of object descriptors (i.e. silhouettes or bounding boxes) for each acquired view are provided for each segmented action. To generate a view independent manifold for the TLE approach, the animated visual hulls are projected onto 12 evenly spaced virtual cameras located around the vertical axis of the subject [18]. In line with other evaluations [18, 20, 37], the poorly discriminative top view data were discarded. As usual on this dataset, experiments are conducted using the leaveone-actor-out strategy. In each run, one actor is selected for testing and all data which do not involve that actor are used for training. Then, all actions performed for that actor are evaluated independently for each of the 4 views. This process is then repeated for each actor. Finally, the average accuracy obtained under this scheme is calculated (see Table 1). Note that whereas TLE and RF used default parameters, performance for BoW is shown with the size of the code book and the margin of the SVM classifier optimised for a specific dataset. 4.2 Results Table 1 displays for each approach the nature of its input feature, its average accuracy and its processing time per frame on a workstation with a single 3GHz cpu and 9GB of RAM. In addition, we include performance reported for an action recognition method based on an extension of BoW where a dense grid is used instead of salient points [35]. In terms of accuracy, TLE performs best, achieving a performance which is lower than the state of the art [35]. Fig. 2 shows the associated confusion matrix which highlights that classification errors tend to occur only between similar actions, e.g. punch and point. RF results are quite poor: it seems to suffer more from low resolution data than BoW. Whereas the number of BoW descriptors decreases with low resolution data, their intrinsic quality remains high since they are based on salient points. On the other hand, the random process which is used to select patches produces RF descriptors whose informative value degrades with image resolution. Table 1. Performances obtained on IXMAS dataset TLE Input Accuracy Processing time Training Testing
BoW
Silhouettes 73.2%
63.9%
3.8s 215s
0.42s 0.42s
RF
Bounding boxes Grid[35] ~85% 54.0% NA
5.03s 1.65s
Are Current Monocular Computer Vision Systems
295
Fig. 2. Confusion matrix obtained with TLE
Although our TLE implementation was developed using Matlab, whereas the others relied on C++, this does not explain its extremely slow processing time during the recognition phase. In fact, recognition is based on discovering the best fitting of the projection of the video descriptor on continuous 3D action models. This relies on an optimisation procedure which is particularly computationally expensive since it attempts to identify the optimal view for each class manifold. On the other hand, BoW is much faster since it only requires the classification of extracted features using a linear SVM classifier.
5 Performance on Interaction Scenario 5.1 Dataset and Experimental Setup The UT-Interaction dataset was released for the High-level Human Interaction Recognition Challenge [27]. This dataset is currently the most complete in terms of actions involving interactions and size to train algorithms. All videos are captured from a single view and show interactions between two characters seen sideways. It is composed of 2 parts (Dataset 1 & 2) with different character’s resolution (260 against 220 pixels) and background (Dataset 1’s is more uniform). Since only sequences of action bounding boxes are provided, silhouettes needed to be generated. A standard foreground extraction method was used and its output was cropped using the available action bounding boxes. Experiments were conducted using two different evaluation schemes: leave-one-out cross validation where 90% of Dataset 1 (D1), respectively Dataset 2 (D2), was used for training and the remaining 10% of the same dataset were used for testing; and a strategy where one dataset is used for training (Tr) and the other one for testing (Te). In addition, in order to evaluate the impact on BoW of the selection of the code book size and SVM margin, accuracy was also measured on D1 for various values of those two parameters.
5.2 Results Performances are displayed in Table 2. Processing time per frame was measured for experiment D1 on the workstation described in Section 4.3. In addition, we include accuracy reported for an action recognition method based on an extension of RF where a tracking framework is used to produce one bounding box per character involved in the action [32]. Such scheme allows performing action recognition on each character separately and then combining that information to predict the nature of the interaction. It is the current state of the art on this dataset. BoW performs well with accuracy values similar to those reported in the state of the art [32] despite a much simpler feature input. The associated confusion matrix on Fig. 3 reveals as previously the difficulty of classifying the punch and point actions. Further results (not shown) highlight the reliance of BoW on the appropriate selection of parameters: accuracy varies within a very wide range, i.e. 45-75%, depending on the values of code book size and SVM margin. In this scenario, although TLE had to be operated with suboptimal silhouettes (in particular in D2 where the more complex background degrades performance of foreground extraction), it still performs well. Since RF relies on HOG features, which are position-dependent, its accuracy is quite poor when a unique bounding box is used for a whole action. On the other hand, as [32] showed, the availability of a box per character allows the optimal utilisation of RF in this scenario. In terms of processing time, BoW confirms its real-time potential. TLE is still slow, but its testing time is significantly faster than previously since the view is known. Table 2. Performances obtained on UT-Interaction dataset
                            TLE          BoW             RF              Tracking [32]
Input                       Silhouettes  Bounding boxes  Bounding boxes  Bounding boxes
Accuracy: D1                74.6%        78.3%           45%             ~80%
Accuracy: D2                66.7%        80.0%           NA              NA
Accuracy: TrD1-TeD2         75.0%        73.3%           30%             NA
Accuracy: TrD2-TeD1         61.0%        61.7%           NA              NA
Processing time (Training)  10.5s        0.25s           NA              NA
Processing time (Testing)   9.7s         0.13s           NA              NA

Fig. 3. Confusion matrix obtained with BoW for D1
6 Discussion and Conclusions Performances obtained on both View Independent and Interaction Scenarios inform us on the state-of-the-art current potential regarding the usage of human action recognition methods in visual surveillance applications. First, in both sets of experiments, best performances display accuracy in the 70-80% range. TLE appears to be quite consistent and able to perform at slightly lower resolution than our BoW implementation. This can be partially explained by the fact that TLE benefits from the extraction of more advanced features (i.e. silhouettes instead of bounding boxes). On the other hand, work by [35] suggests that the BoW approach would perform better at lower resolution if a dense grid instead of salient points was used to produce video descriptors. The approach based on Random Forest is clearly the least accurate in its present form. Although the integration of a tracking approach should significantly improve its performances in the interaction scenario [32], automatic initialisation would be required for VSA. Moreover, poor performance with the IXMAS dataset indicates that its feature vectors are very sensitive to image resolution. This could be improved by using, for example, advanced silhouette based descriptors [8]. In terms of processing time, the approach based on TLE is slower by 2-3 orders of magnitude than that based on Bag of Words. Although Matlab is usually less computationally efficient than C++, we do not believe this explains that significant difference. TLE has a much higher intrinsic complexity which could not be reduced without fundamental changes in the approach. On the other hand, BoW clearly demonstrates real time potential. In the case of RF, it is more difficult to judge, especially as some substantial alterations are required to make it perform as well as the others. As a whole, a Bag of Words based action recognition framework appears to be currently the best choice for real-time visual surveillance applications. However, this approach relies on a set of parameters which are essential to good performance. In situations where the scene's properties are relatively stable over time, parameter values could be accurately learned during the training phase. However, generally they would need to be dynamically updated according to the actual scene environment. This is still an area which needs investigation. All approaches investigated require a segmentation of the people involved in the action either at pixel level (TLE) or bounding box (BoW and RF) levels. This is a task which is not solved yet, especially when people density in a scene is high. As of now, it is unclear how robust the action recognition approaches are concerning segmentation quality and occlusions. Furthermore, more tests would be required to evaluate how they cope with actions performed at different speeds. We conclude that neither of the approaches investigated in this paper has been shown to solve the challenge of action recognition. The investigated actions were quite basic (e.g. kick, punch, pick up, hug) and in simple surroundings, and even in such a scenario, their performances are far from satisfactory.
References 1. Blackburn, J., Ribeiro, E.: Human motion recognition using isomap and dynamic time warping. In: Elgammal, A., Rosenhahn, B., Klette, R. (eds.) Human Motion 2007. LNCS, vol. 4814, pp. 285–298. Springer, Heidelberg (2007) 2. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001) 3. Chin, T., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for human activity recognition. In: ICIP 2007 (2007) 4. Cheng, Z., Qin, L., Huang, Q., Jiang, S., Tian, Q.: Group Activity Recognition by Gaussian Processes Estimation. In: ICPR 2010 (2010) 5. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision at ECCV 2004, pp. 1–22 (2004) 6. Fang, C.-H., Chen, J.-C., Tseng, C.-C., Lien, J.-J.J.: Human action recognition using spatio-temporal classification. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5995, pp. 98–109. Springer, Heidelberg (2010) 7. Gilbert, A., Illingworth, J., Bowden, R.: Fast Realistic Multi-Action Recognition using Mined Dense Spatio-temporal Features. In: ICCV 2009 (2009) 8. Gorelick, L., Galun, M., Sharon, E., Basri, R., Brandt, A.: Shape representation and classification using the poisson equation. PAMI 28(12), 1991–2005 (2006) 9. Hu, Y., Cao, L., Lv, F., Yan, S., Gong, Y., Huang, T.S.: Action Detection in Complex Scenes with Spatial and Temporal Ambiguities. In: ICCV 2009 (2009) 10. Jia, K., Yeung, D.: Human action recognition using local spatio-temporal discriminant embedding. In: CVPR 2008 (2008) 11. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features In: ECML 1998 (1998) 12. Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from temporal self-similarities. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 293–306. Springer, Heidelberg (2008) 13. Kaaniche, M.B., Bremond, F.: Gesture Recognition by Learning Local Motion Signatures. In: CVPR 2010 (2010) 14. Kovashka, A., Grauman, K.: Learning a Hierarchy of Discriminative Space-Time Neighborhood Features for Human Action Recognition. In: CVPR 2010 (2010) 15. The KTH Database, http://www.nada.kth.se/cvap/actions/ 16. Laptev, I.: On Space-Time Interest Points. International Journal of Computer Vision 64(2/3), 107–123 (2005) 17. Laptev, I., Perez, P.: Retrieving Actions in Movies. In: ICCV 2007 (2007) 18. Lewandowski, M., Makris, D., Nebel, J.-C.: View and style-independent action manifolds for human activity recognition. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6316, pp. 547–560. Springer, Heidelberg (2010) 19. Lewandowski, M., Martinez, J., Makris, D., Nebel, J.-C.: Temporal Extension of Laplacian Eigenmaps for Unsupervised Dimensionality Reduction of Time Series. In: ICPR 2010 (2010) 20. Liu, J., Ali, S., Shah, M.: Recognizing human actions using multiple features. In: CVPR 2008 (2008) 21. Natarajan, P., Singh, V.K., Nevatia, R.: Learning 3D Action Models from a few 2D videos for View Invariant Action Recognition. In: CVPR 2010 (2010) 22. Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 392–405. Springer, Heidelberg (2010)
23. Orrite, C., Martinez, F., Herrero, E., Ragheb, H., Velastin, S.A.: Independent viewpoint silhouette-based human action modeling and recognition. In: MLVMA 2008 (2008) 24. Qu, H., Wang, L., Leckie, C.: Action Recognition Using Space-Time Shape Difference Images. In: ICPR 2010 (2010) 25. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall, Inc., Englewood Cliffs (1993) 26. Richard, S., Kyle, P.: Viewpoint manifolds for action recognition. EURASIP Journal on Image and Video Processing (2009) 27. Ryoo, M.S., Aggarwal, J.K.: Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. In: ICCV 2009 (2009) 28. Satkin, S., Hebert, M.: Modeling the temporal extent of actions. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 536–548. Springer, Heidelberg (2010) 29. Thi, T.H., Zhang, J.: Human Action Recognition and Localization in Video using Structured Learning of Local Space-Time Features. In: AVSS 2010 (2010) 30. Turaga, P., Veeraraghavan, A., Chellappa, R.: Statistical analysis on stiefel and grassmann manifolds with applications in computer vision. In: CVPR 2008, pp. 1–8 (2008) 31. Wang, L., Suter, D.: Visual learning and recognition of sequential data manifolds with applications to human movement analysis. Computer Vision and Image Understanding 110(2), 153–172 (2008) 32. Waltisberg, D., Yao, A., Gall, J., Van Gool, L.: Variations of a hough-voting action recognition system. In: Ünay, D., Çataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 306–312. Springer, Heidelberg (2010) 33. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104(2-3), 249–257 (2006) 34. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views using 3d exemplars. In: ICCV 2007 (2007) 35. Weinland, D., Özuysal, M., Fua, P.: Making Action Recognition Robust to Occlusions and Viewpoint Changes. In: ECCV 2010 (2010) 36. The Weizzman Database, http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html 37. Yan, P., Khan, S., Shah, M.: Learning 4D action feature models for arbitrary view action recognition. In: CVPR 2008 (2008) 38. Yao, A., Gall, J., Van Gool, L.: A Hough Transform-Based Voting Framework for Action Recognition. In: CVPR 2010 (2010) 39. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. NIPS 14, 585–591 (2001)
Near-Optimal Time Function for Secure Dynamic Visual Cryptography V. Petrauskiene1 , J. Ragulskiene2 , E. Sakyte1 , and M. Ragulskis1 1
Research Group for Mathematical and Numerical Analysis of Dynamical Systems, Kaunas University of Technology, Studentu 50-222, Kaunas LT-51368, Lithuania 2 Kauno Kolegija, Pramones 20, Kaunas LT-50468, Lithuania
Abstract. The strategy for the selection of an optimal time function for dynamic visual cryptography is presented in this paper. Evolutionary algorithms are used to obtain the symmetric piece-wise uniform density function. The fitness function of each chromosome is associated with the derivative of the standard of the time-averaged moiré image. The reconstructed near-optimal time function represents the smallest interval of amplitudes where an interpretable moiré pattern is generated in the time-averaged image. Such time functions can be effectively exploited in computational implementation of secure dynamic visual cryptography.
1
Introduction
Visual cryptography is a cryptographic technique which allows visual information (pictures, text, etc.) to be encrypted in such a way that the decryption can be performed by the human visual system, without the aid of computers. Visual cryptography was pioneered by Naor and Shamir in 1994 [1]. They demonstrated a visual secret sharing scheme, where an image was broken up into n shares so that only someone with all n shares could decrypt the image, while any n − 1 shares revealed no information about the original image. Each share was printed on a separate transparency, and decryption was performed by overlaying the shares. When all n shares were overlaid, the original image would appear. Since 1994, many advances in visual cryptography have been made. An efficient visual secret sharing scheme for color images is proposed in [2]. Halftone visual cryptography based on the blue noise dithering principles is proposed in [3]. Basis-matrices-free image encryption by random grids is developed in [4]. A generic method that converts a visual cryptography scheme into another visual cryptography scheme that has the property of cheating prevention is implemented in [5]. Colored visual cryptography without color darkening is developed in [6]. Extended visual secret sharing schemes have been used to improve the quality of the shadow image in [7]. Geometric moiré [8,9] is a classical in-plane whole-field non-destructive optical experimental technique based on the analysis of visual patterns produced by superposition of two regular gratings that geometrically interfere. Examples of gratings are equispaced parallel lines, concentric circles or arrays of dots. The
gratings can be superposed by double exposure photography, by reflection, by shadowing, or by direct contact [10,11]. Moiré patterns are used to measure variables such as displacements, rotations, curvature and strains throughout the viewed area. Two basic goals exist in moiré pattern research. The first is the analysis of moiré patterns. Most of the research in moiré pattern analysis deals with the interpretation of experimentally produced patterns of fringes and the determination of displacements (or strains) at centerlines of appropriate moiré fringes [8]. The other goal is moiré pattern synthesis, when the generation of a certain predefined moiré pattern is required. The synthesis process involves producing two images such that the required moiré pattern emerges when those images are superimposed [12]. Moiré synthesis and analysis are tightly linked, and understanding one task gives insight into the other. An image hiding method based on time-averaging moiré is proposed in [13]. This method is based not on static superposition of moiré images, but on time-averaging geometric moiré. This method generates only one picture; the secret image can be interpreted by the naked eye only when the original encoded image is harmonically oscillated in a predefined direction at a strictly defined amplitude of oscillation. Only one picture is generated, and the secret is leaked from this picture when the parameters of the oscillation are appropriately tuned. In other words, the secret can be decoded by trial and error – if only one knows that he has to shake the slide. Therefore, additional image security measures are implemented in [13], particularly splitting of the encoded image into two shares. An image encoding method which reveals the secret image not only at exactly tuned parameters of the oscillation, but also requires that the time function determining the process of oscillation comply with specific requirements, is developed in Ref. [14]. This image hiding method based on time-averaging moiré and non-harmonic oscillations does not reveal the secret image at any amplitude of harmonic oscillations. Instead, the secret is leaked only at carefully chosen parameters of this specific time function (when the density function of the time function is a symmetric uniform density function). The main objective of this manuscript is to propose such a time function (used to decrypt the secret image) which would ensure the optimal security of the encoded image. The security of the encoded image is measured in terms of the local variation of grayscale levels in the surrounding of a time-averaged fringe which is exploited to reveal the secret. This paper is organized as follows. Initial definitions are presented in section 2; the optimization problem is discussed in section 3; computational experiments and concluding remarks are given in section 4.
2
Initial Definitions
A one-dimensional moiré grating is considered in this paper. We will use a stepped grayscale function defined as follows [14]:
F(x) = 1, when x ∈ [λj; λ(j + 1/2));  F(x) = 0, when x ∈ [λ(j + 1/2); λ(j + 1));  j = 0, ±1, ±2, . . .   (1)
and λ is the pitch of the moiré grating.
Definition 1. The time averaging operator Hs reads [15]:
Hs(F; ξs) = lim_{T→∞} (1/T) · ∫_0^T F(x − ξs(t)) dt;   (2)
where t is time, T is the exposure time, ξs(t) is a function describing the dynamic deflection from the state of equilibrium, s ≥ 0 is a real parameter and x ∈ R.
Definition 2. The standard of a time-averaged grayscale grating function reads [14]:
σ(s) = σ(Hs(F(x), ξs)) = √( (1/λ) · ∫_0^λ ( Hs(F(x), ξs) − E(Hs(F(x), ξs)) )² dx )   (3)
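As an illustration of Definitions 1 and 2, the following minimal numerical sketch (not from the paper) approximates the time-averaged grating Hs and its standard σ(s) by sampling the deflection ξs from its density function; the pitch λ = 1 and the sample counts are assumptions chosen only for the demonstration.

```python
import numpy as np

def time_averaged_grating(x, deflections, pitch=1.0):
    """Approximate Hs(F; xi_s)(x) of eq. (2): average the stepped grating
    F(x - xi) over sampled deflections xi."""
    def F(u):
        # stepped grating of eq. (1): 1 on the first half of every pitch, 0 on the second
        return ((u / pitch) % 1.0 < 0.5).astype(float)
    xs = x[None, :] - deflections[:, None]      # x - xi_s(t) for every sample
    return F(xs).mean(axis=0)                   # time average

def standard_of_grating(h, x, pitch=1.0):
    """Standard sigma of the time-averaged grating over one pitch, eq. (3)."""
    hp = h[(x >= 0) & (x < pitch)]
    return np.sqrt(np.mean((hp - hp.mean()) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 1000, endpoint=False)
    s = 0.5                                      # oscillation amplitude in pitch units
    xi = rng.uniform(-s, s, size=4000)           # uniform density on [-s, s]
    h = time_averaged_grating(x, xi)
    print("sigma(s = lambda/2), uniform density:", standard_of_grating(h, x))
```

For the uniform density with s = λ/2 the printed standard is close to zero, which is consistent with the statement in Section 3 that the first time-averaged fringe forms at s = λ/2.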
We will consider a piece-linear function ξs(t) as a realization of ξs, whose density function ps(x) satisfies the following requirements: (i) ps(x) = 0 when |x| > s; s > 0; (ii) ps(x) = ps(−x) for all x ∈ R. We will assume that the density function ps(x) comprises 2n equispaced columns symmetrically distributed in the interval [−s; s] (Fig. 1). Due to the symmetry we will consider the vector (γ1, γ2, . . . , γn) representing the right half of the density function (γi denotes the area of the i-th column).
Corollary 1. The Fourier transform of a piece-wise uniform density function reads:
Ps(Ω) = (2n / (Ω·s)) · p1(Ω);   (4)
Fig. 1. A piece-wise uniform density function comprising 2n equispaced columns. The density is described by the weight-vector (γ1 , γ2 , . . . , γn ); γi is the area of the i-th column.
where
p1(Ω) = (γ1 − γ2) sin(sΩ/n) + (γ2 − γ3) sin(2sΩ/n) + . . . + (γn−1 − γn) sin((n − 1)sΩ/n) + nγn sin(sΩ).
The derivative of the Fourier transform Ps(Ω) with respect to the amplitude s reads:
P′s(Ω) = (2/s)·p2(Ω) − (2n/(Ωs²))·p1(Ω);   (5)
where
p2(Ω) = (γ1 − γ2) cos(sΩ/n) + (γ2 − γ3) cos(2sΩ/n) + . . . + (γn−1 − γn) cos((n − 1)sΩ/n) + nγn cos(sΩ).
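The sketch below evaluates Ps(Ω) and its derivative with respect to s numerically, directly from the piece-wise uniform density, rather than through the closed forms (4)-(5); this is a hedged illustration only, and the function names, the quadrature resolution and the finite-difference step are assumptions. The weight vector is the near-optimal one reported in Section 3.

```python
import numpy as np

def density_fourier(omega, gammas, s, n_quad=4000):
    """Numerical Fourier (cosine) transform P_s(Omega) of the symmetric
    piece-wise uniform density whose right-half column areas are
    (gamma_1, ..., gamma_n) on [0, s]; cf. eqs. (4)-(5)."""
    gammas = np.asarray(gammas, dtype=float)
    gammas = gammas / (2.0 * gammas.sum())          # right half integrates to 1/2
    n = gammas.size
    dx = s / n_quad
    x = np.arange(n_quad) * dx + dx / 2.0           # midpoints on [0, s]
    col = np.minimum((x / (s / n)).astype(int), n - 1)
    p = gammas[col] / (s / n)                       # piece-wise constant density
    # symmetric density: P_s(Omega) = 2 * integral_0^s p(x) cos(Omega x) dx
    return 2.0 * np.sum(p * np.cos(omega * x)) * dx

def density_fourier_ds(omega, gammas, s, eps=1e-4):
    """Finite-difference approximation of dP_s(Omega)/ds."""
    return (density_fourier(omega, gammas, s + eps)
            - density_fourier(omega, gammas, s - eps)) / (2 * eps)

if __name__ == "__main__":
    gammas = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 48]  # near-optimal weights from Section 3
    lam = 1.0
    omega = 2 * np.pi / lam
    for s in (0.20, 0.25, 0.30):
        print(s, density_fourier(omega, gammas, s), density_fourier_ds(omega, gammas, s))
```

With these strongly edge-weighted columns the transform behaves close to cos(sΩ) and changes sign near s = λ/4, in line with the limiting case discussed at the end of Section 3.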
Corollary 2. If a periodic grayscale function can be expanded into a Fourier series:
F(x) = a0/2 + Σ_{k=1}^{∞} ( ak cos(2kπx/λ) + bk sin(2kπx/λ) ),   ak, bk ∈ R,   (6)
then, according to [14],
H(F(x), ξs(t)) = a0/2 + Σ_{k=1}^{∞} ( ak cos(2kπx/λ) + bk sin(2kπx/λ) ) · Ps(2kπ/λ).   (7)
Elementary transformations help to compute the average of a time-averaged grayscale grating function:
E(H(F(x), ξs(t))) = a0/2;   (8)
its standard:
σ(Hs(F(x), ξs)) = (√2/2) · √( Σ_{k=1}^{∞} (ak² + bk²) · Ps²(2kπ/λ) );   (9)
and the derivative of the standard, which is used as a measure of the encryption security (detailed reasoning is given in the next section):
σ′s(Hs(F(x), ξs)) = (√2/2) · ( Σ_{k=1}^{∞} (ak² + bk²) · Ps(2kπ/λ) · P′s(2kπ/λ) ) / √( Σ_{k=1}^{∞} (ak² + bk²) · Ps²(2kπ/λ) ).   (10)
3
The Construction and Solving of Optimization Problem
It is well known [14] that time-averaged moiré fringes do not develop when a stepped moiré grating is oscillated harmonically. On the other hand, time-averaged fringes do form when a stepped moiré grating (1) is oscillated by a time function whose density function is a piece-wise uniform function comprising 2n equispaced columns. The clearest moiré fringe forms at the amplitude of oscillations corresponding to the first root of the Fourier transform of the density function [14]. For the uniform density function the first time-averaged moiré fringe forms at s = λ/2: the standard of the time-averaged moiré grating is equal to zero there. The roots of the Fourier transform (eq. (4)) of the piece-wise uniform density function are spread out periodically as well. Then the following question arises: which density function – uniform or piece-wise uniform – is better with respect to the security of information encryption? It is clear that the magnitude of the derivative of the standard at the amplitude corresponding to the formation of the first moiré fringe can be considered as a measure of the encryption security. Thus, the following problem of combinatorial optimization is considered: find a vector (γ1, γ2, . . . , γn) maximizing the target function
σ′s(s = λ/2) = (√2/2) · ( Σ_{k=1}^{∞} (ak² + bk²) · Ps(2kπ/λ) · P′s(2kπ/λ) ) / √( Σ_{k=1}^{∞} (ak² + bk²) · Ps²(2kπ/λ) ),   (11)
with the following constraints in force: Σ_{i=1}^{n} γi = 1/2 and γi > 0; i = 1, 2, . . . , n.
In order to reduce the computational costs of the problem we analyze an integer programming problem instead: we seek integer values of γ1, γ2, . . . , γn and then normalize them with respect to 2·Σ_{i=1}^{n} γi, i.e. we use the vector (γ1, γ2, . . . , γn) / ( 2·Σ_{i=1}^{n} γi ).
The sum H = γ1 + γ2 + . . . + γn is fixed (following the properties of the density function), which yields:
σ′s(s = λ/2) = (√2/2) · ( Σ_{k=1}^{∞} (ak² + bk²) · Ps(2kπ/λ) · P′s(2kπ/λ) ) / √( Σ_{k=1}^{∞} (ak² + bk²) · Ps²(2kπ/λ) ) → max,   (12)
subject to
Σ_{i=1}^{n} γi = H;   (13)
γi > 0, i = 1, 2, . . . , n;   (14)
where γi, i = 1, 2, . . . , n, and H ∈ N.
It can be noted that the quantity of vectors (γ1, γ2, . . . , γn) satisfying constraints (13) and (14) is equal to Nγ = (H−n+1)(H−n+2)/2.
We will use evolutionary algorithms for solving problem (12)-(14). Every chromosome represents a vector (γ1, γ2, . . . , γn). The length of each chromosome is 12 and the sum H = 60, i.e. a gene is an integer between 1 and 49. The width of the columns is fixed, thus the magnitude of a gene is proportional to the height of the corresponding column. The fitness of a chromosome is estimated by σ′s(s = λ/2). The initial population comprises N randomly generated chromosomes. Each chromosome in the initial population was generated in such a way that requirements (13) and (14) hold true. All chromosomes (γ1, γ2, . . . , γn) lie on the hyperplane described by equation (13) and inequalities (14). The procedure for generating a chromosome is the following (a short sketch of this procedure is given after the list):
– generate an integer γ1 distributed uniformly over the interval [1; H − n + 1];
– generate an integer γ2 distributed uniformly over [1; H − n + 1 − γ1];
– . . .
– generate γn−1 distributed uniformly over [1; H − n + 1 − Σ_{i=1}^{n−2} γi];
– calculate the gene γn = H − n + 1 − Σ_{i=1}^{n−1} γi.
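A minimal sketch of this generation procedure follows. It is not the authors' code; for simplicity the last gene here simply takes the remaining mass so that the genes sum to H, so the exact bound bookkeeping of the last bullet above may differ slightly.

```python
import random

def random_chromosome(n=12, H=60, rng=random):
    """Generate one chromosome (gamma_1, ..., gamma_n) of positive integers
    summing to H, following the sequential procedure described above."""
    genes = []
    remaining = H
    for k in range(n - 1):
        # keep at least one unit for each gene still to be generated
        upper = remaining - (n - 1 - k)
        g = rng.randint(1, upper)
        genes.append(g)
        remaining -= g
    genes.append(remaining)        # last gene takes what is left (assumption, see lead-in)
    return genes

if __name__ == "__main__":
    c = random_chromosome()
    print(c, sum(c))               # the sum is always H = 60
```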
Replications are allowed in the initial population. Therefore the chromosomes (γ1, γ2, . . . , γn) are distributed uniformly over the hyperplane described by eq. (13) and eq. (14), and the probability for any particular chromosome to be selected into the initial population is uniform and equals 1/(H−n+1) · 1/(H−n) · . . . · 1/2 · 1 = 1/(H−n+1)!.
The fitness of each chromosome is evaluated and an even number of chromosomes is selected into the mating population (the size of the mating population is equal to the size of the initial population). We use a random roulette method for the selection of chromosomes; the chance that a chromosome will be selected is proportional to its fitness value. All chromosomes are paired randomly when the process of mating is over. The crossover between two chromosomes is executed for all pairs in the mating population. We use a one-point crossover method and the location of this point is random. We introduce a crossover coefficient κ which characterizes the probability that the crossover procedure will be executed for a pair of chromosomes. If a chromosome violates condition (13) after crossover, a norming procedure is applied:
(γ′1, γ′2, . . . , γ′n) = ( round( H·γ1 / Σ_{i=1}^{n} γi ), round( H·γ2 / Σ_{i=1}^{n} γi ), . . . , round( H·γn / Σ_{i=1}^{n} γi ) ).   (15)
If the new chromosome (γ′1, γ′2, . . . , γ′n) violates condition (14), it is rounded to the nearest (H − n + 1)-digit number formed from the n columns.
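A hedged sketch of the norming step (15) is given below. The repair of the rounding residue (absorbing it in the largest gene) is an assumption of this sketch, not a step taken from the paper.

```python
def norm_chromosome(genes, H=60):
    """Re-scale a chromosome so that its genes sum to H, as in the norming
    procedure (15); genes are kept integer and >= 1."""
    total = sum(genes)
    scaled = [max(1, round(H * g / total)) for g in genes]
    # rounding can leave a small surplus/deficit; absorb it in the largest gene
    # (one possible repair consistent with conditions (13)-(14); an assumption)
    scaled[scaled.index(max(scaled))] += H - sum(scaled)
    return scaled

print(norm_chromosome([3, 3, 3, 3, 3, 3, 3, 3, 6, 3, 3, 90]))
```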
In order to avoid convergence to a single local solution a mutation procedure is used. The mutation parameter μ (0 < μ < 1) determines the probability for a chromosome to mutate. round(μ·N) chromosomes are randomly selected for mutation and one gene of each selected chromosome is changed by adding a random number distributed uniformly over the interval [1; H − n + 1]. The norming procedure is applied to the mutated chromosomes. The following parameters of the evolutionary algorithm must be pre-selected: the crossover coefficient κ, the mutation parameter μ and the size of the population N. In order to tune the parameters κ and μ we construct an artificial problem – we seek the best density function comprising 6 columns (the length of a chromosome is 3) and H = 15. The optimal solution of this problem (found by full sorting) is the vector (1; 1; 13) and its fitness equals σ′s(s = λ/2) = 0.656506722318812. Now the evolutionary algorithm is run for the same problem; the population size is set to N = 20, which corresponds to N/Nγ = 2·20/((15−3+1)(15−3+2)) = 40/182 ≈ 22.99% of all chromosomes. We select the parameters κ and μ according to the frequency of the optimal solution (1; 1; 13) in the population and according to the mean value of the fitness function. Three independent trials of the evolutionary algorithm containing 5 generations each were executed. The number of successful trials and the mean value of the fitness function of the population are highest at κ = 0.6 and μ = 0.05. Thus we fix these parameter values of the evolutionary algorithm and we seek a piece-wise uniform density function comprising 24 columns with H = 60 (it is unrealistic to solve such a problem using brute-force full sorting strategies). The number of possible solutions is Nγ = (60−12+1)(60−12+2)/2 = 1225. The size of the population is N = 300, which comprises N/Nγ = 300/1225 ≈ 24.49% of all chromosomes. The number of generations is set to 50 and the evolutionary algorithm is executed 5 times. The near-optimal set of γk, k = 1, 2, . . . , 12 reads [1; 1; 1; 1; 1; 1; 1; 1; 2; 1; 1; 48]/120; the near-optimal time function ξ(t) is shown in Fig. 2.
Fig. 2. The near-optimal time function ξ (t) as a realization of the near-optimal density function comprising 24 columns at H=60
Fig. 3. The secret image
Fig. 4. The secret image encoded into the background moiré grating
Computational results show that the optimal density function gains maximal values at x = s and x = −s. In the limiting case the optimal density function reads:
p(x) = (1/2)·δ−s(x) + (1/2)·δs(x)   (16)
where δx0(x) is a delta impulse function at x0. It can be noted that then
Ps(Ω) = ∫_{−∞}^{+∞} (1/2)( δs(x) + δ−s(x) ) e^{−ixΩ} dx = cos(s·Ω),
and the first time averaged fringe will form at s = λ/4.
4
Computational Experiments and Concluding Remarks
Computational experiments using the optimal time function with the proposed scheme of dynamic visual cryptography are performed using the secret image shown in Fig. 3. The secret image is encoded into a stepped stochastic moiré background using phase regularization and initial phase randomization algorithms [13]. The secret image can be decrypted using the optimal time function shown in Fig. 2
Fig. 5. Contrast enhancement of the decrypted image
at s = λ/4 = 0.39 mm (contrast enhancement algorithms [16] have been used to make the decrypted image clearer). An optimal time function ensuring the highest security of the encoded image in the scheme based on dynamic visual cryptography is proposed. The optimality criterion is based on the derivative of the standard of the time-averaged image. It is shown that an interplay of extreme deflections from the state of equilibrium can be considered as a near-optimal realization of the decoding phase and can be effectively exploited in computational implementations of secure dynamic visual cryptography. Acknowledgments. Partial financial support from the Lithuanian Science Council under project No. MIP-041/2011 is acknowledged.
References 1. Naor, M., Shamir, A.: Visual cryptography. In: De Santis, A. (ed.) EUROCRYPT 1994. LNCS, vol. 950, pp. 1–12. Springer, Heidelberg (1995) 2. Shyu, S.: Efficient visual secret sharing scheme for color images. Pattern Recognit. 39, 866–880 (2006) 3. Zhou, Z., Arce, G., Crescenzo, D.: Halftone visual cryptography. IEEE Trans. Image Process. 15, 2441–2453 (2006) 4. Shyu, S.: Image encryption by random grids. Pattern Recognit. 40, 1014–1031 (2007) 5. Hu, C., Tseng, W.: Cheating prevention in visual cryptography. IEEE Trans. Image Process 16, 36–45 (2007) 6. Cimato, S., De Prisco, R., De Santis, A.: Colored visual cryptography without color darkening. Theor. Comput. Sci. 374, 261–276 (2007) 7. Yang, C.N., Chen, T.S.: Extended visual secret sharing schemes: improving the shadow image quality. Int. J. Pattern Recognit. Artificial Intelligence 21, 879–898 (2007) 8. Kobayashi, A.S.: Handbook on Experimental Mechanics, 2nd edn. SEM, Bethel (1993) 9. Patorski, K., Kujawinska, M.: Handbook of the moir´e fringe technique. Elsevier, Amsterdam (1993) 10. Post, D., Han, B., Ifju, P.: High sensitivity moir´e: experimental analysis for mechanics and materials. Springer, Berlin (1997)
11. Dai, F.L., Wang, Z.Y.: Geometric micron moir´e. Opt. Laser Eng. 31, 191–208 (1999) 12. Desmedt, Y., Van Le, T.: Moir´e cryptography. In: 7th ACM Conf. on Computer and Communications Security, pp. 116–124 (2000) 13. Ragulskis, M., Aleksa, A.: Image hiding based on time-averaging moir´e. Optics Communications 282, 2752–2759 (2009) 14. Ragulskis, M., Aleksa, A., Navickas, Z.: Image hiding based on time-averaged fringes produced by non-harmonic oscillations. J. Opt. A: Pure Appl. Opt. 11, 125411 (2009) 15. Ragulskis, M., Navickas, Z.: Hash functions construction based on time average moir´e. J. Discrete and Continuous Dynamical Systems-Series B 8, 1007–1020 (2007) 16. Ragulskis, M., Aleksa, A., Maskeliunas, R.: Contrast enhancement of time-averaged fringes based on moving average mapping functions. Optics and Lasers in Engineering 47, 768–773 (2009)
Vision-Based Horizon Detection and Target Tracking for UAVs Yingju Chen, Ahmad Abushakra, and Jeongkyu Lee Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
Abstract. Unmanned Aerial Vehicle (UAV) has been deployed in a variety of applications like remote traffic surveillance, dangerous area observation, and mine removal, since it is able to overcome the limitations of ground vehicles. It can also be used for traffic controlling, border patrolling, accident and natural disaster monitoring for search and rescue purpose. There are two important tasks in the UAV system, automatic stabilization and target tracking. Automatic stabilization makes a UAV fully autonomous, while target tracking alleviates the overhead of a manual system. In order to address these, we present computer vision based horizon detection and target tracking for the videos captured by UAV camera. The proposed horizon detection algorithm is an enhancement of the Cornall’s Theorem and our target tracking employs optical flow. The results of both real and simulated videos show that the proposed algorithms are promising.
1
Introduction
Unmanned Aerial Vehicle (UAV) is categorized as an aerospace system that implements reconnaissance, as well as aerial robotic vehicles [1, 2]. The UAV system is used in a variety of applications such as surveillance systems, object recognition, dangerous area observation, maritime surveillance, and mine removal [2–4] because it has the capacity of overcoming the limitations of ground robotic vehicles in reaching the right locations for surveillance and monitoring [2]. There are two important tasks in the UAV system, namely automatic stabilization and target tracking. Automatic stabilization, i.e., roll angle estimation, makes a UAV fully autonomous, while target tracking alleviates the overhead of a manual system. The target objects include cars, roads, buildings, and any objects captured in a video. In [5] the authors used a circular mask to reduce image asymmetry and to simplify the calculation of the horizon position. Yuan et al. [4] proposed a method to detect the horizon in foggy conditions. The algorithm is based on the dark channel described by He Kaiming. The horizon detection relies on the distinct intensity distributions of the sky and ground pixels. They defined an energy function with the second and the third terms representing the energy of the sky and ground regions, respectively.
The existing tracking techniques achieve good performance with a stationary camera and good image quality. However, the fast moving camera in a UAV often results in abrupt discontinuities in motion, which makes target tracking a very challenging task. In [6], the authors integrated a spatio-temporal segmentation and a modified statistical snake algorithm for detecting and tracking stationary and moving objects. In [7], the authors used multiple thresholds to segment the motion region, and they then extracted the motion region to compute its center of gravity. In this paper, we present a vision based system for horizon detection and target tracking for UAVs. Our main contribution of this work is the horizon detection, which is an enhancement over the Cornall's Theorem. As for target tracking, we investigated and implemented two popular algorithms, namely Continuously Adaptive Mean SHIFT (CAMShift) [8] and Lucas & Kanade [9] algorithms. The remainder of this paper is organized as follows: Section 2 describes our proposed enhancements for horizon detection. Section 3 presents our target tracking algorithms. Section 4 is the experimental results and discussion. Finally, Section 5 is our concluding remarks.
2
Horizon Detection
In this section, we describe Cornall's Theorem and our enhancements for horizon detection. Cornall's Theorem is the basis of our detection algorithm, and our enhancements include color transformation, adaptive threshold adjustment and noise reduction. 2.1
Cornall’s Theorem
A horizon can be detected by measuring the roll angle of the UAV. For example, the horizon in Fig. 1 is the line CD and it can be represented by the angle φ. The measurement of the horizon angle relies on properly classifying the pixels within a circular mask, laid in the center of the current image, into sky or ground classes. In this figure, the line AB that connects the sky centroid 'A' and the ground centroid 'B' bisects both the sky and ground classes at a right angle, regardless of the roll and pitch angles of the aircraft, as long as the horizon in view is a straight line. Cornall's Theorem is proved by Cornall et al. in [5] and it is defined as: Theorem 1. For a circular viewport, the line joining the centroids of the sky and ground classes will bisect the horizon at a right angle, regardless of the roll angle and of the pitch angle, as long as the horizon makes a straight line in the view. Proof. Omitted. In [5], the image is converted into a gray-scale image and the pixels are classified using a predetermined threshold. Once all the pixels in the circular mask are
Fig. 1. The horizon is perpendicular to the line joining the sky and ground centroids. In this figure, ‘A’ represents the sky centroid and ‘B’ is the ground centroid.
classified, the centroids of the sky and ground classes are obtained by calculating the average coordinates. The gradient of the horizon is calculated as the inverse of the gradient of the line that joins the sky and ground centroids, and the gradient of the horizon m is defined as follows:
m = (XS − XG) / (YS − YG)   (1)
where (XS, YS) and (XG, YG) are the coordinates of the sky and ground centroids, respectively. From (1), the horizon angle φ, i.e., the detected horizon, can be computed as:
φ = arctan(m) = arctan( (XS − XG) / (YS − YG) )   (2)
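The following minimal numpy sketch (not the authors' implementation) applies eqs. (1)-(2) to a boolean sky mask: it restricts the pixels to a circular viewport, computes the two centroids and returns the roll angle. The small epsilon guarding the division and the toy input are assumptions of this sketch.

```python
import numpy as np

def roll_angle_from_sky_mask(sky_mask):
    """Estimate the horizon (roll) angle from a boolean sky mask using
    Cornall's Theorem: the horizon is perpendicular to the line joining
    the sky and ground centroids, eqs. (1)-(2)."""
    h, w = sky_mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # circular viewport centred in the image, as required by the theorem
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= (min(h, w) / 2.0) ** 2
    sky = sky_mask & circle
    ground = (~sky_mask) & circle
    xs, ys = xx[sky].mean(), yy[sky].mean()        # sky centroid (X_S, Y_S)
    xg, yg = xx[ground].mean(), yy[ground].mean()  # ground centroid (X_G, Y_G)
    m = (xs - xg) / (ys - yg + 1e-12)              # eq. (1); epsilon avoids division by zero
    return np.degrees(np.arctan(m))                # eq. (2), reported in degrees

if __name__ == "__main__":
    # toy example: upper half of the frame is "sky" -> roll angle close to 0 degrees
    mask = np.zeros((200, 200), dtype=bool)
    mask[:100, :] = True
    print(roll_angle_from_sky_mask(mask))
```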
2.2
Our Proposed Enhancements
Cornall's Theorem works well in most cases; however, there are some disadvantages in the original approach. For example, a gray-scale image is not robust for sky-ground discrimination because the light intensity of the sky is not constant. On a sunny day, the sky intensity may appear low and misclassification may occur. In addition, a fixed threshold cannot effectively respond to different sky and weather patterns. To address these issues, we propose the following enhancements to Cornall's Theorem: (1) an enhanced ground model using the CMYK color space; (2) adaptive threshold selection by sampling the four corners; and (3) noise reduction using connected components. We describe the details of each part in the following paragraphs. An Enhanced Ground Model Using CMYK Color Space. Cornall's Theorem has a low computational complexity; however, binary classification using a gray-scale image is prone to error because the light intensity of the sky varies over time.
Fig. 2. From RGB image to CMYK image to YK image. Among the C, M, Y, K channels, it is obvious that the Y and K channels clearly separate sky and ground; therefore, the combination of these two channels is suitable for discriminating the sky and ground pixels.
In order to address the problem, we explored the CMYK color space. Fig. 2 illustrates the color transformation process that converts an RGB image into the CMYK color space. It is obvious that the yellow (Y) and black (K) channels exhibit an excellent property for modeling the ground because they do not describe the blue color, which is the predominant color representing the sky and ocean. For this reason, the ground can be better modeled using just the Y and K channels. There are several ways to transform an RGB image into a CMYK image [10–12] and our implementation uses: C = 1 − R, M = 1 − G, Y = 1 − B, K = min(C, M, Y), where R, G, and B are normalized to 1. Once the Y and K channels are extracted, an enhanced intensity image of the ground is created using (3). The intensity I_{x,y} of a pixel is defined as follows:
I_{x,y} = ( Y_{x,y} + K_{x,y} ) / 2   (3)
where Y_{x,y} is the pixel value in the Y channel and K_{x,y} in the K channel. Fig. 2 shows a sample output of our featured transformation.
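A minimal sketch of this transformation is given below, assuming the input array is in RGB channel order (OpenCV loads frames as BGR, so a reordering would be needed there). The threshold orientation in the usage example, i.e. that low YK intensity corresponds to sky, is an assumption of the sketch.

```python
import numpy as np

def yk_intensity(rgb):
    """Ground model of eq. (3): I = (Y + K) / 2, with C = 1-R, M = 1-G,
    Y = 1-B, K = min(C, M, Y) and R, G, B normalised to [0, 1]."""
    rgb = rgb.astype(np.float64) / 255.0
    c, m, y = 1.0 - rgb[..., 0], 1.0 - rgb[..., 1], 1.0 - rgb[..., 2]
    k = np.minimum(np.minimum(c, m), y)
    return (y + k) / 2.0

if __name__ == "__main__":
    frame = (np.random.rand(120, 160, 3) * 255).astype(np.uint8)  # stand-in for a video frame
    intensity = yk_intensity(frame)
    # assumed orientation: pixels with low Y+K intensity are treated as sky
    sky_mask = intensity <= 0.83          # threshold value from case 1 of Algorithm 1
    print(intensity.shape, sky_mask.mean())
```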
Adaptive Threshold Selection. In Cornall's theorem [5], a predetermined threshold is used for sky and ground classification. The classification is very sensitive to the threshold; therefore, the value of the threshold should be robust enough for sky-ground classification regardless of the sky light intensity and weather conditions. In [13], the authors applied Otsu's histogram analysis to select the threshold. In our paper, we compute the mean intensity value of four macro blocks, 20 × 20 each, located at the four corners of the YK image. As long as the horizon is present in the image, it divides the image into the sky and the ground. Therefore, at least one of the corners belongs to the sky. Based on this assumption, the threshold is decided from the minimal mean M of the four corners using rules. Based on our empirical results, the if...then... conditions are defined as follows:
Algorithm 1. Adaptive Threshold Selection
Require: M
Ensure: threshold is selected according to M
1: if M ≤ 0.17 then
2:   threshold ← 0.83
3: else if 0.17 < M ≤ 0.25 then
4:   threshold ← 0.78
5: else if 0.25 < M ≤ 1 then
6:   threshold ← 0.3
7: end if
8: return threshold
Case 1 of Algorithm 1 works well during daytime, including rainy days. Case 2 is helpful when we have a deep blue sky and case 3 is a typical case when flying at dusk.
Noise Reduction. Although we have introduced the YK color space and a rule-based approach for threshold selection, it is possible to have some scattered misclassification, i.e., noise. To remove such noise, we utilize connected-component analysis [14]. First we apply the morphological operation open to shrink areas with small noise, followed by the morphological operation close to rebuild the eroded areas of surviving components that were lost in the previous operation. We perform this process once to remove the noise created during the classification process (see Fig. 3). The operation open is carried out by performing erode first followed by dilate, while the operation close is done by performing dilate first and then erode. The two basic morphological transformations, dilation and erosion, are defined as follows:
erode(x, y) = min_{(x′,y′) ∈ kernel} src(x + x′, y + y′)
dilate(x, y) = max_{(x′,y′) ∈ kernel} src(x + x′, y + y′)   (4)
Fig. 3. Noise reduction using connected component. (a) is the before-and-after image when connected component analysis is applied to the gray-scale image and (b) is the image when same process is applied to the YK image.
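A hedged OpenCV sketch of this noise-reduction stage is shown below. The kernel size, the minimum blob area and the toy input are assumptions; only the open/close ordering and the connected-component filtering follow the description above.

```python
import cv2
import numpy as np

def clean_classification(mask, min_area=50):
    """Noise reduction for the binary sky/ground classification:
    morphological open then close (built from the erode/dilate primitives
    of eq. (4)), followed by removal of small connected components."""
    mask = mask.astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # erode then dilate
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # dilate then erode
    # drop any remaining small blobs
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    cleaned = np.zeros_like(mask)
    for i in range(1, num):                                  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == i] = 255
    return cleaned > 0

if __name__ == "__main__":
    noisy = np.random.rand(100, 100) > 0.4                   # stand-in for a raw classification
    print(clean_classification(noisy).mean())
```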
3
Target Tracking
Target tracking is one of the important modules of UAV systems for traffic control, border patrolling, and accident and natural disaster monitoring for search and rescue purposes. For tracking there are two types of algorithms in computer vision, i.e., probabilistic tracking [8] and estimators [15]. A probabilistic tracking algorithm finds a target that matches given properties, while an estimator predicts the state or location of the target over time. In UAV systems, tracking algorithms based on estimators are more effective than probabilistic tracking. Since the camera of a UAV is also moving (i.e., flying) while it is capturing a video, an estimator is more effective for tracking the target objects. For example, the Continuously Adaptive Mean SHIFT (CAMShift) algorithm [8] does not work very well in our application. Not only are there a lot of parameters to tune, but we also need to consider the case when the target has a color distribution similar to the neighboring surface. For this reason, an estimator based on optical flow is selected for our target tracking. First, the target object is selected manually by a user, and then our algorithm tracks the selected target over time. Since the target object will be manually identified by the operator, there is no way to know the object beforehand. Tracking an unidentified object usually involves tracking significant feature points. The feature points selected for tracking are corners. Once the good features to track are obtained using Shi and Tomasi's method, we track these feature points using the pyramid Lucas & Kanade algorithm. The Lucas & Kanade algorithm was initially introduced in [9] and the basic idea rests on the following assumptions: (1) Brightness constancy, where the brightness of a pixel does not change from frame to frame; (2) Temporal persistence, where a small patch on the surface moves slowly; (3) Spatial coherence, where neighboring pixels on the same surface have similar motion. Based on the first assumption, the following equation is defined:
I(x, y, t) = I(x + δx, y + δy, t + δt)   (5)
If we expand I(x, y, t) into a Taylor series and consider the second assumption, we obtain (6), where u and v are the velocities of the x and y components, respectively:
I_x u + I_y v = −I_t   (6)
For most cameras running at 30 Hz, Lucas-Kanade's assumption of small and coherent motion is often violated. In fact, most videos present large and non-coherent motions. Attempting to catch large motion using a large window often breaks the coherent motion assumption. To address this issue, a recommended technique is to use the pyramid model: solve the optical flow at the top layer and then use the estimated motion as the starting point for the next layer until the lowest layer is reached. This way the violations of the motion assumption are minimized and we can track faster and longer motions [14].
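The sketch below pairs Shi-Tomasi corners with pyramidal Lucas-Kanade in OpenCV, as described above. It is a hedged illustration only: the video file name, the target box, the window size, the pyramid depth and the termination criteria are assumptions, not values from the paper.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("uav_clip.avi")                    # assumed input file
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not open the video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi corners inside the user-marked target box (box coordinates assumed)
x, y, w, h = 100, 80, 60, 40
roi_mask = np.zeros_like(prev_gray)
roi_mask[y:y + h, x:x + w] = 255
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.01,
                             minDistance=5, mask=roi_mask)

lk_params = dict(winSize=(21, 21), maxLevel=3,            # 3-level pyramid for large motions
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

while True:
    ok, frame = cap.read()
    if not ok or p0 is None or len(p0) == 0:
        break                                             # lost the target: re-marking is needed
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    p1, st, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None, **lk_params)
    p0 = p1[st.flatten() == 1].reshape(-1, 1, 2)          # keep only successfully tracked points
    prev_gray = gray
cap.release()
```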
Fig. 4. Base station graphical user interface (GUI)
[Fig. 5 consists of two panels: (a) Error distribution per video, (b) Original vs. Proposed; vertical axis: Error (in Degree).]
Fig. 5. Horizon Detection. In (b), video 7 (dusk) was removed from the comparison because the original approach failed to detect the horizon.
4
Experimental Results
To assess the proposed algorithms for horizon detection and target tracking, we developed a graphical user interface, Fig. 4, using Microsoft MFC. With this GUI, the ground operator can watch the first-person view of the UAV and mark the target for tracking. Horizon detection and target tracking are implemented using Open Source Computer Vision (OpenCV) 2.1. OpenCV [16] is a library of programming functions for real-time computer vision using C++. For the ease of validating our modules, the GUI allows two types of input: (1) an AVI file and (2) video streaming from a connected camera. To evaluate our proposed enhancements, we used an off-the-shelf flight simulator to generate the test data set. In addition, we also tested three videos recorded from the wireless UAV camera. In this experiment we have 6 simulated videos over cities, 3 simulated videos over outback regions, and 6 videos over country fields. Three of the field videos are taken from real video recorded using the wireless UAV camera. The cruising speed of the simulated flights varies from 80 MPH to 156 MPH and the relative elevation ranges from 300 FT to 1600 FT. To evaluate the accuracy of horizon detection, we randomly selected a test dataset of 450 images, i.e., 30 images from each video.
Fig. 6. (a) Sample output of horizon detection, where the top three rows are taken from on-board camera videos and the others are simulated videos. (b) Target tracking using the pyramid Lucas-Kanade algorithm.
[Fig. 7 consists of fifteen per-video plots (Video 1 to Video 15); vertical axis: Roll Angle (in Degree); legend: Ground Truth vs. Estimation.]
Fig. 7. Estimated roll angle vs. manual measurement
Then, the detected horizons are compared with the manually obtained ground truth. Fig. 7 plots our estimated roll angles against the manually measured roll angles for each video. In this figure, the estimated roll angles match the measured angles closely. Fig. 5 is a box plot of the error distribution of each video, where the green line is the average error in degrees. The outliers in this figure are caused by the reflection of a distant pond or a withered grass patch. A list of sample outputs of horizon detection is also shown in Fig. 6(a). As mentioned in Section 3, the performance of CAMShift is not good, hence we turned our attention to pyramid Lucas-Kanade. A sample output of using pyramid Lucas-Kanade is available in Fig. 6(b). Based on our observation, the pyramid Lucas-Kanade algorithm works well for modeling still objects (e.g., buildings, landmarks, etc.) or slow moving objects (e.g., a cruise ship); however, for fast moving
objects (e.g., a truck on the freeway) or objects not moving in line with the UAV, it tends to lose the object after a while and manual re-marking of the target is inevitable.
5
Conclusion
There are a variety of applications deployed on UAVs. To relieve the manual overhead involved in these applications, we propose a vision based system that implements horizon detection and target tracking. Horizon detection can help to control the UAV, while target tracking alleviates the overhead of a manual system. The automatic stabilization of a UAV could be implemented using the detected horizon so that the ground operators only need to concentrate on identifying targets for tracking. Unlike Cornall's original approach that classifies the sky and ground using gray-scale images, we describe an enhanced approach for efficient sky-ground classification. First we convert the RGB image to the CMYK color space and then we generate an image that models the ground using the Y and K channels. In order to select an appropriate threshold for sky-ground classification, we sample the four corners of the image using a macro block and then the threshold is determined using rules. After the threshold is selected, we classify the pixels into the sky and ground classes and then we apply connected-component analysis to filter out unwanted noise. With all the pixels classified into the sky and ground classes, we compute the average coordinates as the centroid for both classes. Once the coordinates of the sky and ground centroids are obtained, we compute the horizon angle (or roll angle of the RC aircraft) using Cornall's Theorem, and our experiments showed that our proposed method is promising. The second task implemented in the UAV system is target tracking. In searching for a good solution, we evaluated two algorithms, the CAMShift and pyramid Lucas-Kanade algorithms. Based on our experiments, CAMShift does not work very well in the UAV environment because there are cases when the targets and the neighboring surfaces have similar color distributions. On the other hand, optical flow is generally used to observe the motion of the tracked object; therefore, it suits UAVs well. According to our tests, optical flow is able to model the motion of still objects (e.g., buildings, landmarks) and slow moving objects well, even when the targets have color distributions similar to the neighboring surface. However, for fast moving objects, this algorithm loses the object eventually and the ground operator needs to mark the target for tracking again.
References 1. Zhang, J., Yuan, H.: The design and analysis of the unmanned aerial vehicle navigation and altimeter. In: International Conference on Computer Science and Information Technology, ICCSIT 2008, pp. 302–306 (2008) 2. Merino, L., Caballero, F., Martinez-de Dios, J.R., Ollero, A.: Cooperative fire detection using unmanned aerial vehicles. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2005, pp. 1884–1889 (2005)
3. Caballero, F., Merino, L., Ferruz, J., Ollero, A.: A visual odometer without 3d reconstruction for aerial vehicles. applications to building inspection. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2005, pp. 4673–4678 (2005) 4. Yuan, H.Z., Zhang, X.Q., Feng, Z.L.: Horizon detection in foggy aerial image. In: International Conference on Image Analysis and Signal Processing, IASP 2010, pp. 191–194 (2010) 5. Cornall, T.D., Egan, G.K.: Measuring horizon angle from video on a small unmanned air vehicle. In: 2nd International Conference on Autonomous Robots and Agents (2004) 6. Zhang, S.: Object tracking in unmanned aerial vehicle (uav) videos using a combined approach. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2005, vol. 2, pp. 681–684 (2005) 7. Ding, W., Gong, Z., Xie, S., Zou, H.: Real-time vision-based object tracking from a moving platform in the air. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 681–685 (2006) 8. Bradski, G.R.: Computer vision face tracking for use in a perceptual user interface (1998) 9. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision, pp. 674–679 (1981) 10. Taniguchi, K.: Digital image processing-application (2002) 11. Zheng, N.: Computer Vision and Pattern Recognition. National Defense Industry Press, Beijing (1998) 12. Ford, A., Roberts, A.: Colour space conversions. Westminster University, London (1998) 13. Cornall, T., Egan, G., Price, A.: Aircraft attitude estimation from horizon video. IEE Electronics Letters 42, 744–745 (2006) 14. Bradski, G., Kaehler, A.: Learning OpenCV. O’Reilly Media, Inc., Sebastopol (2008) 15. Han, Z., Ye, Q., Jiao, J.: Online feature evaluation for object tracking using kalman filter. In: Proceedings of IEEE International Conference on Pattern Recognition, pp. 3105–3108 (2008) 16. Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
Bag-of-Visual-Words Approach to Abnormal Image Detection in Wireless Capsule Endoscopy Videos Sae Hwang Department of Computer Science, University of Illinois at Springfield, Springfield, IL, USA
Abstract. One of the main goals of Wireless Capsule Endoscopy (WCE) is to detect mucosal abnormalities such as blood, ulcers, polyps, and so on in the gastrointestinal tract. Typically fewer than 5% of the roughly 55,000 frames of a WCE video contain abnormalities, so it is critical to develop a technique to automatically discriminate abnormal findings from normal ones. We introduce the "Bag-of-Visual-Words" method, which has been used successfully for image classification in non-medical domains. Initially the training image patches are represented by color and texture features, and then the bag-of-words model is constructed by the K-means clustering algorithm. Subsequently each image is represented as the histogram of the visual words, which is the feature vector of the image. Finally, an SVM classifier is trained using these feature vectors to distinguish images with abnormal regions from ones without them. Experimental results on our current data set show that the proposed method achieves promising performance. Keywords: Wireless Capsule Endoscopy, Abnormality, Bag-of-Visual-Words, SVM classifier.
1 Introduction
Wireless Capsule Endoscopy (WCE) is a relatively new technology (FDA approved in 2002) allowing doctors to view most of the small intestine [1]. Previous endoscopic imaging modalities such as colonoscopy, upper gastrointestinal endoscopy, push enteroscopy and intraoperative enteroscopy could be used to visualize up to the stomach, duodenum, colon and terminal ileum, but there existed no method to view most of the small intestine without surgery. With the miniaturization of wireless and camera technologies, the entire gastrointestinal tract can now be examined with little effort. A tiny disposable video capsule is swallowed, which transmits two images per second to a small data receiver worn by the patient. During an approximately 8-hour course, over 55,000 images are recorded, which are then downloaded to a computer for later examination. Typically, a medical clinician spends one or two hours to analyze a WCE video. To reduce the assessment time, it is critical to develop a technique to automatically analyze WCE videos. Most of the research work done on WCE can be divided into three main categories [2-5]: (1) image enhancement, (2) abnormality detection, and (3) video segmentation and frame reduction. In this paper, we study abnormality detection in WCE. The important abnormal lesions (abnormalities) in
WCE are fresh blood (bleeding), ulceration, erosion, angioectasia, polyps, and tumors. In a typical WCE video, less than 5% of frames are abnormal images. Figure 1 shows some abnormal WCE images such as blood, ulcer and polyp.
Fig. 1. Abnormal images: (a) Blood, (b) Ulcer and (c) Polyp
Since there is a large number of images in a video, this examination is an extremely time consuming job for the physician. It limits the general application of WCE and incurs a considerable amount of health-care costs. To address this requirement, we propose a new algorithm utilizing the bag-of-visual-words method [6-9], which has been used successfully in object and scene recognition. Our method focuses on distinguishing regions showing abnormalities such as blood, polyps, and ulcers in WCE images. We treat the abnormal regions as the positive documents and the normal regions as the negative documents. Firstly we extract statistical color features and Gabor-filter-based texture features for each patch in the documents. Secondly we construct the bag of words by the K-means clustering algorithm and represent the documents by histograms of the visual words, which are treated as the feature vectors of the images. Finally, we train an SVM classifier on these feature vectors. Experimental results on our current data set show that the proposed method achieves promising performance, so it can be used to detect abnormal images in practice. The remainder of this paper is organized as follows. The Bag-of-Visual-Words model is discussed in Section 2. The extracted features for the codebook construction are discussed in Section 3 and the support vector machine classifier is discussed in Section 4. We discuss our experimental results in Section 5. Finally, Section 6 presents some concluding remarks.
2 Bag-of-Visual-Words Model
The bag-of-visual-words model is a simplifying representation used in natural language processing and information retrieval, and it has been widely applied in the computer vision field. In general, there are three main steps in the model: (i) obtain the local feature descriptors; (ii) quantize the descriptors into a codebook; (iii) describe the image as a collection of the words. As shown in Figure 2, the procedure includes two parts: learning and recognition. In the learning process, we first obtain the local feature descriptors for each image. Each image is broken into an 8-by-8 grid of patches.
For each patch, color and texture features are computed. Secondly, we quantize these features by the K-means clustering algorithm to form a codebook, and images can be represented as histograms of the visual words, which are the feature vectors of the images. Finally, we train an SVM classifier on these feature vectors. In the recognition process, the test image is represented by the code words and is classified by the SVM classifier.
Fig. 2. Bag-of-words model
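The following hedged sketch mirrors the learning part of Fig. 2 using scikit-learn (an implementation choice assumed here, not named by the paper): a K-means codebook over patch features, histogram-of-words image descriptors, and an SVM classifier. The stand-in data, the small codebook size and the SVM parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(patch_features, k=400, seed=0):
    """Quantize all training patch features (one 51-D row per patch)
    into k visual words with K-means."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(patch_features)

def bow_histogram(image_patch_features, codebook):
    """Represent one image (its 8x8 grid = 64 patch feature rows)
    as a normalized histogram over the visual words."""
    words = codebook.predict(image_patch_features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # stand-in data: 100 training images x 64 patches x 51 features, binary labels
    train_patches = [rng.normal(size=(64, 51)) for _ in range(100)]
    labels = rng.integers(0, 2, size=100)
    codebook = build_codebook(np.vstack(train_patches), k=50)   # small k for the toy data
    X = np.array([bow_histogram(p, codebook) for p in train_patches])
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, labels)
    print(clf.predict(X[:5]))
```

At recognition time the same `bow_histogram` representation is computed for the unknown image and fed to the trained classifier, as in the right branch of Fig. 2.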
3 Visual Feature Extraction
To describe a document, we want to capture the overall texture and color distributions in local regions. We compute region-based features as shown in Figure 3. Each image is broken into an 8-by-8 grid of uniformly sized patches. For each patch, we compute three statistical color features (i.e., the mean, standard deviation, and skewness) in the HSI color space and texture features using Gabor filters, which results in a 51-dimensional feature vector.
Fig. 3. Process of feature extraction
3.1 Color Features
To analyze a color image or color feature, choosing a suitable color space is the primary task and greatly influences the performance of algorithms such as image segmentation, disease diagnosis and so on. When colors are displayed on a computer monitor, they are usually defined in the RGB (red, green and blue) color space, because of its compatibility with additive color reproduction systems. However, the RGB space is usually not suitable for other image processing tasks, because it is far from exhibiting perceptual uniformity, which is defined as numerical differences corresponding to perceptual ones. Furthermore, applying grey-scale algorithms directly to the RGB components of the image may result in color shifts because of the high correlation among the RGB channels of natural images [10]. This kind of color shift is undesired, especially in medical images, because color plays a crucial role in deciding the status of body tissues and organs. The HSI color space has some properties suited to disease detection. Firstly, HSI space is similar to the way human eyes perceive the world. In other words, an image signal is decomposed into chrominance (H and S) and luminance (I) components in HSI space, and this is exactly what happens in the human visual system. This property facilitates color feature extraction because we can extract color information from the chrominance plane and the intensity plane separately. In addition, HSI space shows an outstanding property of color invariance. We compute three statistical features (the mean, standard deviation, and skewness) of each HSI color channel, and the conversion from RGB space to HSI space is calculated as follows:
H = θ if B ≤ G, and H = 360° − θ if B > G, where θ = arccos( (1/2)·((R − G) + (R − B)) / [ (R − G)² + (R − B)(G − B) ]^{1/2} )
S = 1 − ( 3 / (R + G + B) ) · min(R, G, B)
I = (1/3)·(R + G + B)   (1)
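A minimal numpy sketch of the RGB-to-HSI conversion in eq. (1) and of the three color moments per patch is given below. The small epsilon terms and the clipping inside arccos are numerical-stability additions of this sketch, not part of the paper's formulation.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an RGB image (uint8 or float in [0, 1]) to H (degrees), S, I per eq. (1)."""
    rgb = rgb.astype(np.float64)
    if rgb.max() > 1.0:
        rgb = rgb / 255.0
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-10
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + eps
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    H = np.where(B <= G, theta, 360.0 - theta)
    S = 1.0 - 3.0 * np.minimum(np.minimum(R, G), B) / (R + G + B + eps)
    I = (R + G + B) / 3.0
    return H, S, I

def color_moments(channel):
    """Mean, standard deviation and skewness of one channel of a patch."""
    mu = channel.mean()
    sd = channel.std()
    skew = np.mean((channel - mu) ** 3) / (sd ** 3 + 1e-10)
    return mu, sd, skew

if __name__ == "__main__":
    patch = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)   # stand-in for one grid patch
    feats = [m for ch in rgb_to_hsi(patch) for m in color_moments(ch)]
    print(len(feats))                                            # 9 color features per patch
```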
3.2 Texture Features Based on Gabor Filters
Gabor filters have been widely used in image processing over the past two decades. Gabor wavelet kernels have many common properties with mammalian visual cortical cells [11]. These properties are orientation selectivity, spatial localization and spatial frequency characterization. In this sense, Gabor filters offer the best simultaneous localization of spatial and frequency information. A 2-D Gabor filter is an oriented complex sinusoidal grating modulated by a 2-D Gaussian function, which is given by the following:
h(x, y) = g(x, y) exp( 2πj(Ux + Vy) ) = h_R(x, y) + j·h_I(x, y)   (2)
where (U, V) is a spatial frequency, g(x, y) is the Gaussian function with scale parameter σ, and h_R(x, y), h_I(x, y) are the real and imaginary parts of h(x, y), respectively:
g(x, y) = ( 1 / (2πσ²) ) · exp( −(x² + y²) / (2σ²) )   (3)
The Gabor filter is a bandpass filter centered on frequency (U, V) with bandwidth determined by σ. The parameters of a Gabor filter are represented by the spatial frequency (U, V) and scale σ. In general, a radial frequency F (F = √(U² + V²)), orientation θ (θ = tan⁻¹(V/U)) and σ are used instead, in polar coordinates. The Gabor-filtered output of an image i(x, y) is obtained by the convolution of the image with the Gabor function h(x, y) with adjustable parameters (f, θ, σ). We use f = {0, 2, 4, 8, 16, 32, 64}, θ = {0, π/6, π/3, π/2, 2π/3, 5π/6}, and σ = 4 in our experiments.
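The sketch below builds the complex Gabor kernels of eqs. (2)-(3) directly with numpy and summarizes each patch by the mean magnitude of its filter responses. The mapping (U, V) = (f·cosθ, f·sinθ), the interpretation of the frequencies as cycles per patch width, the kernel size and the use of the mean magnitude as the texture statistic are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(f, theta, sigma=4.0, size=31):
    """Complex 2-D Gabor kernel h(x, y) = g(x, y) * exp(2*pi*j*(U*x + V*y)),
    eqs. (2)-(3), with (U, V) = (f*cos(theta), f*sin(theta))."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    U, V = f * np.cos(theta), f * np.sin(theta)
    return g * np.exp(2j * np.pi * (U * x + V * y))

def gabor_features(patch, freqs=(0, 2, 4, 8, 16, 32, 64),
                   thetas=tuple(k * np.pi / 6 for k in range(6))):
    """Mean magnitude of the Gabor-filtered patch for every (f, theta) pair."""
    patch = patch.astype(float)
    w = patch.shape[1]
    feats = []
    for f in freqs:
        for t in thetas:
            k = gabor_kernel(f / w, t)                 # cycles per pixel (assumption)
            resp = fftconvolve(patch, k, mode="same")
            feats.append(np.abs(resp).mean())
    return np.array(feats)

if __name__ == "__main__":
    patch = np.random.rand(32, 32)
    print(gabor_features(patch).shape)                 # 7 x 6 = 42 texture responses
```

Together with the 9 color moments of Section 3.1, the 42 texture responses give the 51-dimensional patch descriptor used in Section 3.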
4 Support Vector Machines
Support vector machines (SVMs) have recently drawn considerable attention in the machine learning community due to their solid theoretical foundation and excellent practical performance. They are kernel-based learning algorithms derived from statistical learning theory [12, 13]. SVMs have several advantages over other classifiers such as decision trees and neural networks. The support vector training mainly involves optimization of a convex cost function; therefore, there is no risk of getting stuck at local minima as in the case of backpropagation neural networks. Most learning algorithms implement the empirical risk minimization (ERM) principle, which minimizes the error on the training data. On the other hand, SVMs are based on the structural risk minimization (SRM) principle, which minimizes the upper bound on the generalization error. Therefore, SVMs are less prone to overfitting when compared to algorithms that implement the ERM principle such as backpropagation neural networks. Another advantage of SVMs is that they provide a unified framework in which different learning machine architectures (e.g., RBF networks, feedforward neural networks) can be generated through an appropriate choice of kernel. Consider a set of n training data points {x_i, y_i} ∈ R^d × {−1, +1}, i = 1, . . . , n, where x_i represents a point in d-dimensional space and y_i is a two-class label. Suppose we have a hyperplane that separates the positive samples from
the negative ones. Then the points x on the hyperplane satisfy w·x + b = 0, where w is the normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. If we take two such hyperplanes between the positive and negative samples, the support vector algorithm's task is to maximize the distance (margin) between them. In order to maximize the margin, ||w||² is minimized subject to the following constraints:
y_i(w·x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0   ∀i   (4)
where ξ_i, i = 1, . . . , n, are positive slack variables for non-linearly separable data. The training samples for which Equation (4) holds are the only ones relevant for the classification; these are called the support vectors. The Lagrangian function for the minimization of ||w||² is given by:
L_k = Σ_{i=1}^{n} α_i − (1/2)·Σ_{i=1}^{n} Σ_{j=1}^{n} y_i y_j α_i α_j K(x_i, x_j),   subject to   0 ≤ α_i ≤ C   and   Σ_{i=1}^{n} α_i y_i = 0   (5)
C is a penalty parameter to control the trade-off between the model complexity and the empirical risk, and K is a kernel function. This formulation allows us to deal with extremely high (theoretically infinite) dimensional mappings without having to do the associated computation. Some commonly used kernels are:
• Linear: K(x_i, x_j) = x_i^T · x_j
• Polynomial: K(x_i, x_j) = (γ x_i^T · x_j + r)^d, γ > 0
• Radial basis function (RBF): K(x_i, x_j) = exp( −γ ||x_i − x_j||² / (2σ²) ), γ > 0
• Sigmoid: K(x_i, x_j) = tanh(γ x_i^T · x_j + r), γ > 0
In this study, the radial basis function (RBF) was adopted for various reasons [14]. Firstly, the linear kernel cannot handle nonlinearly separable classification tasks, and in any case, is a special case of the RBF kernel. Secondly, the computation of the RBF kernel is more stable than that of the polynomial kernel, which introduces values of zero or infinity in certain cases. Thirdly, the sigmoid kernel is only valid (i.e. satisfies Mercer’s conditions) for certain parameters. Finally, the RBF kernel has fewer hyper parameters ( γ ) which need to be determined when compared to the polynomial ( γ , r, d ) and sigmoid kernels ( γ , r ).
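A hedged scikit-learn sketch of training the RBF-kernel SVM on the bag-of-words histograms follows. The grid search over C and γ is an assumption of the sketch (the paper does not describe its parameter-selection procedure), and the data are random stand-ins with the dimensions of Section 5.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: one normalized visual-word histogram per image, y: abnormal (1) / normal (0)
rng = np.random.default_rng(0)
X = rng.random((125, 400))
X = X / X.sum(axis=1, keepdims=True)                 # stand-in training histograms
y = rng.integers(0, 2, size=125)

param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```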
5 Experimental Results
In this section, we assess the effectiveness of the proposed abnormality detection technique. In our experiments, we used a set of 250 images. The set contains 50 polyp
images, 50 blood images, 50 ulcer images and 100 normal (no abnormality) images, each with a resolution of 256 × 256 pixels. The set of images in each class is divided into two categories: training and testing. From each class, 50% of the images were chosen for training and the other half for testing. First, we evaluate the influence of the codebook size because the number of codebook centers is one of the major parameters of the system [15]. Figure 4 shows the accuracy of the abnormality detection for different codebook sizes. There is initially a substantial increase in performance as the codebook size grows. However, there is no improvement when the codebook size is larger than 400, and performance decreases for very large codebooks.
Fig. 4. Abnormality detection accuracy based on different codebook sizes
Table 1 shows the experimental results of our abnormal frame detection technique on the 125 test images when the codebook size is 400. It can be seen that the presented method achieves 77% sensitivity and 91% specificity on average. It can also be seen that polyp detection has the lowest sensitivity (lower than 70%). The reason for this is that polyps have texture and color very similar to normal tissues, making it hard to distinguish them in our data set.

Table 1. Abnormality detection results

Class        Blood  Polyp  Ulcer  Normal  Average
Sensitivity  0.82   0.66   0.74   0.86    0.77
Specificity  0.98   0.95   0.94   0.77    0.91
6 Concluding Remarks

Finding abnormalities in WCE videos is a major concern when a gastroenterologist reviews the videos. In this paper, we propose a novel method for abnormal image
detection in WCE videos based on “bag-of-visual-words” approach. Preliminary experiments demonstrate that the proposed method can classify WCE images into four classes with 77% sensitivity and 91% specificity. By achieving abnormal image detection for blood, polyp, and ulcer, we can reduce reviewing time of the physicians. In the future, we are planning to extend our method to detect more minor abnormalities such as erythema and tumor. We are also considering other features as the visual words such as the geometric shape features.
References
1. Bresci, G., Parisi, G., Bertoni, M., Emanuele, T., Capria, A.: Video Capsule Endoscopy for Evaluating Obscure Gastrointestinal Bleeding and Suspected Small-Bowel Pathology. J. Gastroenterol. 39, 803–806 (2004)
2. Li, B., Meng, M.Q.-H.: Wireless capsule endoscopy images enhancement using contrast driven forward and backward anisotropic diffusion. In: IEEE International Conference on Image Processing (ICIP), San Antonio, Texas, USA, vol. 2, pp. 437–440 (September 2007)
3. Li, B., Meng, M.Q.-H.: Computer-based detection of bleeding and ulcer in wireless capsule endoscopy images by chromaticity moments. Computers in Biology and Medicine 39(2), 141–147 (2009)
4. Hwang, S., Celebi, M.E.: Polyp Detection in Wireless Capsule Endoscopy Videos Based on Image Segmentation and Geometric Feature. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP (2010)
5. Lee, J., Oh, J., Yuan, X., Tang, S.J.: Automatic Classification of Digestive Organs in Wireless Capsule Endoscopy Videos. In: Proc. of the ACM Symposium on Applied Computing, ACM SAC 2007, March 11-15 (2007)
6. Vigo, D.A.R., Khan, F.S., van de Weijer, J., Gevers, T.: The Impact of Color on Bag-of-Words Based Object Recognition. In: International Conference on Pattern Recognition (ICPR), pp. 1549–1553 (August 2010)
7. Gupta, S., Kim, J., Grauman, K., Mooney, R.J.: Watch, listen & learn: Co-training on captioned images and videos. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 457–472. Springer, Heidelberg (2008)
8. Li, W.-J., Yeung, D.-Y.: Localized content-based image retrieval through evidence region identification. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami Beach, Florida, USA, June 20-25 (2009)
9. Fei-Fei, L., Perona, P.: A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Comp. Vis. Patt. Recog. (2005)
10. Li, C.H.: Regularized color clustering in medical image database. IEEE Trans. Med. Imaging 19, 1150–1155 (2000)
11. Webster, M.A., De Valois, R.L.: Relationship between Spatial-Frequency and Orientation Tuning of Striate-Cortex Cells. J. Opt. Soc. Am. A 2, 1124–1132 (1985)
12. Vapnik, V.: Statistical learning theory. Wiley, Chichester (1998)
13. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
14. Keerthi, S.S., Lin, C.-J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15(7), 1667–1689 (2003)
15. Dance, C., Willamowski, J., Fan, L.X., Bray, C., Csurka, G.: Visual categorization with bags of keypoints. In: ECCV Workshop on Statistical Learning in Computer Vision, pp. 1–22 (2004)
A Relevance Feedback Framework for Image Retrieval Based on Ant Colony Algorithm Guang-Peng Chen1, Yu-Bin Yang1, Yao Zhang2, Ling-Yan Pan1, Yang Gao1, and Lin Shang1 1
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China 2 Jinling College, Nanjing University, Nanjing 210089, China [email protected]
Abstract. To utilize users’ relevance feedback is a significant and challenging issue in content-based image retrieval due to its capability of narrowing the “semantic gap” between the low-level features and the higher-level concepts. This paper proposes a novel relevance feedback framework for image retrieval based on Ant Colony algorithm, by accumulating users’ feedback to construct a “hidden” semantic network and achieve a “memory learning” mechanism in image retrieval process. The proposed relevance feedback framework adopts both the generated semantic network and the extracted image features, and then re-weights them in similarity calculation to obtain more accurate retrieval results. Experimental results and comparisons are illustrated to demonstrate the effectiveness of the proposed framework. Keywords: Image Retrieval, Relevance Feedback, Ant Colony Algorithm, Memory Learning, Semantic Network.
1 Introduction
With the rapid growth of the number of digital images both on the Internet and in digital libraries, image retrieval has been actively studied in recent years. Content-based image retrieval (CBIR) techniques were then adopted to help users search for similar images by using low-level image features such as color, texture, shape, and so on. However, the similarity of low-level contents doesn't accurately reflect that of higher-level concepts. This "semantic gap" finally leads to the limited performance of CBIR systems [1]. In order to address this problem, Relevance Feedback (RF) was introduced into CBIR research, attempting to capture users' retrieval request more precisely through iterative and interactive feedbacks. It has been considered as an efficient way to reduce the semantic gap, thus many studies have focused on how to apply RF to improve the performance of CBIR in recent years. Most of the previous RF research can be categorized into the following four types: (a) query point movement, (b) feature relevance re-weighting, (c) machine learning based, and (d) memory learning based. Examples are as follows. An adaptive retrieval approach based on RF, implemented by adopting both query point movement and feature relevance re-weighting strategies, was
proposed in [2]. It used users’ current feedback to set the query vector as the average of all relevant feature vectors, and also re-weighted them according to a set of statistical characteristics [3]. The feature re-weighting method was also employed to select appropriate features. However, due to the lack of memorizing mechanism, it still suffered from a drawback that no user could benefit from other users’ feedback information. Relevance feedback based on memory learning can overcome this drawback by capturing the relationship between low-level features and higher-level concepts as it accumulated the logs of users’ relevant feedbacks [4]-[6]. The retrieval performance can be improved, but it needs at least three different matrices to store the constructed semantic correlations based on users’ retrieval logs. When the number of images is n, the dimensions for each matrix will be n2, which heavily increases both memory space and computation burden for the RF method. To address the issues mentioned above, we proposed a relevance feedback framework for image retrieval by combining a “hidden” semantic network generated by using Ant Colony algorithm and low-level image features, as an improvement on our previous work [7]. In this framework, only one matrix is needed to store users’ relevance feedback in the semantic network constructing process. Then, a novel feature element re-weighting approach based on that semantic network and the user’s current feedback information are designed and implemented. Our experimental results demonstrate that the generated semantic network is able to learn the semantic meaning of the user’s feedback accumulatively, which makes the proposed framework be able to help users retrieve images more effectively. The main contributions of this paper are summarized as follows: 1. A relevance feedback framework based on Ant Colony algorithm, which integrates both semantic network representation and low-level features, is presented. It is effective and can be simply implemented. 2. No query image set is needed. The query vector can be generated automatically with no need for users to select any image as a query set. The generated query vector will move closer and closer to users’ goal as the feedback iteration increases. 3. A novel feature element re-weighting approach is also proposed. It is more effective than traditional feature re-weighting algorithm. The rest of this paper is organized as follows. Section 2 discusses the construction process of the semantic network, and also presents feature re-weighting strategy and the details of the proposed framework for relevance feedback. In Section 3, experimental results are illustrated and analyzed. Finally, conclusion remarks are provided in Section 4.
2 A Framework for Relevance Feedback Based on Memory Learning
In this section, we present a framework for relevance feedback based on memory learning. Firstly, we discuss how to accumulate users’ feedback information to construct a “hidden” semantic network by using Ant Colony algorithm. Then, a feature element re-weighting strategy and the architecture of the framework are presented. The basic assumption of this framework is that two images share similar semantic meanings if they are labeled as “satisfactory”, or “relevant” in the relevance feedback iteration.
2.1 Semantic Network Construction Based on Ant Colony Algorithm
Ant Colony Optimization, a classic simulated evolutionary algorithm, was first proposed by Dorigo et al. in [8] and [9]. The basic principles and mathematical models of the algorithm were described exhaustively in [10], which attracted a great deal of attention and led to many applications of the method [11]. The Ant Colony algorithm is a simulation of ant foraging behavior in nature. A chemical stimulant – pheromone – secreted by the ants is left on the paths where they forage. The more ants pass along a path, the denser the pheromone left on it, and a path with denser pheromone is more attractive to other ants. Besides, the pheromone on a path gradually evaporates over time. It is not difficult to see that this algorithm seeks the optimal solution through the evolution of a group of candidate solutions. Semantic-based image retrieval can be naturally seen as a process of Ant Colony Optimization. The relevance between retrieved images is the candidate solution that needs to be evolved. Each user is seen as an ant, and the process of image retrieval can be considered as a foraging process conducted by an ant. The user retrieves images based on the "pheromone" that previous users left. When the retrieval iteration is completed, new pheromone is left as the output of this user's relevance feedback. As the users' relevance feedback accumulates, a "hidden" semantic network describing the relationships among the images' semantic meanings is constructed gradually, as follows. Firstly, a matrix pheromone is generated based on users' relevance feedback in order to store the semantic correlations among images. Assume N is the total number of images in the database; the dimension of matrix pheromone is N×N, where pheromone(i, j) ∈ [0,1] denotes the semantic correlation between images i and j, and 1 ≤ i ≤ N, 1 ≤ j ≤ N. Matrix pheromone is then initialized as follows:
pheromone(i, j) = { 1, if i = j;  0, if i ≠ j }    (1)
Considering the symmetry of the semantic correlation, a triangular matrix is sufficient to store all of the needed information. When a query is completed by a user, the pheromone matrix is updated according to the following relevance feedback iterations. For instance, if after the tth query the user selects images i and j as "relevant" results in the following relevance feedback iteration, pheromone is left between these two images. It means that images i and j share similar semantics, and the matrix pheromone is updated according to Eq. (2):

pheromone(i, j)t = pheromone(i, j)t−1 + μ (1 − pheromone(i, j)t−1) / length    (2)

where μ (0 < μ < 1) is the pheromone growth factor and length is the number of images that were selected as "relevant" results in the relevance feedback iteration.
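A minimal NumPy sketch of the pheromone reinforcement of Eq. (2) is given below; the array names, the symmetric update and the default value of μ are illustrative assumptions rather than the authors' code.

import numpy as np

def reinforce_pheromone(pheromone, relevant, mu=0.5):
    # Eq. (2): strengthen the correlation between every pair of images marked "relevant".
    length = len(relevant)
    for a in relevant:
        for b in relevant:
            if a != b:
                updated = pheromone[a, b] + mu * (1.0 - pheromone[a, b]) / length
                pheromone[a, b] = pheromone[b, a] = updated  # keep the matrix symmetric
    return pheromone

pheromone = np.eye(1000)                       # Eq. (1): identity initialization
reinforce_pheromone(pheromone, relevant=[3, 17, 42])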
The pheromone evaporates slowly over time. Therefore, the pheromone matrix is also regularly updated according to Eq. (3):

pheromone(i, j)t = η · pheromone(i, j)t−1, if i ≠ j;   pheromone(i, j)t = pheromone(i, j)t−1, if i = j    (3)
where η (0 < η < 1) denotes the pheromone attenuation factor. In this way, the pheromone matrix is updated iteratively. This keeps the semantic information contained in the matrix close to users' current interest, and it gradually reduces the effect caused by users' erroneous feedback information. After several training iterations, the pheromone matrix will contain the semantic correlations existing in most of the images, which can be fuzzily clustered to construct a "hidden" semantic network describing the semantic categories of the image database. Herewith, we may give the definition of a fuzzy cluster center of an image set as follows.

Definition 1. Image i is assigned as a fuzzy cluster center of an image set if and only if sum_pheromone(i) > sum_pheromone(j) for every image j with pheromone(i, j) > 0, where

sum_pheromone(i) = Σ_{k=1}^{N} pheromone(i, k)    (4)
A cluster center covers a number of different pieces of semantic information. According to the above definition, images with the same pheromone value with respect to the cluster center certainly share one or more semantic meanings with the cluster center as well [7]. Finally, as the users' relevance feedback is accumulated iteration by iteration, the generated pheromone matrix records a constructed semantic network that reveals the semantic correlations in the entire image database, based on which semantic-based image retrieval can be conducted.
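The sketch below illustrates, under our own naming assumptions, the periodic evaporation of Eq. (3) and the selection of fuzzy cluster centers according to Definition 1 and Eq. (4).

import numpy as np

def evaporate(pheromone, eta=0.1):
    # Eq. (3): decay the off-diagonal pheromone, leave the diagonal untouched.
    decayed = eta * pheromone
    np.fill_diagonal(decayed, np.diag(pheromone))
    return decayed

def fuzzy_cluster_centers(pheromone):
    # Definition 1: image i is a center if sum_pheromone(i) exceeds that of every connected image j.
    totals = pheromone.sum(axis=1)                     # sum_pheromone(i), Eq. (4)
    centers = []
    for i in range(pheromone.shape[0]):
        neighbors = np.flatnonzero(pheromone[i] > 0)
        neighbors = neighbors[neighbors != i]
        if neighbors.size == 0 or np.all(totals[i] > totals[neighbors]):
            centers.append(i)
    return centers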
2.2 Feature Element Re-weighting
In this section, we describe a feature element re-weighting approach based on the generated pheromone matrix mentioned above, aiming to combine it with the low level feature information. Although the low-level features, such as color, texture and shape, are not able to represent semantic information exactly, they can still reflect the semantic similarities among images in some aspects. However, it is difficult to directly adopt those low level features in the retrieval process, as not every element in the feature vector is representative and discriminative. Usually, only a part of elements in those feature vectors contributes to their semantic similarities, to which we should assign higher weights while neglecting other elements. For instance, there are two low-level feature vectors {a, b, c} and {d, e, f}. For one category of images, only a, b and e are similar in their values but c, d and f are not. Consequently, if we use either of the above two vectors, the retrieval results may not be accurate. Since only a, b, and e are representative for this category, they should be assigned higher weights while c, d and f should be neglected.
To achieve this goal, we define a "global vector" and thus design a novel feature element re-weighting strategy to determine how the low-level features are used in the retrieval process. Each feature of an image is represented in the first stage by a vector. We then further combine all those feature vectors to form a new one-dimensional vector, which is defined as the global vector. Therefore, each image can be represented by only one global vector. For instance, if there are two vectors {a, b} and {c, d}, the global vector is formed as {a, b, c, d}. It can be seen that the re-weighting strategy in this paper is on an element basis, rather than on a feature basis. Afterwards, the feature element re-weighting strategy is then implemented. First, we define the "positive" image set AISt in the tth feedback iteration as

AISt = { a | a is an image labeled by the user as "relevant" }    (5)

Obviously, the images in AISt are those which are the closest to the user's query. Then, the semantic relevant image set SRISt in the tth feedback iteration is defined as:

SRISt = { a | ∏_{b∈AISt} pheromone(a, b) > 0 }    (6)
It indicates that the image with non-zero pheromone value in the pheromone matrix among all images in AISt belongs to SRISt. Moreover, this result is generated by combining the choice made by the current user and the relevance feedback from other users recorded in pheromone matrix. By using the feature element re-weighting strategy defined in Eq. (7), the weight of each feature element is dynamically updated to accommodate its relative importance in the retrieval process. In our re-weighting strategy, the weight pwt(i) for the ith feature element in the tth iteration is computed as:
pwt(i) = 1 / D(fi)    (7)
where fi is the ith element of global vector f. D(fi) is the variance of fi for the images in AISt and SRISt, which denotes the importance of the ith element of global vector in AISt and SRISt. Afterwards, the pheromone weights can be normalized. If pwt(i) does not rank in top 10, let pwt(i)=0. The normalized feature element weight is then defined as:
pwt′(i) = pwt(i) / Σ_{i=1}^{M} pwt(i)    (8)
where M is the total number of low-level feature elements. Obviously, there are many irrelevant feature elements if no feature selection process is applied, which definitely increases the computational burden and has negative effects on the retrieval result. By adopting the proposed feature element re-weighting strategy, the retrieval algorithm not only re-weights feature elements, but also plays an important role in feature element selection. Thus, the computational complexity is greatly decreased.
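As an illustration of Eqs. (7) and (8), the sketch below computes the element weights from the variance over the images in AISt and SRISt, keeps only the ten highest-ranked elements, and normalizes them; the function and variable names are assumptions made for the example.

import numpy as np

def feature_element_weights(global_vectors, top_k=10):
    # global_vectors: one row per image in AISt and SRISt.
    variance = np.var(global_vectors, axis=0)      # D(f_i) for every element of the global vector
    weights = 1.0 / (variance + 1e-12)             # Eq. (7); the small epsilon avoids division by zero
    mask = np.zeros_like(weights)
    keep = np.argsort(weights)[-top_k:]            # elements outside the top 10 are set to zero
    mask[keep] = weights[keep]
    return mask / mask.sum()                       # Eq. (8): normalized weights pwt'(i)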
We can then set the query feature vector qt as:

qt = (1/n) Σ_{i=1}^{n} ai    (9)

where ai is the global vector of the ith image in AISt ∪ SRISt and n is the size of AISt ∪ SRISt; qt is the average of the global vectors of the images belonging to AISt ∪ SRISt in the tth iteration. This method was theoretically proven in [12] and used in [2] and [13], respectively. Because AISt and SRISt are dynamically updated, qt is generated automatically during each iteration, and will move closer and closer to the user's goal as the feedback iterations increase. According to this query feature vector qt, we can further define the "Feature Similar Image Set" in Def. 2.

Definition 2. FSISt is the Feature Similar Image Set in the tth iteration. The images in FSISt are ranked by the Pheromone Weighted Euclidean Distance defined as:

PWED(qt) = Σ_{i=1}^{M} (qti − ai)² pwt′(i)    (10)
where M is the total number of feature elements, qti is the ith element in the query feature vector qt, and ai is the ith feature element of image a, which is in the image database but not in AISt or SRISt. In the above re-weighting strategy, the query vector is accommodated on a feature element basis, based on the "hidden" semantic network constructed by the Ant Colony algorithm and the user's current feedback information, which makes it more precise than similar approaches.
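A compact sketch of the query vector of Eq. (9) and the ranking by the Pheromone Weighted Euclidean Distance of Eq. (10) follows; the array names and shapes are illustrative assumptions.

import numpy as np

def rank_by_pwed(database_vectors, positive_vectors, weights):
    # positive_vectors: global vectors of the images in AISt and SRISt; database_vectors: the candidates.
    q = positive_vectors.mean(axis=0)                                    # query vector qt, Eq. (9)
    distances = (((database_vectors - q) ** 2) * weights).sum(axis=1)    # PWED, Eq. (10)
    return np.argsort(distances)                                         # most similar images first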
2.3 Architecture of the Relevance Feedback Framework
Based on the above methods, we describe the architecture of our relevance feedback framework, as shown in Table 1. When a user starts the image retrieval process, the retrieval system first returns each fuzzy cluster center of all the images to the user. The user then selects the images considered to be "satisfactory" or "relevant" to the query request, which are handled as the user's current feedback information to form the positive image set AISt, and further to generate the semantic relevant image set SRISt through the pheromone matrix. Then, the query feature vector qt is formed by using Eq. (9). Afterwards, the feature similar image set FSISt is obtained according to Def. 2. The images in AISt, SRISt and FSISt are then returned to the user as the retrieval results, in order. If the user is satisfied with the result, the most "satisfactory" or "relevant" images are selected and the retrieval system updates the pheromone matrix accordingly by using Eq. (2). Otherwise, the retrieval process is performed iteratively, guided by the user's current feedback and the pheromone matrix.
Table 1. Architecture of the Framework for Relevance Feedback

Step 1. Display each fuzzy cluster center of the images to the user.
Step 2. The user selects images which are considered to be "satisfactory" or "relevant" to the query.
Step 3. Accommodate the user's current feedback information to form the positive image set AISt.
Step 4. Generate the semantic relevant image set SRISt based on AISt and the pheromone matrix.
Step 5. Form the query feature vector qt by using Eq. (9).
Step 6. Calculate pwt′(i) according to Eq. (7) and Eq. (8).
Step 7. Generate the feature similar image set FSISt. The images in AISt, SRISt and FSISt are then returned to the user as retrieval results, in order.
Step 8. If the user is satisfied with the result, go to Step 9; else go to Step 2.
Step 9. The user selects the most "satisfactory" or "relevant" images, and the retrieval system updates the pheromone matrix accordingly by using Eq. (2).

3 Experimental Results
In this section, experimental results are illustrated and analyzed to demonstrate the effectiveness of the proposed framework.

3.1 Experiment Setup
In the experiments, 1000 Corel images were adopted, which are widely used in CBIR research. They cover a variety of topics, such as "mountain", "beach", "elephant", etc. A matrix with 1,000×1,000 dimensions was used to form the semantic network. To build the semantic network, we invited 10 human users to train this retrieval system. Each user was required to use the retrieval system 5 times with 5 different query requests. Experimental results were then drawn on the basis of the trained semantic network. As for the low-level image features, the color histogram and HSV color features are adopted. Moreover, the pheromone growth factor μ was set to 0.5, and the pheromone attenuation factor η was set to 0.1. The classical Precision-Recall benchmark was used as the performance evaluation metric, and the top 15 returns were taken as the retrieval result.

3.2 Results and Analysis
A complete image retrieval process based on our framework is shown in Fig. 1. In this example, the query objective is “mountain”, and four relevance feedback iterations are conducted. As can be seen in Fig. 1-(a), the original fuzzy cluster centers of the images are firstly provided to the user. The user then chooses the ninth image as “mountain”, and retrieves the image set again. In this iteration, this cluster center is submitted as query image, and the results are shown in Fig. 1-(b). At this moment, the first iteration is completed. It can be seen that there are four relevant images in the results. The user chooses all four images, and retrieves again. The system calculates the global vector of
those four images, which is considered as the second iteration. The results are presented in Fig. 1-(c). Similarly, the results of the third and fourth iterations are shown in Fig. 1-(d) and Fig. 1-(e), respectively. The result shown in Fig. 1-(e) finally meets the user’s query demand and the user completes the retrieval process by selecting the most “satisfactory”, or “relevant” images. The system then updates the pheromone matrix accordingly. Fig. 2 shows the precision curves for different retrieval tasks including “mountain”, “beach” and “elephant”. As we can see from the experimental results, the precision of our framework increases rapidly in the first 2 iterations. And it achieves a higher value after 3 or 4 iterations. All in all, as the number of feedback iteration increases, our framework performs better and better, for the constructed semantic network is able to help users be closer to their query request gradually. To demonstrate the effectiveness of the proposed framework, we also compared its performance with classical methods based only on low level features. Fig. 3 shows the performance comparisons between the proposed framework (Ant Colony Algorithm, ACA) and Color Histogram based method (CH), and HSV color based method (HSV). Three query, “mountain”, “beach” and “elephant” are independently conducted on the same image set. The precision of the proposed framework is calculated as the retrieval precision after four feedback iterations. As can be seen from Fig. 3, with the help of pheromone matrix, which accommodates users’ relevance feedback preferably, our framework improves the image retrieval performance greatly, and is much better than low level feature based methods.
Fig. 1. Image retrieval examples for “mountain”. (a) the original fuzzy cluster centers, (b) the results of the first iteration, (c) the results of the second iteration, (d) the results of the third iteration, (e) the results of the fourth iteration.
Fig. 2. Performance evaluation. (a) the Precisions for “mountain”; (b) the Precisions for “beach”; (c) the Precisions for “elephant”.
Fig. 3. The histogram of our Ant Colony algorithm-based image retrieval framework (ACA) compared with Color Histogram based method (CH) and HSV color space based method (HSV) with three independent queries: “mountain”, “beach” and “elephant”.
4 Conclusions
This paper proposes a novel relevance feedback framework for image retrieval based on Ant Colony algorithm, by accumulating users’ feedback to construct a semantic network aiming at achieving “memory learning” in image retrieval process. The proposed relevance feedback framework adopts both the generated semantic network and the extracted image features, and re-weights them in similarity calculation to obtain more accurate retrieval results. The irrelevant feature elements are discarded to avoid the disturbance to the retrieval process and the computational complexity can also be greatly reduced. Experimental results are illustrated to demonstrate the efficiency and effectiveness of the proposed framework. However, currently the framework needs a lot of training to make it stable. In the future, we will further improve this framework on reducing the training requirements.
Acknowledgements. We would like to acknowledge the supports from the National Science Foundation of China (Grant Nos. 60875011, 60975043, 61035003, 60723003, 61021062), the National 973 Program of China (Grant No. 2010CB327903), the Key Program of National Science Foundation of Jiangsu, China (Grant No. BK2010054), and the International Cooperation Program of Ministry of Science and Technology, China (Grant No. 2010DFA11030).
References
1. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR) 40(2), 1–60 (2008)
2. Grigorova, A., De Natale, F.G.B., Dagli, C., Huang, T.S.: Content-Based Image Retrieval by Feature Adaptation and Relevance Feedback. IEEE Transactions on Multimedia 9(6), 1183–1191 (2007)
3. Wu, Y., Zhang, A.: A feature re-weighting approach for relevance feedback in image retrieval. In: Proc. IEEE Int. Conf. Image Processing 2002, vol. II, pp. 581–584 (2002)
4. Li, M., Chen, Z., Zhang, H.: Statistical correlation analysis in image retrieval. Pattern Recognition 35, 2687–2693 (2002)
5. Han, J., Ngan, K.N., Li, M., Zhang, H.-J.: A Memory Learning Framework for Effective Image Retrieval. IEEE Trans. on Image Processing 14(4), 511–524 (2005)
6. Shyu, M., Chen, S., Chen, M., Zhang, H., Shu, C.: Probabilistic semantic network-based image retrieval using MMM and relevance feedback. Springer Journal of Multimedia Tools and Applications 13(2), 50–59 (2006)
7. Chen, G., Yang, Y.: Memory-type Image Retrieval Method Based on Ant Colony Algorithm. Journal of Frontiers of Computer Science and Technology 5(1), 32–37 (2011) (in Chinese)
8. Colorni, A., Dorigo, M., Maniezzo, V., et al.: Distributed optimization by ant colonies. In: Proceedings of the 1st European Conference on Artificial Life, pp. 134–142 (1991)
9. Dorigo, M.: Optimization, learning and natural algorithms. Ph.D. Thesis, Department of Electronics, Politecnico di Milano, Italy (1992)
10. Dorigo, M., Maniezzo, V., Colorni, A.: Ant System: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics – Part B 26(1), 29–41 (1996)
11. Duan, H.: Ant Colony Algorithms: Theory and Applications. Science Press, Beijing (2005)
12. Ishikawa, Y., Subramanya, R., Faloutsos, C.: MindReader: Querying databases through multiple examples. In: Proc. 24th Int. Conf. Very Large Databases, pp. 218–227 (1998)
13. Rui, Y., Huang, T.S.: Optimizing learning in image retrieval. In: Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 236–243 (2000)
A Closed Form Algorithm for Superresolution Marcelo O. Camponez, Evandro O.T. Salles, and Mário Sarcinelli-Filho Graduate Program on Electrical Engineering, Federal University of Espirito Santo Av. Fernando Ferrari, 514, 29.075-910, Vitória, ES, Brazil [email protected] http://www.ele.ufes.br
Abstract. Superresolution is a term used to describe the generation of high-resolution images from a sequence of low-resolution images. In this paper an algorithm proposed in 2010, which obtains superresolution images through Bayesian approximate inference using a Markov chain Monte Carlo (MCMC) method, is revised. From the original equations, a closed form to calculate the high-resolution image is derived, and a new algorithm is thus proposed. Several simulations, from which two results are here presented, show that the proposed algorithm performs better in comparison with other superresolution algorithms.
1 Introduction

The objective of superresolution (SR) is to merge a sequence of low-resolution (LR) images, which represent the same scene, into a single high-resolution (HR) image. The motivation to study superresolution is that, for many applications demanding high-resolution images, like remote sensing, surveillance, medical imaging and the extraction of still images from a video, increasing the resolution through improved image acquisition sensors is not feasible because of the additional cost. Thus, image processing techniques to improve the image resolution play an important role in many applications. Superresolution has been a very active area of research since Tsai and Huang [1] published a frequency domain approach. Frequency domain methods are based on three fundamental principles: i) the shifting property of the Fourier transform (FT); ii) the aliasing relationship between the continuous Fourier Transform (CFT) and the Discrete Fourier Transform (DFT); and iii) the assumption that the original scene is band-limited. These properties allow the formulation of a system of equations relating the aliased DFT coefficients of the observed images to samples of the CFT of the unknown scene. These equations are solved yielding the frequency domain coefficients of the original scene, which may then be recovered by inverse DFT. Since then, several extensions to the basic Tsai-Huang method have been proposed [2], [3], [4] and [5]. Some limitations of the frequency domain methods, such as the limited ability to include a priori knowledge for regularization [6], caused the gradual replacement of such methods by spatial domain approaches. In spatial domain SR reconstruction methods the observation model is formulated and reconstruction is effected in the spatial domain. Several algorithms have been proposed, such as Interpolation of Non-Uniformly Spaced Samples [7], Iterated Backprojection [8], and projection onto convex sets (POCS) [9], [10], [11], for instance.
SR reconstruction is an example of an ill-posed inverse problem, since multiple solutions exists for a given set of observation images. Because of this, TikhonovArsenin regularized SR reconstruction methods have been examined [12]. The regularizing functionals characteristic of this approach are typically special cases of Markov random field priors in the Bayesian framework. Stochastic Bayesian methods which treat SR reconstruction as a statistical estimation problem have rapidly gained prominence since they provide a powerful theoretical framework for the inclusion of a-priori constraints necessary for a satisfactory solution of the ill-posed SR inverse problem. These methods, in general, maximize the posterior probability distribution (Maximum A-Posteriori - MAP) [13], [14], [15], [16]. Recently, many studies have proposed methods based on Bayesian framework and approaches, such as Evidence approximation, Laplace approximation, Variational Bayes using Expectation Maximization, Expectation Propagation, MCMC – Markov chain Monte Carlo. In [17] Tipping and Bishop used Evidence approximation to resolve the problem of SR. In this approach, they found an expression for the marginal posterior probability distribution function - pdf conditioned to data and hyperparameters. They found a closed expression for the HR image, but, instead of using it they used an optimization scaled conjugate gradient algorithm to find the hyperparameters and the HR image. In 2010, Jing Tian and Kai-Kuang Ma [18] developed an algorithm based on MCMC to solve the SR problem. Unlike [17] they proposed a hierarchical Bayesian modeling and the image pdf prior is a GMRF - Gaussian Markov Random Field. In this article, from the model developed in [18] a closed form for solving the SR problem is derived and explored in a new algorithm. Various experiments have shown that this algorithm has superior performance, in comparison with those in [18] [19] and [20]. To discuss such proposal, the paper has three more sections. In Section 2 the Bayesian inference formulation for the SR process is mathematically derived, and a closed form for merging the HR image is developed. In turn, Section 3 presents some simulations and their results, and finally, Section 4 highlights some conclusions.
2 The New Approach

The use of Bayesian inference has increased as a tool to solve the superresolution problem. Several methods based on such a framework have been recently proposed, like the ones in [17] and [18]. However, to the best of the authors' knowledge, all the available approaches are iterative ones. Thus, the main contribution of this paper is the proposal of a closed-form approach based on Bayesian inference. In the next two subsections the observation model and the hierarchical Bayesian inference model are described, which are the same adopted in other proposals. The difference of our proposal is presented in Subsection 2.3, where a closed-form solution is derived.

2.1 Observation Model

The observation model describes the changes that occur in the original images during the acquisition process, and its observed data are the low-resolution images. The model presented here follows the same notation as [18], and is described by
Yk = Hk X + εk    (1)
where Yk and X represent the k-th L1 × L2 low-resolution image and the M1 × M2 high-resolution image, respectively; i.e., both are represented in the lexicographic-ordered vector form, with a size of L1L2 × 1 and M1M2 × 1, respectively. Hk is an L1L2 × M1M2 matrix, representing the above-mentioned warping (i.e., shift and rotation), convolving and downsampling operations, and εk is an L1L2 × 1 vector, representing the additive white Gaussian noise, with zero mean and variance σk². The goal of the superresolution algorithm is: based on the knowledge of the low-resolution images Y = {y1, y2, ..., yp}, to retrieve the high-resolution image X.

2.2 Joint Posterior Probability Density Function

In this section the mathematical model that describes the pdf of the high-resolution image X and the hyperparameter λ, conditioned on a set of low-resolution images Y, is derived by applying the Bayes rule, as follows:

p(X, λ, Y) = p(Y | X, λ) p(X, λ) = p(Y | X, λ) p(X | λ) p(λ),    (2)

resulting in

p(X, λ | Y) ∝ p(Y | X, λ) p(X | λ) p(λ),    (3)

where p(Y | X, λ) is the conditional pdf of the data given the HR image and the hyperparameter λ, p(X | λ) is the prior pdf of the HR image given the hyperparameter λ, and p(λ) is the hyperparameter pdf. Assuming, in the first term of (3), that the low-resolution images are independently obtained from the original (high-resolution) image and that Y does not depend on λ, the conditional pdf p(Y | X, λ) can be expressed as

p(Y | X) = ∏_{k=1}^{p} p(yk | X),    (4)

where

p(yk | X) ∝ exp( −(1/(2σk²)) ||yk − Hk X||² ),    (5)

resulting in

p(Y | X) ∝ exp( −Σ_{k=1}^{p} (1/(2σk²)) ||yk − Hk X||² ).    (6)

The second term of equation (3) is, in general, a locally smooth field. The Gaussian Markov random field (GMRF) [21] is considered as a reasonable approximation of the prior image model in this paper, which bears the mathematical form [22]

p(X | λ) = (1/(2π))^{n/2} |λQ|^{1/2} exp( −(1/2) λ Xᵀ Q X ),    (7)

where Q is an M1M2 × M1M2 matrix whose entries, considering a 4-neighborhood, as in [22], are given by:

Q_ij = 4, if i = j;  −1, if i and j are adjacent in the 4-neighborhood;  0, otherwise.    (8)

The last term in equation (3), the hyperparameter pdf, has been defined as a uniform distribution, which has been proved to be a reasonable assumption for the image reconstruction problem [23], [24]. This means that

p(λ) = 1 / (λmax − λmin).    (9)

Finally, introducing (6), (7) and (9) in (3) one gets

p(X, λ | Y) ∝ (1/(λmax − λmin)) (1/(2π))^{n/2} |λQ|^{1/2} exp( −(1/2) λ Xᵀ Q X − Σ_{k=1}^{p} (1/(2σk²)) ||yk − Hk X||² ).    (10)

2.3 Derivation of SR Closed Form

In this section, from equation (10), a closed form is derived for calculating the HR image, as follows. The starting point is

p(X, λ | Y) ∝ |λQ|^{1/2} exp( −(1/2) λ Xᵀ Q X − Σ_{k=1}^{p} (1/(2σk²)) ||yk − Hk X||² ),    (11)

where

||yk − Hk X||² = (yk − Hk X)ᵀ (yk − Hk X),    (12)

or

||yk − Hk X||² = ykᵀ yk − 2 ykᵀ Hk X + Xᵀ Hkᵀ Hk X.    (13)

Considering the variables

mk = Hkᵀ Hk,    (14)
bk = −2 Hkᵀ yk,    (15)
ck = ykᵀ yk,    (16)

one gets

||yk − Hk X||² = Xᵀ mk X + bkᵀ X + ck.    (17)

From such a result,

Σ_{k=1}^{p} ||yk − Hk X||² = Σ_{k=1}^{p} Xᵀ mk X + Σ_{k=1}^{p} bkᵀ X + Σ_{k=1}^{p} ck,    (18)

whose terms are developed as

Σ_{k=1}^{p} Xᵀ mk X = Xᵀ m1 X + Xᵀ m2 X + ... + Xᵀ mp X = Xᵀ (m1 + m2 + ... + mp) X,    (19)
Σ_{k=1}^{p} Xᵀ mk X = Xᵀ Μ X,    (20)
Μ = Σ_{k=1}^{p} mk,    (21)
Σ_{k=1}^{p} bkᵀ X = b1ᵀ X + b2ᵀ X + ... + bpᵀ X = (b1 + b2 + ... + bp)ᵀ X = Βᵀ X,    (22)
Β = Σ_{k=1}^{p} bk,    (23)
C = Σ_{k=1}^{p} ck.    (24)

Now, introducing (21), (23) and (24) in (18), and supposing that all low-resolution images have the same variance, one gets

Σ_{k=1}^{p} (1/(2σk²)) ||yk − Hk X||² = (1/(2σ²)) (Xᵀ Μ X + Βᵀ X + C),    (25)

and, introducing (25) in (11),

p(X, λ | Y) ∝ |λQ|^{1/2} exp{ −( (1/2) Xᵀ (λQ + Μ/σ²) X + (Βᵀ/(2σ²)) X + C/(2σ²) ) }.    (26)

Replacing variables as follows

Bᵀ = Βᵀ / (2σ²),    (27)
Κ = C / (2σ²),    (28)
A = (λQ + Μ/σ²),    (29)

one gets

p(X, λ | Y) ∝ |λQ|^{1/2} exp{ −( (1/2) Xᵀ A X + Bᵀ X + Κ ) },    (30)

which is a well-known equation in the literature. From [24] the identity

(1/2) Xᵀ A X + Bᵀ X + Κ = (1/2) (X + A⁻¹B)ᵀ A (X + A⁻¹B) + Κ − (1/2) Bᵀ A⁻¹ B    (31)

is valid, and thus

p(X, λ | Y) ∝ |λQ|^{1/2} exp{ −(1/2) (X + A⁻¹B)ᵀ A (X + A⁻¹B) − Κ + (1/2) Bᵀ A⁻¹ B }.    (32)

From (32) one can notice that if λ is given, p(X | Y, λ) is a Gaussian function. Thus,

X̄ = −A⁻¹ B,    (33)

which, associated to

X̄ = ( λQ + (H1ᵀ H1 + H2ᵀ H2 + ... + Hpᵀ Hp)/σ² )⁻¹ ( (H1ᵀ y1 + H2ᵀ y2 + ... + Hpᵀ yp)/σ² ),    (34)

corresponds to a closed form for calculating the HR image.
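The closed form in (34) can be evaluated directly with sparse linear algebra. The sketch below builds the 4-neighborhood matrix Q of (8), accumulates Σ Hkᵀ Hk and Σ Hkᵀ yk, and solves the resulting symmetric positive-definite system with a sparse factorization; the function names and the way the Hk operators are supplied are assumptions made for the example, not the authors' MATLAB implementation.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def gmrf_q(M1, M2):
    # Eq. (8): 4 on the diagonal, -1 for horizontal/vertical neighbors (row-major pixel order).
    n = M1 * M2
    main = 4.0 * np.ones(n)
    horiz = -np.ones(n - 1)
    horiz[np.arange(1, n) % M2 == 0] = 0.0        # no horizontal link across row boundaries
    vert = -np.ones(n - M2)
    return sp.diags([main, horiz, horiz, vert, vert], [0, 1, -1, M2, -M2], format="csc")

def closed_form_sr(H_list, y_list, lam, sigma2, M1, M2):
    # Eq. (34): x = (lam*Q + sum(Hk^T Hk)/sigma2)^(-1) (sum(Hk^T yk)/sigma2)
    Q = gmrf_q(M1, M2)
    HtH = sum(H.T @ H for H in H_list)
    Hty = sum(H.T @ y for H, y in zip(H_list, y_list))
    A = (lam * Q + HtH / sigma2).tocsc()
    rhs = np.asarray(Hty).ravel() / sigma2
    return splu(A).solve(rhs)                     # sparse factorization of the SPD system

With sparse Hk (warp, blur and decimation), A stays sparse and a single factorization is what makes the direct, non-iterative solution practical.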
3 Simulation Experiments

3.1 Generating the Low Resolution Images
The aim of this section is to compare the performance of the proposed closed form algorithm with the performance of other algorithms, including the one in [18]. For doing that, the HR images, the procedure for the generation of LR images and the performance index (PSNR) here adopted are the same adopted in [18]. Thus, a 256 × 256 Boat and a 200 × 200 Text images are used as HR test images, and a set of sixteen LR images is generated from each one of them, as explained in the sequel. From such LR images, four experiments are run for the Boat image, as well as for the Text image, each one considering four LR images. To generate the LR images, a shift operation is firstly applied to each original image, with the shift amount randomly drawn from a continuous uniform distribution
over the interval (-2, 2) in pixels, in both directions, independently chosen. Each resulting image is then multiplied by the D matrix that represents the degradations incurred in the acquisition process, and after that a decimation factor of two in both the horizontal and vertical directions is applied. Finally, a zero-mean white Gaussian noise with a standard deviation of 8 is added to each processed image to yield a noisy low-resolution image. The above-mentioned steps are independently carried out sixteen times, to generate sixteen low-resolution images from each test image.
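A sketch of the low-resolution image generation protocol just described (random sub-pixel shift in (-2, 2), degradation, decimation by two, and additive Gaussian noise with standard deviation 8) is given below; scipy.ndimage and the Gaussian blur standing in for the degradation matrix D are assumptions made for the example.

import numpy as np
from scipy.ndimage import shift as subpixel_shift, gaussian_filter

def make_lr_images(hr, count=16, blur_sigma=1.0, noise_std=8.0, seed=0):
    rng = np.random.default_rng(seed)
    lr_images = []
    for _ in range(count):
        dy, dx = rng.uniform(-2.0, 2.0, size=2)              # random shift in both directions
        shifted = subpixel_shift(hr, (dy, dx), order=3, mode="nearest")
        blurred = gaussian_filter(shifted, blur_sigma)        # stand-in for the D matrix
        decimated = blurred[::2, ::2]                         # decimation factor of two
        lr_images.append(decimated + rng.normal(0.0, noise_std, decimated.shape))
    return lr_images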
3.2 Experimental Results

The amount of shift, as well as the point spread function, adopted for generating the simulated low-resolution images are assumed to be known in advance or to be accurately estimated (see, for instance, [19] and [25]). The HR images are generated using (34), with the hyperparameter λ being adjusted, after various simulations, to λ = 0.001 (Boat image) and λ = 0.0004 (Text image). Such an algorithm has been programmed in MATLAB©, and the matrix inversion included in (34) is performed using the Cholesky decomposition [22], which makes such an inversion faster to compute. The proposed SR image reconstruction approach is compared with bi-cubic interpolation and other SR approaches developed in [18], [19] and [20], with the same parameter setting. All these approaches were implemented, except the one in [18], because the input data and the results were already available from the paper, since the case studies use the same images and performance metric (PSNR). The results corresponding to the four methods are presented in Table 1 and Fig. 1, and show that the approach here proposed yields the best values for the PSNR metric, thus meaning that our method outperforms all the others.

Table 1. Evaluation of a reconstructed high-resolution image considering PSNR (dB)

Test Image  Run      Bi-cubic spline  Vandewalle    Pham et al.  MCMC           Proposed SR
                     interpolation    et al. [19]   [20]         approach [18]  approach
Boat        1        20,29            24,88         27,20        –              30,18
Boat        2        22,77            24,73         26,98        –              30,13
Boat        3        25,05            24,73         27,15        –              30,21
Boat        4        20,77            24,97         26,79        –              30,22
Boat        Average  22,22            24,83         27,03        28,02          30,19
Text        1        13,15            16,61         17,93        –              22,08
Text        2        14,52            16,41         17,51        –              22,05
Text        3        16,43            16,32         17,57        –              21,89
Text        4        12,81            16,34         17,51        –              22,16
Text        Average  14,23            16,42         17,63        20,17          22,05
345
4 Concluding Remarks In this paper, a new approach to the problem of superresolution is proposed. Starting from a Hierarchical Bayesian model, where the prior is a GMRF - Gaussian Markov Randon Field, a closed form HR image fusion was derived. Various experiments are presented, showing that the proposed algorithm outperforms other state-of-the-art methods. As for its implementation, the algorithm was here programmed using MATLAB® and optimized with the use of sparse matrices. As a conclusion, the results so far obtained show that it is important to automate the choice of the hyperparameters λ , given the LR images, which is currently under development.
(a )
Fig. 1. Two sets of reconstructed images using two test images, Boat and Text: (a) original image (ground truth); (b) simulated quarter-sized low-resolution image; (c) image generated applying a bi-cubic spline interpolation approach; (d) image generated by applying Vandewalle et al.'s approach [19]; (e) image generated by applying Pham et al.'s approach [20]; and (f) image generated by applying the SR approach proposed here.
M.O. Camponez, E.O.T. Salles, and M. Sarcinelli-Filho
References 1. Tsai, R.Y., Huang, T.S.: Multiframe image restoration and registration. In: Tsai, R.Y., Huang, T.S. (eds.) Advances in Computer Vision and Image Processing, vol. 1, pp. 317– 339. JAI Press Inc., Greenwich (1984) 2. Tekalp, A.M., Ozkan, M.K., Sezan, M.I.: High-resolution image reconstruction from lower-resolution image sequences and space-varying image restoration. In: ICASSP, San Francisco, vol. III, pp. 169–172 (1992) 3. Kim, S.P., Bose, N.K., Valenzuela, H.M.: Recursive reconstruction of high resolution image from noisy undersampled multiframes. IEEE Trans. ASSP 38(6), 1013–1027 (1990) 4. Kim, S.P., Su, W.-Y.: Recursive high-resolution reconstruction of blurred multiframe images. IEEE Trans. IP 2, 534–539 (1993) 5. Bose, N.K., Kim, H.C., Valenzuela, H.M.: Recursive Total Least Squares Algorithm for Image Reconstruction from Noisy, Undersampled Multiframes. Multidimensional Systems and Signal Processing 4(3), 253–268 (1993) 6. Borman, S., Stevenson, R.L.: Super-Resolution from Image Sequences - A Review. In: Midwest Symposium on Circuits and Systems (1998) 7. Komatsu, T., Igarashi, T., Aizawa, K., Saito, T.: Very high resolution imaging scheme with multiple different aperture cameras. Signal Processing Image Communication 5, 511–526 (1993) 8. Irani, M., Peleg, S.: Motion analysis for image enhancement: Resolution, occlusion and transparency. Journal of Visual Communications and Image Representation 4, 324–335 (1993) 9. Patti, A.J., Sezan, M.I., Tekalp, A.M.: Superresolution Video Reconstruction with Arbitrary Sampling Lattices and Nonzero Aperture Time. IEEE Trans. IP 6(8), 1064–1076 (1997) 10. Tom, B.C., Katsaggelos, A.K.: An Iterative Algorithm for Improving the Resolution of Video Sequences. In: SPIE VCIP, Orlando, vol. 2727, pp. 1430–1438 ( March 1996) 11. Eren, P.E., Sezan, M.I., Tekalp, A.: Robust, Object-Based High-Resolution Image Reconstruction from Low-Resolution Video. IEEE Trans. IP 6(10), 1446–1451 (1997) 12. Hong, M.-C., Kang, M.G., Katsaggelos, A.K.: A regularized multichannel restoration approach for globally optimal high resolution video sequence. In: SPIE VCIP, San Jose, vol. 3024, pp. 1306–1316 (February 1997) 13. Schultz, R.R., Stevenson, R.L.: Extraction of high-resolution frames from video sequences. IEEE Trans. IP 5(6), 996–1011 (1996) 14. Cheeseman, P., Kanefsky, B., Kraft, R., Stutz, J., Hanson, R.: Super-resolved surface reconstruction from multiple images. In: Maximum Entropy and Bayesian Methods, pp. 293–308. Kluwer, Santa Barbara (1996) 15. Hardie, R.C., Barnard, K.J., Armstrong, E.E.: Joint MAP Registration and HighResolution Image Estimation Using a Sequence of Undersampled Images. IEEE Trans. IP 6(12), 1621–1633 (1997) 16. Tom, B.C., Katsaggelos, A.K.: Reconstruction of a high resolution image from multiple degraded mis-registered low resolution images. In: SPIE VCIP, Chicago, vol. 2308, pp. 971–981 (September 1994) 17. Tipping, M.E., Bishop, C.M.: Bayesian image super-resolution. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Nueral Information Processing Systems, vol. 15. MIT Press, Cambridge (2003) 18. Tian a, J., Ma, K.-K.: Stochastic super-resolution image reconstruction. J. Vis. Commun. Image R, R 21, 232–244 (2010)
A Closed Form Algorithm for Superresolution
347
19. Vandewalle, P., Susstrunk, S., Vetterli, M.: A frequency domain approach to registration of aliased images with application to super-resolution. EURASIP Journal on Applied Signal Processing (2006) 20. Pham, T.Q., van Vliet, L.J., Schutte, K.: Robust fusion of irregularly sampled data using adaptive normalized convolution. EURASIP Journal on Applied Signal Processing (2006) 21. Li, S.Z.: Markov Random Field Modeling in Computer Vision. Springer, New York (1995) 22. Rue, H.: Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall, Boca Raton (2005) 23. Galatsanos, N.P., Mesarovic, V.Z., Molina, R., Katsaggelos, A.K.: Hierarchical Bayesian image restoration from partially known blurs. IEEE Transactions on Image Processing 9, 1784–1797 (2000) 24. Figueiredo, M., Nowak, R.: Wavelet-based image estimation: an empirical Bayes approach using Jeffreys’ noninformative prior. IEEE Transactions on Image Processing 10, 1322–1331 (2001) 25. Bishop Christopher, M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006) 26. He, Y., Yap, K.-H., Chen, L., Chau, L.-P.: A soft MAP framework for blind superresolution image reconstruction. Image and Vision Computing 27, 364–373 (2009)
A Parallel Hybrid Video Coding Method Based on Noncausal Prediction with Multimode* Cui Wang and Yoshinori Hatori Tokyo Institute of Technology
Abstract. This paper addresses the parallel computing problem of hybrid video coding method. In particular, we proposed a new adaptive hybrid video coding method of I-Frame based on noncausal prediction which has better parallel performance than traditional causal prediction. However, there is an inherent problem of noncausal prediction: the error will be expanding when decoded. In order to solve this problem, feedback quantization has also been applied. Another character of this method is that the transform and scan order can be updated according to the input images and quantized step. The simulation results show that the proposed method is 0.4-5dB superior to H.264 High complexity profile which uses RD technology.
1
Introduction
Currently, the hybrid coding method which combines the predictive coding with the orthogonal transform and the quantization is mainly used in H.26x family of coding standard and others. On the other hand, noncausal image coding model is proposed [1][2]. According to this method, the predictive value of pixel can be obtained by nearest neighbors pixels no matter whether these pixels have been coded or not. But there is also an inherent problem of noncausal prediction, that is, the error will be expanding when decoded due to the decoding process. Consequently, in order to solve this problem, feedback quantization [2] has also been applied in our research. In addition, transform coding technique is also a very important paradigm in many images and video coding standards, such as JPEG [3], MPEG [4], ITU-T [5]. In these standards, the Discrete Cosine Transform (DCT) [6][8] is applied due to its de-correlation and energy compaction properties. In 1980s, more contributions also focused Discrete Wavelet Transform (DWT) [7][8] for its efficiency performance in image coding. A proper transform can de-correlate the input samples to remove the spatial redundancy which exists in the image or video frame. In this paper, we applied three kinds of transform according to the different input images and quant step. The paper is organized as follows. Section 2 introduces the noncausal prediction process in this research, and gives the chart of hybrid coding. Section 3 discusses the different transforms applied in this paper, the multimode coding, and also focuses on the analysis of parallel computing time of proposed method. Section 4 compares proposed method with H.264 High complexity profile and gives the simulation results. Conclusion is given in Section 5. *
This work is supported by Global COE program and KAKENHI (23560436).
G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 348–357, 2011. © Springer-Verlag Berlin Heidelberg 2011
A Parallel Hybrid Video Coding Method Based on Noncausal Prediction with Multimode
2
349
Noncausal Prediction
Currently, almost all video compression schemes are block-based. For example, in H.264, which has been widely used as the video compression standards, there are five types of blocks: 16×16, 16×8, 8×8, 8×4 and 4×4. This research is also block-based, but only one type: 8×8+4. And the prediction process is based on matrix calculations. 2.1
A New Type of Block
Given a block whose boundary conditions were known, coding based on an interpolative model could be realized [2]. As a result, we have to know more pixels’ value than one block’s pixels. For example, in [9], it uses 9×9 block mode to predict an 8×8 block. The block structure is shown in Fig.1 (a). Now consider a simpler mode, as the purpose of using a larger block is to obtain the estimate of edge pixels, if we can keep and update the predicted value automatically, the size can be reduced. As shown in Fig.1 (b), we used 8×8+4 block type to complete interpolative prediction. A, B, C and D represent estimate of a, b, c and d (four corner pixels) based on the values of theirs nearest neighbor pixels. If a is the first pixel of frame, A= a; else according to the position of this block, A is the average value of the nearest four or two neighboring pixels of a (It depends on whether exist the nearest four or two (horizontal or vertical) neighboring pixels of pixel a). Values of B, C and D are obtained by the same way. 9
…
…
…
b
a
…
9
B
A
…
8 C
8
…
…
c
d D
8
(a)
8
(b)
Fig. 1. (a) Conventional block type of noncausal prediction, 9×9 block mode (b) Proposed block type of noncausal prediction, 8×8+4 block mode. A, B, C and D are not real pixels.
2.2
Interpolative Prediction
~
As shown in Fig.1 (b), we call 64 pixels inside the block as x1~x64. First, A D (the predicted values of four corner position pixels) are obtained by the pre-encoder and x1~x64 must wait d seconds as delay until A D have been calculated. Second, predict all the pixels in one block. The detailed prediction of insides pixels is as follows: The pixels in one block, x1~x64 are rearranged in a conventional order as a vector. Then, combined A~D values to this vector, as x = (A, B, C, D, x1, x2 x64)T.
…
350
C. Wang and Y. Hatori
Multiply vector x by predictive matrix C to get the prediction errors. Corresponding prediction error vector is Y = (A, B, C, D, y1, y2 y64)T. In this case, Y can be expressed by Eq.1. (When decoded, we can use C-1)
…
Y = Cx
(1)
Since the dimension of vector x and Y is 68×1, the prediction matrix C should be a 68×68 matrix, and the value of C is given by:
⎛ I ⎜A ⎜ 1 ⎜ C =⎜ ⎜ ⎜ ⎜A ⎝ 2
A3 A5
A4 A5 % % % A5 A4
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ A5 ⎟ A3 ⎟⎠
(2)
All the values of elements in matrix C which are not written above are equal to zero. I is the identity matrix; A1, A2 are the 8×4 matrices; and A3, A4, A5 are the 8×8 matric(5). es. The values of these matrices are given by Eq. (3)
~
⎛1 ⎜ A1 = ⎜ 0# ⎜0 ⎝
0 # 0 1
0 # # 0
0⎞ #⎟ #⎟ 0 ⎟⎠
⎞ ⎛ −1 ⎟ ⎜ 1 1 ⎟ ⎜− 1 − 2 ⎟ ⎜ 2 A3 = ⎜ % % % ⎟ ⎜ 1 1⎟ − 1 − ⎟ ⎜ 2 2⎟ ⎜ 1 ⎠ ⎝
⎛0 ⎜ A2 = ⎜ ## ⎜0 ⎝
1 0 # 0
0⎞ #⎟ 0⎟ 1 ⎟⎠
⎞ ⎛ 1 ⎟ ⎜ 1 1 ⎟ ⎜− 1 − 4 ⎟ ⎜ 4 A4 = ⎜ % % % ⎟ ⎜ 1 1⎟ − 1 − ⎟ ⎜ 4 4 ⎜ 1 ⎟⎠ ⎝
⎞ ⎛ − 1/ 2 ⎟ ⎜ − 1 / 4 ⎟ ⎜ ⎟ ⎜ A5 = % ⎟ ⎜ −1/ 4 ⎟ ⎜ ⎟ ⎜ −1/ 2 ⎠ ⎝ 2.3
0 # # 0
(3)
(4)
(5)
The Feedback Quantization
After predictive coding, the error will be transform coded. Transform will produce as many coefficients as there are pixels in the block. After that, the coefficients are quantized and the quantized values are transmitted. As we explained before, the error will
A Parallel Hybrid Video Coding Method Based on Noncausal Prediction with Multimode
351
expand when using noncausal prediction, therefore, the feedback quantizer module is used in our research, which is shown in Fig.2. The quantizer module is nonlinear quantization (please refer to Appendix).
x
i
yi
Point 1
P
OT
C
Interpolative prediction
Q
z
i
-
Scan order Quantizer
Feedback Quantizer
K−I
+
r
i
Memory
Fig. 2. Block diagram depicting the hybrid coding based on noncausal interpolative prediction including feedback quantization where: C is the predict matrix; OT means Orthogonal transform and P is the scan order matrix. In our model, input block is multiplied by the product of C, OT matrix and P, all matrices are 68 68; K is the feedback quantization matrix, I is identity matrix, 64 64. Q represents the processing of nonlinear quantization (See Appendix) and the memory saving the quantized coefficient. The adaptive coding part is not included. Point 1 will be explained in Section 5.
×
×
1 3 4 63 62 52 51 37
2 5 64 61 53 50 38 36
6 8 60 54 49 39 35 24
7 59 55 48 40 34 25 23
58 56 47 41 33 26 22 15
57 46 42 32 27 21 16 14
45 43 31 28 20 17 13 10
44 30 29 19 18 12 11 9
Fig. 3. In the extension method there are two kinds of parameters, the changeover point c and the cut-off point l. The changeover point c means that once more than c elements have been scanned, the scanning order changes to the reverse order. The cut-off point l means that l pixels are forced to 0 in the quantized output data. With this method, the number of pixels to be coded can be reduced. This is an example (c = 8, l = 32).
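One possible reading of this extension scheme is sketched below: the coefficients are visited in the basic scan order, the scan continues in reverse order from the end once c elements have been taken, and the last l scanned coefficients are forced to zero. The exact semantics of c and l in the authors' codec may differ, so treat this purely as an illustration.

import numpy as np

def extended_scan(coeffs, scan_order, c, l):
    """One possible reading of the extension method of Fig. 3.
    coeffs     : 64 quantized coefficients in block order
    scan_order : permutation giving the basic scan (e.g. zigzag indices)
    c          : changeover point - after c elements the scan continues in
                 reverse order from the end of the basic scan
    l          : cut-off point - the last l scanned coefficients are set to 0"""
    head = list(scan_order[:c])          # first c positions, normal order
    tail = list(scan_order[c:])[::-1]    # remaining positions, reversed
    order = head + tail
    out = np.asarray(coeffs, dtype=float)[order]
    if l > 0:
        out[-l:] = 0.0                   # drop the last l coefficients
    return out, order

# toy usage with an identity "scan" purely for illustration
coeffs = np.random.randn(64)
scanned, order = extended_scan(coeffs, np.arange(64), c=8, l=32)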
Before quantizing the transform coefficient yi, which corresponds to pixel i, we add a term to yi. This term is the sum of the quantization errors rj of the pixels j (j = 1, …, i−1), each weighted by the feedback factor kij. If ri denotes the quantizer error and zi the quantizer output, then zi can be expressed by Eq. (6), and the feedback factor matrix K is given by Eq. (7).
z_i = y_i + \sum_{j=1}^{i-1} k_{ij} r_j + r_i    (6)
K = \begin{pmatrix}
1      & 0      & \cdots & \cdots    & 0      \\
k_{21} & 1      & 0      &           & \vdots \\
k_{31} & k_{32} & 1      & \ddots    & \vdots \\
\vdots &        & \ddots & \ddots    & 0      \\
k_{n1} & k_{n2} & \cdots & k_{n,n-1} & 1
\end{pmatrix}
\qquad (7)
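Eqs. (6)–(7) translate into a simple sequential loop in which each coefficient is offset by a weighted sum of the quantization errors already committed. The sketch below assumes a generic scalar quantizer callable and a given lower-triangular K with unit diagonal; it is not the authors' module.

import numpy as np

def feedback_quantize(y, K, quantizer):
    """Feedback quantization following Eqs. (6)-(7).
    y         : transform coefficients in scan order (length n)
    K         : n x n lower-triangular feedback matrix with unit diagonal
    quantizer : scalar quantizer, e.g. the nonlinear one from the Appendix
                (any callable float -> float works for this sketch)"""
    n = len(y)
    z = np.zeros(n)          # quantized outputs
    r = np.zeros(n)          # quantization errors fed back to later samples
    for i in range(n):
        s = y[i] + K[i, :i] @ r[:i]      # y_i plus the weighted past errors
        z[i] = quantizer(s)
        r[i] = z[i] - s                  # so that z_i = s + r_i, as in Eq. (6)
    return z, r

# toy usage: a uniform quantizer with step t as a stand-in for the nonlinear one
t = 4.0
z, r = feedback_quantize(np.random.randn(64) * 10,
                         np.eye(64),                      # K = I: no feedback
                         lambda s: t * np.round(s / t))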
To improve coding efficiency, an extension quantization technique is also used in our coding model, as shown in Fig. 3. In our simulation, the best c and l are selected based on a large number of experiments.
3 Orthogonal Transform and Multimode Coding
3.1 Orthogonal Transform
A proper linear transform can de-correlate the input samples to remove the spatial redundancy which exists in the image or video frame. From the basic concepts of information theory, coding of symbols in vectors is more efficient than in scalars [10]. In this paper, we use the following transformation techniques to improve the coding efficiency.
• Discrete Cosine Transform. The Discrete Cosine Transform is a widely used transform coding technique in image and video compression algorithms. The top-left coefficient in each block is called the DC coefficient and is the average value of the block. The rightmost coefficients in the block are the ones with the highest horizontal frequency, while the coefficients at the bottom have the highest vertical frequency.
• Discrete Sine Transform. The Discrete Sine Transform (DST) was originally developed by Jain [12] and belongs to the family of unitary transforms [13]. Since its introduction, the DST has found application in the modeling of random processes whose KLT is a fast transform [12][13]. It is also used in image reconstruction [14] and in image coding [15].
• Discrete Wavelet Transform. The basic idea of the wavelet transform is to represent an arbitrary function as a superposition of a set of wavelets or basis functions. These basis functions, or child wavelets, are obtained from a single prototype wavelet called the mother wavelet, by dilations (scaling) and translations. In this paper, we use the Haar wavelet scaling function.
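As a concrete example of the first option, the orthonormal 8×8 DCT-II can be written as a matrix product. The sketch below is only illustrative and makes no claim about the exact transform definitions or normalizations used in the paper.

import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n).reshape(-1, 1)          # frequency index
    i = np.arange(n).reshape(1, -1)          # sample index
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0, :] = np.sqrt(1.0 / n)
    return M

def dct2(block):
    """Separable 2D DCT of a square block; result[0, 0] is the DC term."""
    M = dct_matrix(block.shape[0])
    return M @ block @ M.T

block = np.random.rand(8, 8) * 255
coeffs = dct2(block)
# the inverse is the transpose product, since M is orthonormal
M = dct_matrix(8)
assert np.allclose(M.T @ coeffs @ M, block)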
3.2 Multimode Coding
In this paper, we propose an adaptive multimode model for hybrid encoding. Table 1 lists all coding modes used in this research. The orthogonal transform and the scan order can be adaptively changed according to the input image and the quantization step. The I transform means that transform coding is not used.
Table 1. Mode list of this research. There are 11 encoding modes in this study: {DCT, DST, DWT} + {zigzag, horizontal, vertical}, I + horizontal, and DWT + special order (see Appendix).
Mode Number   Orthogonal Transform   Scan Order
0             DCT                    Zigzag
1             DCT                    Horizontal
2             DCT                    Vertical
3             DST                    Zigzag
4             DST                    Horizontal
5             DST                    Vertical
6             DWT                    Zigzag
7             DWT                    Horizontal
8             DWT                    Vertical
9             DWT                    Special
10            I                      Horizontal

3.3 Parallel Computing Time of Proposed Method
After adding the multimode to the hybrid coding system, the diagram of our model can be expressed as in Fig. 4. The pre-coder module is designed to obtain the predicted values of the four corners, A~D, shown in Fig. 1(b). The noncausal hybrid coder here refers to one instance of the typical hybrid coding configuration shown in Fig. 2, but with a different transform and scan order. The number of the best mode, i.e. the one with the least MSE (Mean Square Error), is transmitted to the decoder as overhead information. Because multimode coding is used in our model, the computational time increases. However, an advantage of noncausal prediction is its potential for high parallelism, so when an I-frame is encoded as several blocks, these blocks can be processed at the same time. In our model, the bottleneck of parallelism is the design of the pre-coder, which obtains the predicted values of the four corners of one block, as shown in Fig. 1(b). If it is designed appropriately, then in an ideal system the encoding time for one I-frame equals the time to encode one block plus the delay d.
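The mode decision itself is an exhaustive search over the modes of Table 1: encode the block with every mode, decode it, and keep the mode with the smallest MSE, signalling its index as overhead. The sketch below illustrates this with placeholder encode/decode callables; they stand in for the noncausal hybrid coders of Fig. 4 and are not the paper's modules.

import numpy as np

def select_mode(block, modes):
    """Pick the coding mode with the least reconstruction MSE.
    modes : list of (name, encode, decode) triples, where encode/decode are
            placeholder callables standing in for one noncausal hybrid coder
            (transform + scan order + feedback quantizer) each.
    Returns the best mode index (sent to the decoder as overhead) and its MSE."""
    best_idx, best_mse = None, np.inf
    for idx, (name, encode, decode) in enumerate(modes):
        rec = decode(encode(block))
        mse = float(np.mean((block.astype(float) - rec) ** 2))
        if mse < best_mse:
            best_idx, best_mse = idx, mse
    return best_idx, best_mse

# toy usage with two dummy "modes" (identity and a coarse quantizer)
modes = [
    ("identity", lambda b: b, lambda c: c),
    ("coarse",   lambda b: np.round(b / 8), lambda c: c * 8),
]
idx, mse = select_mode(np.random.rand(8, 8) * 255, modes)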
Fig. 4. Block diagram depicting the coding model
4 Simulation Results
We compared the performance of the proposed method with the H.264 high-complexity profile (high 4:2:0), using four test images: the first frames of Foreman and Bus, each at QCIF (176×144) and CIF (352×288) size.
Table 2. The correlation coefficients of the test images
Test Image      ρh (Horizontal)   ρv (Vertical)
Foreman_qcif    0.9655            0.9335
Foreman_cif     0.9726            0.9583
Bus_qcif        0.875             0.7757
Bus_cif         0.8989            0.8414

4.1 Comparison of Prediction Error
The prediction errors of the two methods are shown in Fig. 5; only QCIF-size images are compared in this experiment. The prediction error of the proposed method is the data at point 1 in Fig. 2, while that of H.264 (high 4:2:0) is the data before the transform is applied. All error values here are rounded. According to Fig. 5, it is clear that the error distribution of the proposed method is more uniform than that of H.264.
(Fig. 5 panels: foreman_qcif / H.264, bus_qcif / H.264, foreman_qcif / PM, bus_qcif / PM)
Fig. 5. The x-axis gives the pixel number, from 0 to 25344 (176×144); the y-axis gives the error value. PM is an abbreviation of Proposed Method.
Table 3. Statistical properties of the two methods. PM is an abbreviation of Proposed Method; "Number of 0" gives how many pixels are predicted exactly in the frame; "average error" is the average error value over the frame.
Test Image   Number of 0 (PM / H.264)   Average error (PM / H.264)   Entropy (PM / H.264)
Foreman      7675 / 1215                50.96 / 709.9                3.965 / 6.299
Bus          3592 / 761                 186.9 / 878.7                5.343 / 6.581

4.2 Comparison of Coding Efficiency
Because there is no entropy coding module in the proposed method, it is difficult to compare it directly with H.264. As a result, we processed the data obtained from H.264: the PSNR shown here is directly calculated by the JM model, the reference software based on the H.264 standard, while the entropy is calculated by separate code. We obtained the quantized data, used them to calculate the entropy based on Shannon's theorem, and then added the overhead information of each macro-block, such as block type and prediction mode. The entropy of the proposed method is obtained in the same way.
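In spirit, this entropy figure is a first-order Shannon estimate over the quantized symbols plus per-macroblock overhead bits. The sketch below illustrates that computation; the overhead bit counts and block sizes used in the example are placeholders, not values from the paper.

import numpy as np

def shannon_bits(symbols):
    """First-order Shannon estimate: total bits = N * H(symbol distribution)."""
    symbols = np.asarray(symbols).ravel()
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    entropy_per_symbol = -np.sum(p * np.log2(p))
    return entropy_per_symbol * symbols.size

def bits_per_pixel(quantized, n_pixels, overhead_bits_per_mb=0, n_macroblocks=0):
    """Entropy in bit/pixel: Shannon bits of the quantized data plus the
    side information (block type, prediction/coding mode, ...) per macroblock."""
    total = shannon_bits(quantized) + overhead_bits_per_mb * n_macroblocks
    return total / n_pixels

# toy usage for a QCIF frame (176x144) split into 8x8 blocks
q = np.random.randint(-4, 5, size=(144, 176))
print(bits_per_pixel(q, n_pixels=176 * 144, overhead_bits_per_mb=4,
                     n_macroblocks=(144 // 8) * (176 // 8)))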
Fig. 6. Comparison results using four sequences. All points in these figures are obtained by changing the quantization step. Our model is relatively insensitive to changes in image statistics and gives a higher PSNR over a certain range of entropy.
4.3 Subjective Evaluation
(Fig. 7 panels: (a) Original, (b) Proposed, (c) H.264, (d) Original, (e) Proposed, (f) H.264)
Fig. 7. Decoded pictures of the two methods are shown here. Only QCIF pictures are presented. Pictures (a) and (d) are the original pictures, named foreman and bus. Picture (b) is decoded from the proposed method at 0.434 bit/pixel, while (c) is at 0.420 bit/pixel coded by H.264; picture (e) is decoded from the proposed method at 1.034 bit/pixel, while (f) is at 1.047 bit/pixel coded by H.264. According to these decoded images, it is clear that the proposed method handles the details of the image better.
5 Conclusion
In this paper, a new hybrid video coding method based on noncausal prediction has been proposed, and various techniques, such as multimode coding and feedback quantization, have been investigated to improve its coding efficiency. Because a noncausal prediction is used, it is possible to increase the parallelism if the encoding algorithm is appropriately designed. The key features of the coding system employed are noncausal prediction, feedback quantization and multimode coding. After further research, specifically on the inter-frames of video, noncausal prediction has a strong potential to become a very competitive parallel video coding method.
References 1. Jain, A.K.: Image Coding via a Nearest Neighbors Image Model. IEEE Transactions on Communications COM-23, 318–331 (1975) 2. Hatori, Y.: Optimal Quantizing Scheme in Interpolative Prediction. The Journal of the Institute of Electronics, Information and Communication Engineers J66-B(5) (1983) 3. Wallace, G.K.: The JPEG still picture compression standard. Communications of ACM 34(4), 31–44 (1991)
4. Le Gall, D.: MPEG: A video compression standard for multimedia applications. Communications of ACM 34(4), 47–58 (1991) 5. Liou, M.: Overview of the px64 kbps video coding standard. Communications of ACM 34(4), 60–63 (1991) 6. Ahmed, N., Natarajan, T., Rao, K.R.: “Discrete Cosine Transform. IEEE Transactions on Communications COM-23, 90–93 (1974) 7. Antonini, M., Barlaud, M., Mathieu, P., Daubechies, I.: Image coding using wavelet transform. IEEE Transactions on Image Processing 1(2), 205–221 (1992) 8. Li, Z.-N., Drew, M.S.: Fundamentals of Multimedia. Pearson Education, New Delhi (2004) 9. Mori, S., Kubota, A., Hatori, Y.: Examination of Hybrid Coding Method by Interpolative Prediction and DCT Quantization. In: IEVC 2010, 2C-3, Nice, France (March 2010) 10. Shannon, C.E.: A Mathematical theory of Communication. Bell System Technical Journal 27, 623–656 (1948) 11. Jack, K.: Video Demystified. Penram International Publishing Pvt. Ltd., Mumbai (2001) 12. Jain, A.K.: Fast Karhunen-Loève transform for a class of stochastic processes. IEEE Trans. Commun. COM-24, 1023–1029 (1976) 13. Jain, A.K.: A sinusoidal family of unitary transforms. IEEE Trans. Pattern Anal. Machine Intell. PAMI-I, 356–365 (1979) 14. Cheng, S.: Application of The Sine-Transform Method in Time-of-Flight Positronemission Image Reconstruction Algorithms. IEEE Trans. Biomed. Eng. BME-32, 185–192 (1985) 15. Rose, K., Heiman, A., Dinstein, I.: ‘DCT/DST alternate-transform image coding. Presented at GLOBECOM 1987, Tokyo, Japan, November 15-18 (1987) 16. Sehgal, A., Jagmohan, A., Ahuja, N.: Wyner-Ziv Coding of Video: An Error-Resilient Compression Framework. Presented at IEEE Transactions On Multimedia 6(2) (April 2004)
Appendix

Special scan order of DWT:
1 3 7 8 21 22 23 24
2 4 11 12 29 30 31 32
5 9 13 15 41 43 45 47
6 10 14 16 42 44 46 48
17 25 33 34 49 51 52 58
18 26 35 36 50 53 57 59
19 27 37 38 54 56 60 63
20 28 39 40 55 61 62 64

Nonlinear quantization (t is the quantization step):
input            output
0 ≤ x ≤ t        0
t ≤ x ≤ 3t       2
3t ≤ x ≤ 6t      4
6t ≤ x ≤ 11t     6
11t ≤ x ≤ 18t    8
18t ≤ x ≤ 28t    10
28t ≤ x ≤ 42t    12
42t ≤ x ≤ 58t    14
Then linear quantization (e.g. 58t ≤ x ≤ 74t → 66t).
Color-Based Extensions to MSERs
Aaron Chavez and David Gustafson
Department of Computer Science, Kansas State University, Manhattan, KS 66506
{mchav,dag}@ksu.edu
Abstract. In this paper we present extensions to Maximally Stable Extremal Regions that incorporate color information. Our extended interest region detector produces regions that are robust with respect to illumination, background, JPEG compression, and other common sources of image noise. The algorithm can be implemented on a distributed system to run at the same speed as the MSER algorithm. Our methods are compared against a standard MSER baseline. Our approach gives comparable or improved results when tested in various scenarios from the CAVIAR standard data set for object tracking.
1 Introduction Vision-based object tracking is essentially a problem of image correlation. Any tracking algorithm must be able to recognize how certain objects (represented as points or regions) correlate between two images. Many approaches are feature-based. These approaches attempt to find regions of interest in an image. The underlying assumption, of course, is that the object(s) to track will correspond to “interesting” regions. The core of any feature-based approach is the chosen interest region detector. Such a detector must produce regions that are consistently identified, matched, and localized from one frame in an image sequence to the next. Certain region detectors are more robust than others with respect to particular types of image deformations. Given a priori knowledge of our particular tracking problem, we can choose a suitable detector. But, for a general algorithm, we need a detector that is robust with respect to all common image deformations. The detector must produce regions that are invariant to changes in illumination, rotation, scale, affine transformation, and background. Maximally Stable Extremal Regions (MSERs) can be detected in a straightforward fashion, but are still robust with respect to changes in illumination, rotation, scale, and affine transformation. In this paper we explore the shortcomings of standard MSERs with respect to background changes, and address them with a color-based formula. We compare the results of our new descriptor against standard MSERs and other color-based extensions. G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 358–366, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Related Work
2.1 Object Tracking
General purpose object tracking remains an open problem. Nevertheless, some approaches have been successful. Yilmaz, Javed, and Shah [6] give a detailed survey of object tracking methods. Optic flow [1] was one of the first such approaches. Optic flow works at the pixel level, finding pixel correspondences between two images. With enough pixel correspondences, one can confidently define the geometric relationship between two images. Mean shift filtering is a technique to smooth an input image. After filtering, region boundaries become more reliable. These regions are suitable for tracking [3]. This technique has been further extended with kernel-based object tracking [2]. Kernel-based tracking uses an isotropic kernel to quickly localize an interest region, rather than a brute force search of subsequent images. Low-level machine learning techniques have been the basis of successful object tracking schemes. Such techniques include support vector machines [4] and AdaBoost [5], among others. Feature-based object tracking encompasses a broad field of techniques, because the literature and background of feature detection are rich. Interest region (or feature) detection is relevant to almost all vision problems, including tracking but also object detection, classification, and vision-based navigation. Also, it is straightforward to adapt many high-level tracking algorithms to use any low-level interest detector. Algorithms have been developed to use the Harris corner detector [7], SIFT histograms [8], and MSERs [9], among others.
2.2 MSERs
MSERs were designed to address the problem of wide-baseline stereo matching [9], an intrinsically similar problem to object tracking. Both problems rely on the computation of a correlation between two images. MSERs are constructed through an iterative process that gradually joins the pixels of an image together. The MSER algorithm detects regions that are intrinsically "lighter" or "darker" than their surroundings. By nature it can detect regions of arbitrary geometry, a desirable property for general object tracking. The MSER algorithm has been improved since its inception. There exists a fast (worst-case linear time) algorithm to find MSERs [10]. MSERs have been shown to work well with the SIFT feature descriptor, and have been extended to be robust to scale changes [11]. Object tracking algorithms have been tailored to MSERs that exploit their structure for a faster, more efficient tracker [12]. Methods to incorporate color into MSERs have been investigated [13].
3 Methodology Our goal is to improve the behavior of MSERs by addressing their most crucial shortcoming. MSERs are not capable of capturing regions that are both darker and lighter than certain portions of their immediate surroundings.
The standard MSER algorithm begins by looking at the pixels with the lowest intensity. Gradually, pixels with greater intensities are incorporated into the regions. By the end of the process, all pixels have been merged into one region. Regions that were stable for a significant amount of “time” are deemed MSERs. If we use the standard intensity measure of a pixel (the luminance), then black pixels will have the lowest intensity, and white pixels will have the highest intensity. Every MSER detected will correspond to a region that is appreciably darker than its surroundings. For greater coverage, the standard MSER algorithm runs twice, once on the image and once on the inverted image. This produces two kinds of regions: regions that are strictly darker than their surroundings, and regions that are strictly lighter. It is impossible for either of these measures to capture a region that is lighter than a certain portion of its surroundings, and darker than another portion. A building that is lighter than the ground beneath it but darker than the sky surrounding it would not be detected. This is a significant problem. Most real-word scenarios have widely varying backgrounds. It is reasonable to expect that some objects will be lighter than certain background elements, but darker than others. Nevertheless, the MSER algorithm is robust enough to be adapted. The algorithm is defined in a general fashion and can use almost any function for intensity. We are not strictly limited to measuring luminance. In fact, any function that maps pixels to a totally ordered binary relation will suffice. So, the task is then to produce a suitable function that will capture regions that are simultaneously lighter and darker than certain portions of their surroundings. The function we are looking for is such that every pixel in the object maps to a higher intensity than every pixel in the background (or, similarly, every pixel in the object maps to a lower intensity). There are many kinds of backgrounds and many kinds of objects, so it is unlikely that one function can capture this relationship. But, perhaps a small family of functions could capture a large percentage of objects. It is natural at this point to consider incorporating color information. An object that is both lighter and darker than certain background components might be “redder” than both. Perhaps there is a red ball against a light blue sky, sitting on dark green grass. Such an example is contrived, but empirically we find that with a small family of color functions, we can segment most objects from their backgrounds, even when those backgrounds are visually diverse. All we need do is run the MSER algorithm for each color function. The MSER algorithm will need to run several times, but the time cost is mitigated by the fact that the process is trivially parallelizable. We find that a very small group of color functions (one for red, one for green, one for blue, and the two standard grayscale functions for MSER) give reasonable coverage and appreciable improvements on the standard MSER algorithm. Since additional functions can be run in parallel, implementations that have access to significant parallel computation resources could potentially run many more than five functions. 3.1 Algorithm Modified intensity functions. The intensity functions we select must provide a wide variety of possible separations between an object and its background.
Our initial efforts simply applied the MSER algorithm on each color channel of the RGB image. This was found to be inadequate, as nearly all detected regions were similar or equal to those of the standard grayscale MSER. To resolve this, we moved to the HSV color space. In HSV, the color “red” is characterized by a hue of 0 (on a 360-degree color wheel), and saturation/value of 1. A low value would imply black, a low saturation would imply gray, and a hue that was far from 0 would imply a color different than red (180 would correspond to blue-green, the “furthest” color from red in some sense). So, given a hue h, a saturation s, and a value v, we define red(h, s, v) = |180 – h| * s * v
(1)
This intensity function does a good job of separating red pixels from dissimilar pixels, including both grayscale pixels and pixels with an identifiable color. Functions for green and blue are defined similarly. These three functions comprise our color operators. For grayscale, we use the standard MSER+ and MSER- functions, which simply detect dark and light regions. Implementation. Other than the modified intensity functions, the algorithm is nearly identical to the standard MSER algorithm. We do incorporate certain known optimizations. For speed, we use the linear-time MSER implementation found in [12]. Another optimization originates from [13]. The distribution of pixel intensities in an image is rarely uniform, especially under blur and certain illumination conditions. But, a uniform distribution would be preferable in order to recognize the relative difference between two pixels. Thus, we derive an approximately uniform distribution of intensities by sorting the pixels in ascending order of intensity and placing them into 100 approximately equally-sized bins using a greedy strategy. Importantly, two pixels with equal intensities are always placed in the same bin. Before finding MSERs, we perform a Gaussian blur on the image to reduce highfrequency information. Our detection scheme is only indirectly based on gradient information and thus is robust to significant blurring. The MSERs are then detected. For a descriptor, we compute a center and orientation based on the centroid and moments of the region. After we have a center point and orientation, we use the SIFT descriptor to characterize the region. We found this to be preferable to color moments (frequently used as a descriptor with MSERs) for this particular task. Correspondences from one image to the next in a sequence are then found using basic SIFT matching. We use the SiftGPU implementation [15] to maximize parallelization. Evaluation. The color-based functions, predictably, compare favorably to the standard MSER algorithm when recognizing objects with a clearly discernible color (see Figure 1). However, they may also be useful in more general scenarios. In a tracking scenario, it is common for an object to pass across many varied backgrounds, and perhaps even to become occluded by various objects. As mentioned before, the standard MSER algorithm has significant difficulty attempting to track an object across portions of the background that are simultaneously lighter and darker than the object.
Fig. 1. comparison of color function against standard MSER. In the left image, the standard MSER+ algorithm is unable to recognize the entire robot in the center. It can only detect small features that are not easily matched in subsequent frames. In the right image, the red function easily captures the entirety of the (red) robot, providing a good region for subsequent matching.
For an object without a strong color signature, it is unlikely that our color functions will discriminate the object from the background more effectively than the standard MSER algorithm in average situations. Noticeable differences in luminance are more common than noticeable differences in hue. However, an object that moves across a visually diverse scene may be difficult to discern using the standard MSER detector at certain intervals in the tracking process. If the object moves into an area with different lighting, different occluding objects, or different background elements, a color-based intensity function may become favorable for a short time interval.
4 Testing We want to test the extent to which our color functions provide complementary information to the standard MSER algorithm in general tracking scenarios. We use a simple correspondence measure to evaluate the ability of the algorithm to recognize a moving object from one phase to the next. We perform our tests on several scenarios from the CAVIAR dataset. We wish to measure the correspondences of the detector, but only with respect to the objects we are tracking (not to the background). So, we define a restricted measure, calculated as follows: • We compare each pair of consecutive images and find the matching MSERs (based on a simple SIFT match). Only pairs of images that contain at least one object to be tracked, according to the ground truth file, are considered. • We reject matches that are part of the background by referring to the ground truth bounding box(es) of the object(s). For any match, if the centroid of either region lies outside the bounding box of a tracked object, we throw out that match. It does not count as a correspondence.
• Also, for a match to qualify as a correspondence, we require that the overlap error in the ellipses be no greater than 40%, as suggested by [14]. The minor change in the position of the object introduces a slight inaccuracy into this computation, but the object moves very little from frame to frame in the CAVIAR scenarios. • Correspondences meeting these criteria are tallied. For each individual color or intensity function, we tested this correspondence measure on every pair of consecutive images in the scenario. Since we did not synthesize the functions into one algorithm, we can observe the extent to which they provide complementary information to each other. If different color functions provide different matches, they might be combined into a much more robust overall detector. To compare our approach with existing color-based variations on MSERs, we performed the same correspondence tests on the MSCR detector [13]. However, the tests could not produce viable results. This is probably due to the format of the CAVIAR test data (384 x 288 JPEG images). It is mentioned in Forssen that MSCRs are quite sensitive to JPEG compression. After analysis, we believe this is because MSCRs are defined on the gradient image. JPEG compression creates artifacts in the form of 8x8 blocks. This produces many false minor edges in the gradient image that result in undesirable merging of regions, rendering the algorithm unusable on these images.
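The restricted correspondence measure described above can be summarized in a few lines; the SIFT matching and the ellipse overlap error are treated as external black boxes here (placeholder callables and attributes), since the text relies on standard implementations for both.

def count_correspondences(matches, gt_boxes, overlap_error, max_overlap_error=0.40):
    """Count the valid correspondences between two consecutive frames.
    matches        : list of (region_a, region_b) pairs from SIFT matching,
                     where each region exposes a .centroid = (x, y)
    gt_boxes       : ground-truth bounding boxes [(xmin, ymin, xmax, ymax), ...]
    overlap_error  : callable (region_a, region_b) -> ellipse overlap error,
                     treated here as an external black box"""
    def inside_some_box(pt):
        x, y = pt
        return any(xmin <= x <= xmax and ymin <= y <= ymax
                   for xmin, ymin, xmax, ymax in gt_boxes)

    valid = 0
    for a, b in matches:
        # reject matches whose centroid falls outside every tracked object
        if not (inside_some_box(a.centroid) and inside_some_box(b.centroid)):
            continue
        # require at most 40% ellipse overlap error, as in [14]
        if overlap_error(a, b) <= max_overlap_error:
            valid += 1
    return valid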
5 Results Figure 2 shows the results of the correspondence tests. We display results on three 1 scenarios from the CAVIAR data set. In the left columns, we observe the number of valid correspondences found by each function. Unsurprisingly, MSER+ (dark regions) finds the most correspondences overall. However, each color function produces a large number of correspondences as well. The red color function is particularly effective, finding the most correct correspondences of any intensity function on the second scenario. In the right columns, we measure the extent to which the color functions complement the standard MSER intensity functions. For each pair of images, we check if any correct correspondences were found with a standard intensity function, and whether any correct correspondences were found with a color function. We find that when we add the additional color functions, at least one correct object correspondence is found in almost every pair of images. The combined function group finds correspondences in 99% of the images in the first and third scenarios. The second scenario contains many objects entering and leaving view. So for many frames, an object may comprise only a couple pixels on the edge of the image, and finding a correspondence is nearly impossible. This data affirms that the color functions are indeed recognizing correspondences under different circumstances than the grayscale intensity functions. Then, a tracker that incorporated both would have access to additional, non-redundant information. 1
EC Funded CAVIAR project/IST 2001 37540, found at the following URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
Fig. 2. Correspondence results for three scenarios from the CAVIAR data set (top row: OneLeaveShop1cor, middle row: TwoLeaveShop2cor, bottom row: OneLeaveShopReenter2cor). The left column denotes the number of valid correspondences found for each intensity function. The right column denotes the fraction of frames where at least one valid correspondence was found by any intensity function.
6 Conclusion
We have presented an adaptation of MSERs that improves their behavior by incorporating color information. Like MSERs, our interest operator is robust with respect to common sources of image noise, but it is also able to detect objects on varying backgrounds. With parallelization, our algorithm runs at the same speed as the standard MSER algorithm. Future work should explore more nuanced functions to properly separate objects from backgrounds. Texture operators or other filters could be feasible, as the function could easily be defined on the area surrounding each pixel, rather than simply the pixel itself. Each function need not work on every object, but rather be useful enough to justify its inclusion in a family of discriminating functions. Also, the algorithm should be incorporated into a complete object tracking system. Optimizations might make background segmentation even easier, and more robust methods exist to track behavior over a series of images (rather than just comparing two images at a time). Such a system demands testing in a more comprehensive object tracking scenario.
References 1. Horn, B., Schunk, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981) 2. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 564–575 (2003) 3. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1197–1203 (1999) 4. Papageorgiou, C., Oren, M., Poggio, T.: A general framework for object detection. In: Proceedings of the Sixth IEEE International Conference on Computer Vision, pp. 555–562 (1998) 5. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 734–741 (2003) 6. Yilmaz, A., Javed, O., Shah, M.: Object Tracking: A Survey. ACM Journal of Computing Surveys 38(4) (2006) 7. Harris, C., Stephens, M.: A combined corner and edge detector. In: 4th Alvey Vision Conference, pp. 147–151 (1988) 8. Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. In: In Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 1150–1157 (1999) 9. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proceedings of the Thirtheen British Machine Vision Conference, pp. 384–393 (2002) 10. Nistér, D., Stewénius, H.: Linear time maximally stable extremal regions. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 183–196. Springer, Heidelberg (2008) 11. Forssén, P.-E., Lowe, D.: Shape descriptors for maximally stable extremal regions. In: Proceedings of the Eleventh International Conference on Computer Vision, pp. 59–73 (2007)
12. Donoser, M., Bischof, H.: Efficient maximally stable extremal region (MSER) tracking. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 553– 560 (2006) 13. Forssén, P.-E.: Maximally stable colour regions for recognition and matching. In: IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, USA. IEEE Computer Society Press, Los Alamitos (2007) 14. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. International Journal of Computer Vision 65, 43–72 (2005) 15. Wu, C.: SiftGPU: A GPU Implementation of Scale Invariant Feature Transform (2007), http://cs.unc.edu/~ccwu/siftgpu
3D Model Retrieval Using the Histogram of Orientation of Suggestive Contours
Sang Min Yoon 1 and Arjan Kuijper 2
1 Digital Human Research Center, AIST, Tokyo, Japan
2 Fraunhofer IGD, Darmstadt, Germany
Abstract. The number of available 3D models in various areas increases steadily. Efficient methods to search for 3D models by content, rather than textual annotations, are crucial. For this purpose, we propose a content based 3D model retrieval system using the Histogram of Orientation (HoO) from suggestive contours and their diffusion tensor fields. Our approach to search and automatically return a set of 3D mesh models from a large database consists of three major steps: (1) suggestive contours extraction from different viewpoints to extract features of the query 3D model; (2) HoO descriptor computation by analyzing the diffusion tensor fields of the suggestive contours; (3) similarity measurement to retrieve the models and the most probable view-point. Our proposed 3D model retrieval system is effective at retrieving 3D models even when there are variations in the shape and pose of the models. Experimental results are presented and indicate the effectiveness of our approach, competing with the current – more complicated – state-of-the-art method and even improving results for several classes.
1 Introduction The rapid increase in the number of available 3D models requires accurate, automatic, and effective methods to search for 3D models based on their content, rather than on textual annotations. It is crucial for many applications such as industrial design, engineering, and manufacturing, to provide for scalable data management. This need has led to the development of several approaches to compute the similarity between two 3D models [1] in recent years by using algorithms that exploit the shape histogram [2], the shape distribution [3], moments [4], light fields [5], or 3D harmonics [6]. Following such approaches, users can search for 3D models by supplying an example query object. The actual approach to compute a descriptor can be classified into several categories [1, 7]. These include histogram-based, graph-based, shape-based, and image-based approaches. In this paper we propose an approach to compute an image-based (or more precisely, view-based) descriptor using suggestive contours [8]. We provide the suggestive contours of each 3D model from several predefined viewpoints and compute a feature vector based on the orientation of these contours. By comparing such feature vectors we can rank the 3D models according to their similarity to an example model. Since we projected each 3D model from several viewpoints, we can also align the orientation of said 3D models. In [9] such an approach using suggestive contours was proposed G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 367–376, 2011. c Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Flowchart of our proposed 3D model retrieval approach
for sketch-based retrieval in 3D databases. In contrast, we apply this idea to the Query-by-example search paradigm [9]. We provide an experimental effectiveness comparison with the current state-of-the-art combination of several 3D descriptors, DSR [10], showing the effectiveness of our approach: competing with the more complicated DSR and even improving results for several classes. Figure 1 shows how we extract the meaningful features from the complex 3D mesh models. Our proposed system is composed of three steps: 1) extracting the suggestive contours from different viewpoints; 2) feature analysis using diffusion tensor fields; 3) similarity measurement using the histogram of orientation. The remaining part of this paper is organized as follows. In Section 2 we briefly survey related work in the area of 3D model retrieval. In Section 3 we explain in detail how we extract features from 3D models using suggestive contours images and measure the similarity using the histogram of orientation that is based on the properties of diffusion tensor fields. In Section 4 we present the experimental results in retrieving the 3D models from a large database, followed by a discussion in Section 5.
2 3D Model Retrieval and Histogram of Oriented Gradients There are numerous approaches in 3D model retrieval to compute the similarity between two objects. A good overview can be found in [7]. Approaches can be distinguished by supporting global or partial model similarity. Global methods determine the overall similarity of the entire shape, while partial methods analyze for local similarities. We compute a feature vector to describe the global shape of a model. It uses view-based features as in [9], meaning we project and render 3D models as images from several view-points. This enables retrieval that is robust to changes in orientation [10, 11]. Methods from 2D shape analysis and content-based image retrieval also become applicable to compute a feature vector of each view-image, see e.g. [12, 13]. From the features of the suggestive contours [8], we encode magnitude and orientation properties of the diffusion tensor field as a histogram. This relates to the rather successful image descriptor Histogram of Oriented Gradients (HOG) used primarily for
Fig. 2. Object, suggestive contours, ridges and valleys, and outline contour
detection and recognition tasks, and that was recently used for 3D model retrieval [14, 15]. Apart from computing a suitable descriptor for the view-images of the 3D model, the challenge of projecting and rendering a 3D model in a meaningful way remains. Previous approaches in 3D model retrieval rely on projecting the silhouette (contour) of a model [16–19]. Such a rendering does not account for most of the detail found in the original model. Looking in the direction of non-photo-realistic rendering techniques [20], suggestive contours [8] were created with the goal of resembling 3D objects as closely as possible. Accordingly, we argue for their use, as they convey three-dimensional properties of the model in each view.
3 Our Approach
Our approach for 3D model retrieval evaluates the similarity by comparing 14 projected views of each 3D model. For each such image, we extract a histogram of orientation from the corresponding diffusion tensor fields of the suggestive contours.
3.1 Suggestive Contours Extraction from Different Viewpoints
In information retrieval it is very important to extract efficient features and measure the similarity as closely as possible to the intention of the user. Our approach is to retrieve the relevant 3D models by evaluating view-based similarity between projected images of the 3D models. Finding appropriate projection and rendering techniques is crucial: the performance of 3D model retrieval depends on the choice of projection type and on the number of view-point positions and view directions. To overcome the drawbacks of the previous approaches and find the most similar features of the 3D model, we extract the suggestive contours (http://www.cs.rutgers.edu/˜decarlo/contour.html) to construct the shape descriptors from different viewpoints [8]. The suggestive contours provide the lines drawn on clearly visible parts of the surface, where true contours would first appear with a minimal change in view-point. The boundary contour of a 3D model is very easy to extract and visualizes the characteristics of the 3D model, but it does not give access to the interior information of the model. On a smooth surface, ridges and valleys provide features like creases, and can help to convey the structure and complexity of an object. Ridges and valleys, however, lack the view-dependent nature that the intention of the user possesses. The outline as well as major ridges and valleys of an example model are shown in Figure 2 together with the suggestive contours. This is a very minimalistic way to convey three-dimensional shape by a two-dimensional image.
Fig. 3. Left: 3 × 4 image patch. The blue painted ellipses (with Eigensystem) are elements of suggestive contour. Middle: Viewpoints for an object. Right: The red painted pixel elements of the suggestive contours represented as ellipse models from different view points.
So, to be able to retrieve the relevant 3D models, we render the suggestive contours of each model from 14 different, equally spaced view-points. We use 6 orthographic projections and 8 isometric projections (see Figure 3, middle). We use these 14 viewpoints as a compromise between accuracy with respect to the view-point the user had in mind and the processing time of the algorithm.
3.2 Features in Diffusion Tensor Fields
To extract a feature vector from each suggestive contours image and the query model itself, we analyze its properties in the space of diffusion tensor fields, as this provides the target objects' gradient information in a stable way. Diffusion tensor fields were originally introduced in the area of medical image processing to measure the diffusion of water in tissue. Using this technique, it is possible to analyze the motion of deformable objects which have a high degree of freedom [21]. The diffusion tensor field T at each pixel is given by

T = \begin{pmatrix} T_{xx} & T_{xy} \\ T_{yx} & T_{yy} \end{pmatrix},    (1)

where T_{xy} = T_{yx}, so this corresponds to a symmetric matrix. This matrix can be reduced to its principal axes by solving the characteristic equation

(T − λ · I) e = 0,    (2)
where I is the identity matrix, λ are the eigenvalues of the tensor and e are the orthonormal eigenvectors. In each pixel the tensor can be represented by an ellipsoidal model, where the main axis length is proportional to the eigenvalues λ1,2 (λ1 > λ2). Each pixel of the suggestive contours within a projected image is represented as a two-dimensional ellipse. These properties of each ellipse are later on organized into a histogram according to their orientation and their magnitude. Figure 3 illustrates this. On the left, the ellipsoidal representation of the suggestive contours elements is shown in blue. Each pixel of the suggestive contours within a projected image is represented as a two-dimensional ellipsoid, whose direction and scale are determined by the corresponding eigenvalues and eigenvectors from the diffusion tensor field. On the right such an ellipsoidal representation of the suggestive contours of several projected views of a 3D model is shown.
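Per pixel, the analysis of Eqs. (1)–(2) is a 2×2 symmetric eigen-decomposition. The sketch below shows that step in isolation; how the tensor entries are populated from the suggestive-contour image is not detailed in this excerpt, so the input values are placeholders.

import numpy as np

def ellipse_parameters(Txx, Txy, Tyy):
    """Eigen-decomposition of the 2x2 symmetric tensor at one pixel (Eqs. (1)-(2)).
    Returns (lambda1, lambda2, orientation) with lambda1 >= lambda2 and the
    orientation of the main axis e1 in degrees in [0, 180)."""
    T = np.array([[Txx, Txy], [Txy, Tyy]], dtype=float)
    eigvals, eigvecs = np.linalg.eigh(T)          # eigenvalues in ascending order
    lam2, lam1 = eigvals
    e1 = eigvecs[:, 1]                            # eigenvector of the largest eigenvalue
    theta = np.degrees(np.arctan2(e1[1], e1[0])) % 180.0
    return lam1, lam2, theta

# toy usage: one tensor per contour pixel (the values here are placeholders)
lam1, lam2, theta = ellipse_parameters(Txx=2.0, Txy=0.5, Tyy=1.0)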
Fig. 4. Top: The direction of the main axis of the ellipse model, determined by the eigenvalues of the diffusion tensor fields at the suggestive contours. Bottom: An example of the Histogram of Orientation.
3.3 Computing Similarity - Based on Histogram of Orientation
To measure similarity and efficiently retrieve the relevant 3D models in a large database, feature-based similarity measures that evaluate the features or descriptors of 3D geometry are popularly used [22–24]. We use the Histogram of Orientation (HoO) of the suggestive contours. We coin this HoO, as we do not use gradient information as in the HOG method, but second order derivatives, cf. Eq. (1). This is of course a related approach, as we measure the direction in which the gradient change is extremal, i.e. using a local coordinate frame in each pixel. As an advantage, second order derivatives are less sensitive to perturbations than gradients. The main directions given by the ellipsoidal model are distributed over several bins. The histogram of orientation is thus constructed by adding the number of suggestive contours pixels according to the main direction derived from the eigenvectors (see Figure 4). Given a pair of images, Ic and Is, of which Ic represents the suggestive contours of a query 3D model and Is is a suggestive contours image from the database, both normalized to a fixed size, we define an aligned distance score that accounts for the normalized deformation between the two images. Using the properties of the ellipsoidal representation of each contour pixel, we compute the histogram-based feature vectors Hc and Hs as follows:
1. We first extract the magnitude m(x, y) and orientation θ(x, y) of the ellipsoidal representation of each contour pixel. As mentioned before, the ellipses are defined by the eigenvalues and eigenvectors from the analysis of the suggestive contours in the topological space of diffusion tensor fields.
2. We quantize the orientation into n orientation bins weighted by the corresponding magnitude m(x, y). We quantize the orientation into 18 bins as shown in Figure 4. The quantized orientation is extracted from the direction of the main axis of the ellipsoidal representation of the suggestive contour. The main direction of the ellipsoidal model is determined by the eigenvector e1.
3. The resulting feature vectors of the histogram of orientation, Hs and Hc, are normalized by the sums of all entries.
Fig. 5. Representative 3D models used for 3D model retrieval in our experiments
4. The similarity S between the query image Ic and one view image Is of a 3D model is then given by the following equation:

S(I_c, I_s) = \frac{H_c \cdot H_s}{\|H_c\|\,\|H_s\|}    (3)
Note that the value of S(Ic, Is) lies in the interval [−1, 1]. If the histograms Hc and Hs are identical then S(Ic, Is) = 1.
5. For 3D model retrieval, we project the 3D model into 14 different viewpoints. The similarity measure between a query image and a 3D model is determined by taking max|S(Ic, Is)| over all 14 view-point similarity measures. This maximum value is obtained at the most likely view point.
The advantage of the histogram of orientation in the space of diffusion tensor fields is that it is very robust in retrieving the highly relevant 3D models – even when there is partial occlusion or translation of the query model – because the histogram of orientation features are invariant to geometric and photometric transformations of the features.
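Steps 1–5 condense into a short pipeline: an 18-bin orientation histogram weighted by magnitude, normalization, the cosine similarity of Eq. (3), and a maximum over the 14 stored views. Bin layout and normalization details are assumptions where the text leaves them open; this is a sketch, not the authors' code.

import numpy as np

def hoo(orientations_deg, magnitudes, n_bins=18):
    """Histogram of Orientation: orientations (degrees in [0,180)) of the
    ellipse main axes, weighted by their magnitudes, then normalized."""
    bins = np.linspace(0.0, 180.0, n_bins + 1)
    hist, _ = np.histogram(orientations_deg, bins=bins, weights=magnitudes)
    s = hist.sum()
    return hist / s if s > 0 else hist

def similarity(Hc, Hs):
    """Cosine similarity of Eq. (3); lies in [-1, 1] and is 1 for identical histograms."""
    return float(Hc @ Hs / (np.linalg.norm(Hc) * np.linalg.norm(Hs)))

def retrieve_score(query_hist, view_hists):
    """Similarity between a query and a 3D model = max |S| over its 14 view
    histograms; also returns the index of the most likely viewpoint."""
    scores = [similarity(query_hist, Hs) for Hs in view_hists]
    best = int(np.argmax(np.abs(scores)))
    return scores[best], best

# toy usage
query = hoo(np.random.rand(500) * 180.0, np.random.rand(500))
views = [hoo(np.random.rand(500) * 180.0, np.random.rand(500)) for _ in range(14)]
score, viewpoint = retrieve_score(query, views)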
4 Experiments In this section, we present several experiments to show our proposed methodology. We discuss i) the setup for the experiments, ii) the retrieved results and its similarity from a query 3D model, and iii) performance measurements of our proposed 3D model retrieval and the best performing method of the state-of-the-art algorithms available. 4.1 Experimental Setup We conducted several experiments to evaluate the retrieval performance of our approach. For our experiments, we used 3D mesh models from the Princeton Shape Benchmark1. We used 260 models from 13 classes, i.e. “human”, “cup”, “airplane”, “ant”, “chair”, “sunglasses”, etc. Before extracting the feature from 3D models, we first rotate, translate, and normalize the size of the 3D models to improve the robustness in extracting the features and measuring the similarity from unknown 3D models. Figure 5 shows the representative 3D models which are used in our experiments. 1
http://segeval.cs.princeton.edu
Fig. 6. 2D view of the top ranked 3D models and their similarity from a query model (leave-oneout)
Accordingly we rendered 260×14 suggestive contours images with resolution 826× 313 pixels. We used the 14 different viewpoints of a 3D model to retrieve the 3D models as described above. Literature is not consistent on this point: Funkhouser et al. [16] used 13 orthographic view-points, Chen et al. [17] used 10 shaded boundary images from 20 view-points, and Macrini et al. [18] used 128 projected images for 3D model retrieval. The performance of the 3D model retrieval is very dependent on selection of the position of the projected images and number of the images. We therefore used the equally spaced points shown in Figure 3 (middle). These points have no preferred view point, yielding unbiased results. 4.2 Evaluation of 3D Model Retrieval Using Our Proposed Approach In this section, we show the retrieved 3D models and the similarity from a query model using our approach. We first analyze the 3D model retrieval from a query 3D model and the variation of the similarity of the top ranked 3D models. Figure 6 shows the top ranked 6 models, retrieved from query models like “bird”, “ant”, “human”, and “cup”. The retrieved results from the query models have robust retrieval results, although sometimes the query retrieves the wrong 3D models in the database when objects have very similar shape and pose – and thus a similar distribution of the histogram of orientation from the projected suggestive contours. These are intentionally shown in Figure 6. Often such a misclassification occurs at a drop of the similarity measure value, as in the “bird” case.
Fig. 7. Screen shots of the demo system
Fig. 8. 3D model retrieval comparison between our approach and the DSR based approach with their first tier precision percentage
Figure 7 shows screen shots of our 3D model retrieval demo system. They show that our proposed methodology works robustly even in the presence of rotation, scaling, and shape differences from a query model. 4.3 Comparison of Retrieved Performance Since we argue that HoO of suggestive contours are particularly suitable for view-point based 3D model retrieval, we also conducted experiments concerning the difference of the retrieved results when rendering views using other features which are popularly used in 3D model retrieval. To evaluate this we present the first tier precision, defined as a percentage of k correct models within the top k retrieved models of the 3D model class for 20 times. We randomly select 5 query models from each class and tested the 3D model retrieval from other 3D models from database. Figure 8 is the comparison of first tier precision between our approach and the DSR based approach [10, 25], a hybrid form using Depth buffer, Silhouette, and Ray-extents of the polygonal meshes.
From the 13 model classes, the first tier precision of our proposed approach is already better than that of the DSR-based approach in the four classes "human", "airplane", "tool", and "bird". Even though the shape of an airplane is very similar to the bird models, we could successfully retrieve the 3D models in the database. In the case of the tool model, the first tier precision is even 96.07. The "hand" and "octopus" model classes have the lowest first tier precision in our experiment for both approaches, because the finger / tentacle models are understood as arms or feet of the other models. The DSR based 3D model retrieval approach is basically a combination of various features to correctly retrieve the 3D models. Our proposed methodology would provide better results if the features using suggestive contours in diffusion tensor fields were combined with these other features.
5 Discussion In this paper, we have presented an efficient 3D model retrieval using HoO of suggestive contours analyzed with diffusion tensor fields. To extract the meaningful features from a 3D model that have smooth contours and measure the similarity, 3D models are projected into various view-points and the suggestive contours are extracted. The suggestive contours are analyzed in the space of diffusion tensor fields, and each pixel is represented in an ellipsoidal model whose direction and scale are determined by its eigenvalues and eigenvectors. The histogram of orientation is used for input to measure the similarity and to order the similar 3D models in database. Our proposed method is very independent of the shape and pose of the query model, even though there are diverse variations present. Combining our approach with the features used in the DSR method may improve the general retrieval results and is part of future work. Based on our approach, we also find the similar projected viewpoints of the retrieved 3D models from a query model. It can be applied, for instance, to augmented reality to provide the natural human computer interaction for users. Future work therefore also involves extending our view-based retrieval approach to the partial 3D retrieval problem. To this end, interest-point-based image descriptors like SIFT seem an interesting approach to apply on the suggestive contours images.
References 1. Tangelder, J.W.H., Veltkamp, R.C.: A survey of content based 3D shape retrieval methods. Multimedia Tools Application 39(3), 441–471 (2008) 2. Ankerst, M., Kastenm¨uller, G., Kriegel, H.-P., Seidl, T.: 3D shape histograms for similarity search and classification in spatial databases. In: G¨uting, R.H., Papadias, D., Lochovsky, F.H. (eds.) SSD 1999. LNCS, vol. 1651, pp. 207–226. Springer, Heidelberg (1999) 3. Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Matching 3D models with shape distributions. In: Proceeding of Shape Modeling and Applications, pp. 154–166 (2001) 4. Elad, M., Tal, A., Ar, S.: Content based retrieval of VRML objects - an iterative and interactive approach. In: Proceeding of Eurographics Workshop on Multimedia, pp. 97–108 (2001) 5. Chen, D.Y., Tian, X.P., Shen, Y.T., Ming, O.: On visual similarity based 3D model retrieval. In: Eurographics, Computer Graphics Forum, pp. 223–232 (2003)
6. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Proceeding of the Symposium on Geometry Processing, pp. 156–164 (2003) 7. Li, B., Johan, H.: View Context: A 3D Model Feature for Retrieval. In: Advances in Multimedia Modeling, pp. 185–195 (2010) 8. DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., Santella, A.: Suggestive Contours for Conveying Shape. ACM Transactions on Graphics (Proceeding. SIGGRAPH) 22(3), 848–855 (2003) 9. Yoon, S.M., Scherer, M., Schereck, T., Kuijper, A.: Sketch based 3D model retrieval using diffusion tensor fields of suggestive contours. ACM Multimedia, 193–200 (2010) 10. Vranic, D.V.: 3D Model Retrieval. University of Leipzig, Germany, (2004) 11. Daras, P., Axenopoulos, A.: A Compact Multi-view Descriptor for 3D Object Retrieval. In: International Workshop on Content-Based Multimedia Indexing, pp. 115–119 (2009) 12. Latecki, L.J., Lakaemper, R., Eckhardt, U.: Shape Descriptors for Non-Rigid Shapes with a Single Closed Contour. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1063–6919 (2000) 13. Datta, R., Li, J., Wang, J.Z.: Content-based image retrieval: approaches and trends of the new age. In: MIR 2005: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 253–262 (2005) 14. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005) 15. Scherer, M., Walter, M., Schreck, T.: Histograms of Oriented Gradients for 3D Model Retrieval. In: International Conference on Computer Graphics, Visualization and Computer Vision (2010) 16. Funkhouser, T., Min, P., Kazhdan, M., Chen, J., Halderman, A., Dobkin, D.: A search engine for 3D models. ACM Transaction on Graphics 22(1), 83–105 (2003) 17. Chen, D.-Y., Tian, X.-P., Shen, Y.-T., Ouhyoung, M.: On visual similarity based 3D model retrieval. Computer Graphics Forum 22(3) (2003) 18. Macrini, D., Shokoufandeh, A., Dickenson, S., Siddiqi, K., Zucker, S.: View based 3D object recognition using shock graphs. In: International Conference on Pattern Recognition (2002) 19. Cyr, C.M., Kimia, B.: 3D object recognition using shape similarity based aspect graph. In: International Conference on Computer Vision, pp. 254–261 (2001) 20. Hertzmann, A.: Introduction to 3D Non-Photorealistic Rendering: Silhouettes and Outlines. In: ACM SIGGRAPH 1999 Course Notes (1999) 21. Yoon, S.M., Graf, H.: Automatic skeleton extraction and splitting in diffusion tensor fields. In: IEEE International Conference on Image Processing (2009) 22. Kazhdan, M., Chazelle, B., Dobkin, D., Funkhouser, T.: A reflective summary descriptor for 3D models. Algorithmica 38(1), 201–225 (2004) 23. Zhang, C., Chen, T.: Indexing and retrieval of 3D models aided by active learning. ACM Multimedia (2001) 24. Ip, C.Y., Lapadat, D., Sieger, L., Regli, W.C.: Using shape distributions to compare solid models. ACM Solid Modeling, 273–280 (2002) 25. Vranic, D.V.: DESIRE: a composite 3D-shape descriptor. In: IEEE International Conference on Multimedia Expo., pp. 962–965 (2005)
Adaptive Discrete Laplace Operator
Christophe Fiorio1, Christian Mercat2, and Frédéric Rieux1,3
1 LIRMM, Université Montpellier 2, 161 rue Ada, F-34392 Montpellier, France
2 IREM, S2HEP, Université Claude Bernard Lyon 1, 43 bd du 11 Nov. 1918, F-69622 Villeurbanne cedex, France
3 I3M, Université Montpellier 2, c.c. 51, F-34095 Montpellier Cedex 5, France
Abstract. Diffusion processes capture information about the geometry of an object such as its curvature, symmetries and particular points. The evolution of the diffusion is governed by the Laplace-Beltrami operator, which presides over the diffusion on the manifold. In this paper, we define a new discrete adaptive Laplacian for digital objects, generalizing the operator defined on meshes. We study its eigenvalues and eigenvectors, recovering interesting geometrical information. We discuss its convergence towards the usual Laplacian operator, especially on lattices of diamonds, and we extend this definition to 3D shapes. Finally, we use this Laplacian for classical but adaptive denoising of pictures, preserving zones of interest such as thin structures.
1 Introduction
Finding particular points on a discrete set is one of the most common problems in geometry processing applications. A particular example is to find a matching between pairs of shapes [12] and whether there exist isometric transformations between them. Another application is to find particular points that resist a local deformation of the shape [14]. A large amount of work has been done on developing signatures of a set defined by a digital mesh. Heat kernels and random walks have been widely used in image processing, for example lately by Sun, Ovsjanikov and Guibas [15] and Gebal, Bærentzen, Aanæs and Larsen [5] in shape analysis. In [15], a multi-scale signature was proposed, based on heat diffusion, in order to detect repeated structure or information about the neighborhood of a given point. This approach is connected to isometric matchings between pairs of shapes [13]. The heat kernel is also an isometric invariant; therefore, studying it on each manifold allows computing a best matching map between the two shapes. In [1], a generalization of the diffusion geometry approach is proposed, based on spectral distances. The present article adapts to the digital geometry framework the properties of the Laplace operator on meshes. The main works in geometry diffusion [15,5,13,1] are based on mesh shapes. We define, as in those previous works, a diffusion
kernel on objects which are not meshes but digital objects made of a subset of Z2 (a set of pixels) or Z3 (a set of voxels). In [4], an auto-adaptive digital process which captures information about the neighborhood of a point in a shape was introduced. We set up walkers on a digital object and observe the possibilities for them to walk along the discrete set. In this way, we compute weights corresponding to the time spent by a walker on each point of the shape. This approach can be extended to 3-dimensional sets. We propose in this paper to study the relevance of this operator through the study of its eigenfunctions. The classical eigenfunctions of the Laplace-Beltrami operator are widely used in the mesh community to recover geometrical information about shapes [11,7]. For example in [7], Bruno Lévy computes an approximation of these eigenfunctions to understand the geometry of manifolds. This paper is organized as follows. First, in Sec. 2, we describe an adaptive digital diffusion process on voxels and its associated Laplace operator. We extend this process to lattices of diamonds and we prove the convergence toward the usual Laplace operator in Prop. 5. Then, in Sec. 3, we present two particular use cases of this Laplacian, widely studied in the mesh community, in order to show its relevance. Finally, in Sec. 4, we give another well-known use of the diffusion as a classical convolution mask on gray-level images, to smooth and denoise. But the mask we use is our adaptive Laplacian, and we give examples demonstrating that it preserves thin structures.
2 Diffusion Processes
2.1 Heat Diffusion
The heat kernel $k_t$ on a manifold $M$ maps a pair of points $(x, y) \in M \times M$ to a positive real number $k_t(x, y)$ which describes the transfer of heat from $y$ to $x$ in time $t$. Starting from a (real) temperature distribution $f$ on $M$, the temperature after a time $t$ at a point $x$ is given by a convolution of the initial distribution with the diffusion kernel:
$$H^t f(x) = \int_M f(y)\, k_t(x, y)\, dy.$$
The heat equation is driven by the diffusion process: the evolution of the temperature in time is governed by the (spatial) Laplace-Beltrami operator $\Delta_M$,
$$\frac{\partial f(t, x)}{\partial t} = -\Delta_M f(t, x),$$
which presides over the diffusion on the manifold, just as it does over random walks. For a compact $M$, the heat kernel has the following eigen-decomposition:
$$k_t(x, y) = \sum_i e^{-t\lambda_i}\, \phi_i(x)\, \phi_i(y),$$
where $\lambda_i$ and $\phi_i$ are the $i$-th eigenvalue and the $i$-th eigenfunction of the Laplace-Beltrami operator, respectively.
The heat kernel $k_t(x, y)$, lately used by Sun, Ovsjanikov and Guibas [15], yields information about the geometry of the manifold. We proposed in [4] a digital diffusion process which is adaptive to the geometry of a digital object. We defined a diffusion kernel similar to the continuous one just described, and in this article we give examples of its use on 2D and 3D objects.
2.2 Auto-adaptive Process
Definition 1 (Adaptive Markov Chain). Let $\Sigma \subset \mathbb{Z}^n$ be a binary digital set, a set of voxels. We define on $\Sigma$ the discrete time Markov chain whose states are voxels, and whose transition between two adjacent voxels is constrained by:
– probability $\frac{1}{2^n}$ to move from the center of a voxel to one of its corners,
– equiprobable repartition of the walkers from a corner to its incident voxels.
To illustrate the definition, we propose an example on a 2D set (Fig. 1). We set up 24 walkers on the gray pixel to get an integer number of walkers in each pixel.
Remark 2. This standard choice amounts to walkers with no memory and only local knowledge, the celebrated short-sighted drunken man. We note this process $A_s^m$ for a walker starting at any given point, with $m$ the number of iterations of the process. The 1-step Markov process transition matrix $A_s$ is simply a weighted version of the adjacency matrix of the digital object $M$. We note $u_0$ the distribution of walkers on the digital object at time 0. We call $X_m$ the Markov chain defined by Def. 1 iterated $m$ times, and $u(m, x)$ the number of walkers at $x$ after $m$ steps starting from $u_0$, that is, the expectation as a convolution of the initial distribution with the Markov kernel:
$$u(m, x) = \mathbb{E}_x u_0(X_m) = \sum_{y \in M} u_0(y)\, A_s^m(x, y).$$
Fig. 1. Diffusion on an irregular set: (a) from pixel to corners, (b) from corners to adjacent pixels, (c) final mask after 1 step.
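To make Definition 1 concrete, the following minimal sketch (our illustration, not the authors' code) builds the 1-step transition matrix A_s and the discrete Laplacian introduced below for a 2D set of pixels; the coordinates in the usage example are hypothetical.

```python
import numpy as np

def adaptive_laplacian(pixels):
    """Build the 1-step transition matrix A_s of Definition 1 and the discrete
    Laplacian Delta_M = -(A_s - Id) for a 2D digital set (list of (i, j) pixels).
    Illustrative sketch only."""
    index = {p: k for k, p in enumerate(pixels)}
    n = len(pixels)
    A = np.zeros((n, n))
    # Corners of pixel (i, j) are the lattice points (i + di, j + dj), di, dj in {0, 1}.
    corner_owners = {}
    for (i, j) in pixels:
        for di in (0, 1):
            for dj in (0, 1):
                corner_owners.setdefault((i + di, j + dj), []).append((i, j))
    for (i, j) in pixels:
        k = index[(i, j)]
        for di in (0, 1):
            for dj in (0, 1):
                owners = corner_owners[(i + di, j + dj)]
                # probability 1/4 to reach this corner, then equiprobable
                # repartition among the incident pixels of the set
                for q in owners:
                    A[index[q], k] += 0.25 / len(owners)
    return A, np.eye(n) - A   # A_s and Delta_M = Id - A_s

# toy usage on a small L-shaped set of pixels (hypothetical coordinates)
pixels = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
A_s, laplacian = adaptive_laplacian(pixels)
print(np.allclose(A_s.sum(axis=0), 1.0))  # each column of A_s sums to 1
```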
The evolution of this expectation follows
$$u(m+1, x) - u(m, x) = \sum_{y \in M} u_0(y)\, \big(A_s^m (A_s - \mathrm{Id})\big)(x, y) \qquad (1)$$
$$= (A_s - \mathrm{Id})\, u(m, x). \qquad (2)$$
Analogously to the case of the continuous heat diffusion, the diffusion equation reads
$$\frac{\Delta u(m, x)}{\Delta m} = (A_s - \mathrm{Id})\, u(m, x).$$
We therefore define the discrete Laplacian $\Delta_M := -(A_s - \mathrm{Id})$.
Property 3. On $\mathbb{Z}$, the diffusion leads to the Gaussian binomial masks $A_s^m(x, y) = \binom{2m}{m + |x - y|}\, 2^{-2m}$.
This property was proven in [4]. The convergence of the Laplacian to the continuous one on the square lattice is a particular case of Prop. 5 on a lattice of diamonds.
2.3 Generalization to Lattice of Diamonds
In the previous section we defined a diffusion model on square pixels, with a diagonal ratio equal to one. This model can be easily extended to quadrilaterals with a diagonal ratio equal to a more general value $\rho$, leading to a generalization of Def. 1 and similar convergence results.
Definition 4. Let $\Sigma$ be a set of quadrilaterals of horizontal diagonal ratio $\rho$, that is to say a lattice of diamonds. We define on $\Sigma$ the discrete time Markov chain where the states are the quadrilaterals, and the transition between two quadrilaterals is constrained by:
– probability $\frac{\rho}{2\rho + \frac{2}{\rho}}$ to move from the center of a pixel to each of its two horizontal corners and $\frac{1/\rho}{2\rho + \frac{2}{\rho}}$ to each of the vertical ones,
– repartition on the incident quadrilaterals weighted by the distance, $\rho$ or $1$, from the corner to the center of the neighbor.
Proposition 5. Let $\Sigma$ be a set of quadrilaterals of horizontal diagonal ratio $\rho$. Then $\Delta_M$ converges toward the usual Laplace operator.
On irregular shapes, uneven adjacency between voxels produces irregular diffusion due to curvature. A similar convergence on irregular lattices has been proved in [10] on discrete conformal structures. We foresee that a similar proof will be possible for the current definition of the Laplacian; it will be the subject of future work. Moreover, in the next section we propose examples to show the relevance of the operator on irregular structures, to recover information about the geometry of shapes.
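Note that the corner probabilities in Definition 4 indeed form a probability distribution: summing over the two horizontal and the two vertical corners,
$$2\cdot\frac{\rho}{2\rho+\frac{2}{\rho}} + 2\cdot\frac{1/\rho}{2\rho+\frac{2}{\rho}} = \frac{2\rho+\frac{2}{\rho}}{2\rho+\frac{2}{\rho}} = 1.$$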
3 Application of the Laplace Operator on 2D and 3D Discrete Objects
In this section we propose an application of the discrete Laplacian on 2D and 3D discrete objects. In [7], Bruno Lévy uses the eigenfunctions of the Laplace-Beltrami operator $\Delta = \partial^2/\partial x^2 + \partial^2/\partial y^2$ of the considered object to understand its geometry or its topology. These eigenvectors are proven to be noise resistant, and a cut-off in frequency provides interesting unsupervised segmentations. A similar idea is proposed in [15], with the heat kernel signature (HKS) of a digital shape. The HKS is a natural multi-scale characterization of the neighborhood of a given point $x$. We construct a similar signature for pixel or voxel discrete shapes and show on examples that it captures information about the global geometry: given the spectrum $\mathrm{sp}(\Delta) = \{\lambda_i\}$ and eigenvectors $\Delta \phi_i = \lambda_i \phi_i$, we construct the heat kernel signature
$$k_m(x, y) = \sum_i e^{-m\lambda_i}\, \phi_i(x)\, \phi_i(y).$$
3.1 Segmentation
Eigenvectors of the Laplacian, because of their interpretation as vibration modes and their robustness to noise, have been widely used and documented in the mesh community for unsupervised clustering of protrusions and limb segmentation [7,9]. The first eigenvectors, associated with the highest eigenvalues, correspond to different "breathing" or "vibrating" modes, so that positive and negative value zones segment the object into meaningful regions. We give some examples (see Fig. 2c) of this use in the digital setup.
3.2 Heat Kernel Signature
We propose in this subsection an application of the eigenvectors of the previously defined Laplacian as a digital signature of each point. This signature, called the HKS [15], has been applied with a version of the Laplacian operator on meshes. As an example of application, we compute the same signature based on our operator on voxels in Fig. 3.
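A minimal sketch of this computation (our illustration, assuming a symmetric discrete Laplacian such as a symmetrized version of the operator above; adaptive_laplacian refers to the earlier sketch):

```python
import numpy as np

def heat_kernel_signature(laplacian, times):
    """HKS(x, m) = sum_i exp(-m * lambda_i) * phi_i(x)^2, computed from the
    eigen-decomposition of a symmetric discrete Laplacian.
    Returns an array of shape (num_points, len(times))."""
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # columns of eigvecs are the phi_i
    return np.stack(
        [(np.exp(-m * eigvals) * eigvecs ** 2).sum(axis=1) for m in times],
        axis=1,
    )

# usage with the sketch from Sec. 2 (symmetrized, since A_s need not be symmetric)
# A_s, L = adaptive_laplacian(pixels)
# signatures = heat_kernel_signature((L + L.T) / 2, times=[1, 2, 4, 8])
```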
4 Gray Level Diffusion: Application to Denoising
An application of this discrete version of the Laplacian is image denoising. The search for efficient image denoising algorithms is still active, and the body of related work on the subject is large. The main classical linear filter used is the Gaussian kernel proposed in [8]. This kernel is optimal in regular parts, but edges are blurred. Several methods are introduced in [2] to limit this blurring effect. An anisotropic weighted average to reduce the intensity of noise is proposed in [3]. These are based on extrinsic Gaussian filters, while ours is adaptive to the digital object, converging to a Gaussian filter in the isotropic case:
Fig. 2. Eigenvectors of the Laplacian on an octopus: (a) first eigenvector on the octopus, (b) second eigenvector on the octopus, (c) fourth eigenvector on the octopus, (d) second eigenvector on the 3D hand, (e) third eigenvector on the 3D hand, (f) third eigenvector of a 3D star. The eigenvectors can be interpreted as vibration modes, each one of the first eigenvectors being associated with different tentacles of the octopus. A similar analysis can be done on a 3D digital object, like a hand or a digital star.
Fig. 3. The HKS computed according to our definition of the Laplace operator: (a) heat kernel signature (HKS) computed on a digital hand, (b) another view of the hand with the HKS. In blue, the points of maximal curvature have been found efficiently; in red, the points with low curvature. We map the hand with the values of the HKS for a given time m.
We propose to define a Discrete Time Markov Chain on a gray-level image. The idea is to let a walker wander on a discrete image whose gray-level intensity represents a hilly landscape. We use the previous pixel-to-pixel transition, for different thresholds, weighted by the gray level (understood as an interest map) of the adjacent pixels: we consider high gray values as high diffusion directions, that is to say the walker prefers to climb up to the highest values of its neighborhood. Let $\{g_1, g_2, \ldots, g_8\}$ be the sorted gray intensity values of the 8-neighbors of a given pixel $p_i$, with $g_1 \le g_2 \le \ldots \le g_8$. We note $p_i^{(j)}$ the $j$-th neighbor of $p_i$, with gray level value $g_j$. We construct iteratively the convolution mask for the 8-neighborhood of $p_i$. At each iteration we look for the neighbors of $p_i$ above the current threshold, we compute the corresponding diffusion mask, then we multiply the transition probabilities by the smallest gray intensity in the set, delete these pixels from the neighborhood, and update the threshold to the next lowest value. We continue thinning the neighborhood until there are no more pixels. If all the values are equal, we only do one iteration. If all the values are different, we must build eight different masks among the $2^8$ possible. We note $\theta_i$ the number of gray intensity values that are different, and $A_s^{(k)}$ the $k$-th transition matrix of the set (this is the transition matrix where at least $k$ neighbors have been deleted). Then the final value of the mask is given by:
$$\mathrm{Mask}(i) = \frac{\sum_{k=1}^{\theta_i} A_s^{(k)}(i)\, g_k}{\sum_{k=1}^{\theta_i} g_k}.$$
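A small sketch of this combination step (our illustration; the per-threshold masks $A_s^{(k)}$ are assumed to be precomputed as described above, and the numeric values in the usage example are made up):

```python
import numpy as np

def combine_masks(partial_masks, gray_levels):
    """Combine the per-threshold diffusion masks A_s^(k) into the final adaptive
    mask of a pixel, following Mask(i) = sum_k A_s^(k)(i) g_k / sum_k g_k.
    `partial_masks`: list of 3x3 arrays, one per distinct gray level.
    `gray_levels`: the distinct sorted values g_1 <= ... <= g_theta."""
    partial_masks = np.asarray(partial_masks, dtype=float)
    gray_levels = np.asarray(gray_levels, dtype=float)
    weighted = (partial_masks * gray_levels[:, None, None]).sum(axis=0)
    return weighted / gray_levels.sum()

# toy usage: two distinct gray levels, hypothetical partial masks
m1 = np.full((3, 3), 1 / 9.0)
m2 = np.zeros((3, 3)); m2[1, 1] = 1.0
print(combine_masks([m1, m2], [50, 200]))
```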
Property 6 (Regular Gray Mask). Let $\Sigma$ be a gray-level set of pixels. If the values $\{g_0, g_1, g_2, \ldots, g_8\}$ are all equal (with $g_0$ the gray value of $p_i$), then the final mask is a classical Gaussian mask.
Proof. Since all the gray intensity values are equal, $\theta_i = 1$. Then the mask centered on the pixel $p_i$ is simply:
Fig. 4. An example of application of the adaptive gray convolution masks on a noisy version of Lena: (a) original image of Lena with Gaussian noise, (b) classical convolution with a Gaussian mask, (c) convolution with the gray-level adaptive mask. We compare the noise reduction with the classical Gaussian mask. Clearly, the blurring effect is less pronounced in the convolution with our adaptive mask; contours are highlighted by sharper contrast and reduced noise.
Fig. 5. An example of application of the adaptive gray convolution masks on a noisy image of a peacock, and on letters: (a) a noisy image of a peacock, (b) shape of interest: feathers of the peacock, (c) convolution with the gray-level adaptive mask, (d) scanned text page, (e) convolution using a Gaussian mask, (f) convolution using a fast Fourier transform bandpass filter, (g) convolution using the adaptive convolution mask. On these images, we want to preserve or enhance particular information, for example defined by a certain gray-level range. For example, in (b) we want to preserve the structure of the feathers. With a classical Gaussian mask, the fine structure is erased by the convolution; here it is not the case, and smoothing is performed along the structures. Notice also the preservation of the eyes of the feathers despite the convolution. In (d), the original image is noisy. Panel (g) reveals the contours after the convolution, preserving the structures of the letters. We compare the adaptive convolution to the classical Gaussian filter and to a bandpass filter using the fast Fourier transform.
$$\mathrm{Mask}(i) = \frac{A_s(i)\, g_0}{g_0} = A_s(i),$$
and by Property 3, $\mathrm{Mask}(i)$ is a Gaussian mask.
The aim of this construction is to build convolution masks that are adaptive to gray-level images, while on regions of uniform intensity we convolve a pixel with a mask that only depends on its distance to its neighbors (Property 6). The final convolution mask can also be seen as the 1-step transition probability of a Discrete Time Markov Chain starting from the center towards the neighbors. This diffusion allows a walker wandering on an image to diffuse faster in the highest gray values of the neighborhood. This is useful when the user, or a statistical analysis, provides an interval of gray values selecting zones of the object which are likely to be of interest, or when an interest function, such as the contrast (see Fig. 5g), is given as a gray-level map of interest. We then compute the adaptive convolution mask for this interest-level "picture" and apply it to the original image. This way, zones of similar interest (whether high or low) are smoothed out as with a regular (adaptive) Gaussian mask, but zones of different interest levels are not as mixed, the diffusion taking place mostly along the level sets of constant interest, therefore preserving, or even enhancing, the thin structures. Some results are shown in Fig. 4a, Fig. 5a and Fig. 5d. Those images are noisy and we chose particular information to preserve. In the case of the peacock, we want to preserve the thin feathers of the bird, therefore selecting by statistical analysis the range of intensities associated with the feathers as higher interest zones. A Gaussian mask would blur the feathers while our adaptive mask preserves them. For Lena, in Fig. 4c, we applied the mask of the noisy image on itself, and the result is an image which is less blurred, with a reduction of the noise. We can compare the final result with an application of a Gaussian mask in Fig. 4b.
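A possible sketch of this adaptive convolution (our illustration; adaptive_mask(interest, i, j) stands for a hypothetical helper returning the normalized 3x3 mask built from the interest map, e.g. via combine_masks above):

```python
import numpy as np

def adaptive_denoise(image, interest, adaptive_mask):
    """Convolve `image` with a per-pixel 3x3 mask derived from the `interest`
    map (e.g. the gray levels themselves, or a contrast map).
    `adaptive_mask(interest, i, j)` must return a normalized 3x3 array."""
    out = image.astype(float).copy()
    h, w = image.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            mask = adaptive_mask(interest, i, j)
            out[i, j] = (mask * image[i - 1:i + 2, j - 1:j + 2]).sum()
    return out
```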
5 Conclusion
We have described a diffusion process on a digital object made of pixels or voxels, defined as a random walk on adjacent cells, generalizing diffusion on meshes. This process allows us to define a new discrete adaptive Laplace operator. We proved that this operator converges toward the usual continuous Laplace operator on diamond lattices. As in recent works on heat kernel spectral analysis for the Laplacian on meshes, we studied some properties of its eigenfunctions on particular objects and showed that we recover information about the geometry, such as unsupervised segmentations or feature point detection. We have used this adaptive Laplacian on gray-level images to smooth and denoise images while preserving regions or features of interest such as thin tubular structures. This work can be transposed to non-binary 3D images and will be the subject of future work.
References
1. Bronstein, M.M., Bronstein, A.M.: Shape recognition with spectral distances. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 1065–1071 (2011)
2. Buades, A., Coll, B., Morel, J.-M.: Image denoising methods. A new nonlocal principle. SIAM Review 52(1), 113–147 (2010)
3. Buades, A., Coll, B., Morel, J.-M.: Self-similarity-based image denoising. Commun. ACM 54(5), 109–117 (2011)
4. Fiorio, C., Mercat, C., Rieux, F.: Curvature estimation for discrete curves based on auto-adaptive masks of convolution. In: Barneva, R.P., Brimkov, V.E., Hauptman, H.A., Natal Jorge, R.M., Tavares, J.M.R.S. (eds.) CompIMAGE 2010. LNCS, vol. 6026, pp. 47–59. Springer, Heidelberg (2010)
5. Gebal, K., Bærentzen, J.A., Aanæs, H., Larsen, R.: Shape Analysis Using the Auto Diffusion Function. In: Konrad et al. (eds.) [6], pp. 1405–1413
6. Konrad, P., Marc, A., Michael, K. (eds.): Symposium on Graphics Processing. Eurographics Association (2009)
7. Lévy, B.: Laplace-Beltrami eigenfunctions towards an algorithm that "understands" geometry. In: SMI, p. 13. IEEE Computer Society, Los Alamitos (2006)
8. Lindenbaum, M., Fischer, M., Bruckstein, A.M.: On Gabor's contribution to image enhancement. Pattern Recognition 27(1), 1–8 (1994)
9. Mateus, D., Horaud, R., Knossow, D., Cuzzolin, F., Boyer, E.: Articulated shape matching using Laplacian eigenfunctions and unsupervised point registration. In: CVPR. IEEE Computer Society, Los Alamitos (2008)
10. Mercat, C.: Discrete Riemann surfaces and the Ising model. Comm. Math. Phys. 218(1), 177–216 (2001)
11. Nadirashvili, N., Tot, D., Yakobson, D.: Geometric properties of eigenfunctions. Uspekhi Mat. Nauk 56(6(342)), 67–88 (2001)
12. Ovsjanikov, M., Mérigot, Q., Mémoli, F., Guibas, L.: One point isometric matching with the heat kernel. In: Eurographics Symposium on Geometry Processing (SGP), vol. 29 (2010)
13. Ovsjanikov, M., Mérigot, Q., Mémoli, F., Guibas, L.J.: One point isometric matching with the heat kernel. Comput. Graph. Forum 29(5), 1555–1564 (2010)
14. Rustamov, R.M.: Laplace-Beltrami eigenfunctions for deformation invariant shape representation. In: Belyaev, A.G., Garland, M. (eds.) Symposium on Geometry Processing. ACM International Conference Proceeding Series, vol. 257, pp. 225–233. Eurographics Association (2007)
15. Sun, J., Ovsjanikov, M., Guibas, L.: A Concise and Provably Informative Multi-Scale Signature Based on Heat Diffusion. In: Konrad et al. (eds.) [6], pp. 1383–1392
Stereo Vision-Based Improving Cascade Classifier Learning for Vehicle Detection
Jonghwan Kim, Chung-Hee Lee, Young-Chul Lim, and Soon Kwon
IT Convergence Research Department, Daegu Gyeongbuk Institute of Science & Technology, Republic of Korea
[email protected]
Abstract. In this article, we describe an improved method of vehicle detection. AdaBoost, a classifier trained by adaptive boosting and originally developed for face detection, has become popular among computer vision researchers for vehicle detection. Although it is the choice of many researchers in the intelligent vehicle field, it tends to yield many false-positive results because of the poor discernment of its simple features. Its processing speed is also excessively slow, as the classifier's detection window usually searches the entire input image. We propose a solution that overcomes both of these disadvantages. The stereo vision technique allows us to produce a depth map, providing information on the distances of objects. With that information, we can define a region of interest (RoI) and restrict the vehicle search to that region only. This method simultaneously blocks false-positive results and reduces the computing time for detection. Our experiments prove the superiority of the proposed method.
1 Introduction
In recent years, researchers have become increasingly interested in developing driver assistance systems (DAS). A DAS plays a critical role in the intelligent vehicle research field and will be the basis of driverless vehicle technology. To develop a DAS, researchers use various sensors to replace the human senses. Some already in use are parking assist sensors (ultrasonic), cruise control (radar), and night vision (infra-red). Among the human senses, however, the visual sense is the most important for driving: it is impossible to drive a vehicle without it. The driver uses the eyes to gather information about the road as well as the positions and relative movements of other vehicles and pedestrians. To replace the eyes, many researchers use a vision sensor whose structure and function allow it to mimic the eye. It is evident that computer vision techniques are also increasingly popular among contestants in the DARPA (Defense Advanced Research Projects Agency) Grand Challenge competition for driverless vehicles. In this article, we introduce a vehicle detection method that uses computer vision techniques to detect vehicles moving in the same direction as the vehicle in which it is mounted. It incorporates CCD (charge-coupled device) vision sensors mounted in our experimental vehicle and looking forward. Since the advent of the adaptive boosting (AdaBoost) method for classifier training [1] in the computer vision field, many researchers have applied it to object classification.
Viola and Jones applied it to face detection and observed good detection performance [2]. Their technique combines Haar-like features with the AdaBoost classifier. Many researchers [3-6] have applied the Viola and Jones method to vehicle detection, and they too report comparatively good performance. The system's high detection rate, however, comes with many false positives (caused by the poor discernment of Haar-like features) and a slow processing time (caused by the large area searched by the detection window). These two disadvantages make it difficult to implement a real system, as traffic moves through rapidly changing environments. In the next section, we propose a way of improving both the accuracy and the speed of detection. By using a stereo vision technique to create a depth map, we can define a RoI, and the detection window searches that region only for vehicles. The setting of a search range reduces the detection processing time and blocks false-positive results from irrelevant regions. In a later section, we prove the superior performance of our method by experiment. There is also a conclusion section at the end of this article.
Fig. 1. Image Database Extraction
2 Methodology
In this section, we explain the techniques of our improved vehicle detection method. It is based on Viola and Jones's original work. We make a training database and train the classifier using a combination of Haar-like features. However, this technique has already been well documented by others [2-6], so it is not described in detail here. We introduce a method of cascade classifier setting based on our experimental results tables, and then propose our improved vehicle detection method using stereo vision. Finally, we present our entire system.
2.1 Database and Training
1) Database Configuration: To train the AdaBoost classifier, we need an image database made up of positive and negative image patches. In our case, the positive image patches look like vehicles and the negative image patches are completely unlike vehicles. As shown in Figure 1, we extract the positive image patches from road traffic images. These are captured by CCD vision sensors and show the rear-side appearance of vehicles. Each vehicle's image is then cropped square, and as tightly as possible without losing too much of the vehicle or including too much background. Square images make for easier AdaBoost training: when image patches are put into the AdaBoost training algorithm, they must be converted to a square shape, which distorts them. To prevent this distortion, we crop the image patches to squares.
Fig. 2. Vehicle Images Configuration by Color
The patches are now numbered according to the depth of the vehicle color (the CCD images are grayscale). Vehicles are categorized as being nearer to white or nearer to black, with silver counting as white and highly saturated colors as black. Figure 2 shows an example of this database configuration. We expect this database normalization to give consistent performance as far as vehicle color is concerned. Negative image patches, none of which look anything like vehicles, are randomly extracted from the backgrounds of the same road traffic images. They, too, are cropped square.
2) AdaBoost Training: We use the adaptive boosting training method for vehicle detection. As mentioned above, it is based on the Viola and Jones face detection method [2], so the detailed descriptions are omitted. Instead, we introduce our system settings in detail. The first step is to obtain the weak classifiers from the adaptive boosting. As shown in Figure 3, the algorithm extracts one weak classifier per iteration.
Fig. 3. AdaBoost training
Fig. 4. Example of Weak Classifiers Extraction
During the training iterations, we obtain the weak classifiers as the weights are updated. Figure 4 shows an example. We observe that most of the weak classifiers are extracted from the vehicles' lower areas, because the appearance of the rear tires is common to our vehicle patches. In the input image, the appearance of this part is important in deciding the detection rate. The size of the detection window is decided at the initialization of the training process, and that size determines the computing time of the system and the detection rate. Table 1 shows the system performance according to window size, tested on 3,330 image frames.
Table 1. Performance by Detection Window Size
Window Size | Frame/Sec | Max. F-Measure
16 x 16     | Ave. 33   | 0.82
18 x 18     | Ave. 32   | 0.87
20 x 20     | Ave. 30   | 0.93
22 x 22     | Ave. 22   | 0.89
24 x 24     | Ave. 16   | 0.92
28 x 28     | Ave. 14   | 0.91
The equations below define recall, precision, and F-measure [10]:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives. We used the information in Table 1 to choose a window of 20 by 20 pixels.
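For reference, a tiny sketch of these metrics (our illustration; the counts in the usage example are hypothetical):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and balanced F-measure from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# e.g. 930 true positives, 70 false positives, 70 missed vehicles
print(precision_recall_f(930, 70, 70))
```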
The next step is to combine the weak classifiers into strong classifiers.
2.2 Vehicle Detection
1) Image Pyramid Input: In road traffic images, the apparent size of vehicles depends on perspective. For scale-invariant detection, we use an image pyramid (or scale-space) when inputting the images to the classifier. Figure 5 shows an example.
Fig. 5. Image Pyramid (Scale-Space)
The parameter R is the ratio between the upper and lower layers of the pyramid, and affects the speed and performance of the entire system. If the rescale ratio is small, detection accuracy increases, but the broader search range increases the search time too. If R is too large, the opposite is true. Table 2 shows the performance according to the rescale rate. We chose 1.2 as our parameter.
Table 2. Performance by Image Pyramid Rescale-Rate
Rescale-Rate | Frame/Sec | False-Positive | False-Negative
1.05         | 9         | 462            | 12
1.1          | 17        | 161            | 21
1.2          | 29        | 120            | 68
1.3          | 34        | 82             | 97
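As an illustration of the pyramid input described above (our sketch, not the authors' implementation; it uses OpenCV's resize, and `gray_frame` in the usage comment is a hypothetical input image):

```python
import cv2

def image_pyramid(image, rescale_rate=1.2, min_size=20):
    """Yield successively downscaled versions of `image` until the smaller side
    drops below `min_size` (the detection window side)."""
    layer = image
    while min(layer.shape[:2]) >= min_size:
        yield layer
        new_w = int(layer.shape[1] / rescale_rate)
        new_h = int(layer.shape[0] / rescale_rate)
        if new_w < 1 or new_h < 1:
            break
        layer = cv2.resize(layer, (new_w, new_h))

# usage: for layer in image_pyramid(gray_frame): run the cascade on `layer`
```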
2) Cascade Classifier: Figure 6 shows the cascade structure of the classifier. Each cascade has a strong classifier.
Fig. 6. Cascade Classifier
Table 3. Performance by Cascade Level
Cascade | Frame/Sec | False-Positive | False-Negative
10      | Ave. 36   | 2041           | 8
12      | Ave. 31   | 327            | 24
14      | Ave. 30   | 121            | 22
16      | Ave. 21   | 110            | 68
18      | Ave. 14   | 119            | 81
When an image enters the first cascade, its strong classifier searches for vehicles; if it judges that there is a vehicle, it passes the window to the next cascade, but rejects it if it does not. The cascade structure reduces the processing time and increases the accuracy of the detection results. This is because the strong classifier of the next cascade does
not search for a vehicle in a window rejected by a prior cascade. However, the cascade level also determines both the processing speed and the detection performance of the classifier, as shown in Table 3. From this, we chose 14 levels of cascade.
2.3 Proposed Method
We propose overcoming the AdaBoost classifier's disadvantages, discussed in the Introduction, by using a stereo vision technique. Figure 7 represents our proposed method and the entire system. Two images are input from the stereoscopic CCD sensors. In the stereo matching part, we compute the depth information using matching cost calculation and optimization. The matching cost calculation is based on normalized cross correlation (NCC), and the optimization uses a global matching method based on hierarchical belief propagation (HBP). This imposes a heavy computational burden, so our approach uses a hardware implementation for the stereo matching part [7]. Thanks to this hardware implementation, real-time processing is possible in the stereo matching part. Armed with the depth maps, we can set the RoI using a v-disparity-based method [8]. The detection window does not search the irrelevant region outside the RoI, thus not only saving the time that would otherwise have been spent searching it, but also obviating the possibility of finding any false positives in it. We have the added advantage that the depth information gives us accurate locations of the detected vehicles. This accurate distance information would also be useful for an automatic collision avoidance system.
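A schematic sketch of the cascade evaluation restricted to an RoI (our illustration; `stages` is a list of strong-classifier functions and `roi` a (x, y, w, h) rectangle obtained from the v-disparity step, both hypothetical names):

```python
def detect_in_roi(image, stages, roi, window=20, step=4):
    """Slide a `window` x `window` detector inside `roi` only; a window is
    accepted when every cascade stage accepts it, and rejected at the first
    stage that says no."""
    x0, y0, w, h = roi
    detections = []
    for y in range(y0, y0 + h - window + 1, step):
        for x in range(x0, x0 + w - window + 1, step):
            patch = image[y:y + window, x:x + window]
            if all(stage(patch) for stage in stages):
                detections.append((x, y, window, window))
    return detections
```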
Fig. 7. Entire System Flow Chart
3 Experiments
For the experiments, we used a computer with an Intel Core2 Quad™ 2.67 GHz CPU and 4 GB RAM, running Windows 7 Pro™. The matching part (H/W) is implemented on a Xilinx Virtex-5 FPGA and the rest in Microsoft Visual Studio 2005™. For real-time processing we use the OpenCV library and a multi-threading technique. The input image is grayscale VGA (640 × 480 pixels). Figures 8–11 show the experimental results for the system's performance. Four scenarios were tested: of the 2356 frames used, 401 had a complex background (Figure 8), 353 showed various vehicle sizes (Figure 9), 1001 showed various vehicle poses (Figure 10), and 601 included multiple vehicles (Figure 11). For all scenarios, the vehicle detection system based on stereo vision performs better than one with mono vision. The first images show false positives among the mono vision results; the second images use the same frames, and the stereo system has detected no false positives at all.
Fig. 8. Complex Background Scenario Results
Fig. 9. Size Variation Scenario Results
Fig. 10. Pose Variation Scenario Results
Fig. 11. Multi Objects Scenario Results
The graphs show ROC (receiver operating characteristic) curves. The vertical axis represents "precision" while the horizontal axis represents "recall" (the formulae for evaluating these were given in Sect. 2.1). A perfect result would be (1, 1), so the system with the better detection performance is the one with the closest approach to (1, 1). The graphs of stereo vision are closer to (1, 1), which proves the superiority of our method. The tables in Figure 8 show the average computing time and F-measure (the formula for this was also given in Sect. 2.1). In all scenarios, the system using stereo vision has a shorter processing time and a better detection rate, as evidenced by the F-measure. We show some example videos on YouTube [11-16].
4 Conclusion
Recently, as the 3D TV market has grown, interest in stereo vision has been growing too. Stereo vision technology was derived from the structure and function of the human eye. Both stereo detectors and eyes can gather information about the distance of objects, a property useful for intelligent vehicle research. By making use of stereo vision to set a RoI, we have achieved an improvement in detection accuracy and speed, and can determine the locations of detected vehicles. As in our work, the depth information from stereo vision techniques will be utilized more and more in intelligent vehicle design.
Acknowledgements. This work was supported by the Daegu Gyeongbuk Institute of Science and Technology R&D Program of the Ministry of Education, Science and Technology of Korea (11-IT-01).
References
1. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proceedings of the 13th International Conference on Machine Learning, ICML 1996, pp. 148–156 (1996)
2. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, CVPR 2001, pp. 511–518 (2001)
3. Khammari, A., Nashashibi, F., Abramson, Y., Laurgeau, C.: Vehicle detection combining gradient analysis and AdaBoost classification. In: Intelligent Transportation Systems Conference, pp. 66–71 (2005)
4. Alefs, B.: Embedded Vehicle Detection by Boosting. In: Intelligent Transportation Systems Conference, pp. 536–541 (2006)
5. Alefs, B., Schreiber, D.: Accurate Speed Measurement from Vehicle Trajectories using AdaBoost Detection and Robust Template Tracking. In: Intelligent Transportation Systems Conference, pp. 405–412 (2007)
6. Premebida, C., Ludwig, O., Silva, M., Nunes, U.: A cascade classifier applied in pedestrian detection using laser and image-based features. In: Intelligent Transportation Systems Conference, pp. 1153–1159 (2010)
7. Kwon, S., Lee, C.-H., Lim, Y.-C., Lee, J.-H.: A sliced synchronous iteration architecture for real-time global stereo matching. In: Proc. of SPIE-IS&T Electronic Imaging, SPIE vol. 7543 (754312-1) (January 2010)
8. Lee, C.-H., Lim, Y.-C., Kwon, S., Lee, J.-H.: Stereo vision-based vehicle detection using a road feature and disparity histogram. Optical Engineering 50(2) (February 2011)
9. Lim, Y.-C., Lee, M., Lee, C.-H., Kwon, S., Lee, J.-H.: Improvement of stereo vision-based position and velocity estimation and tracking using a stripe-based disparity estimation and inverse perspective map-based extended Kalman filter. Optics and Lasers in Engineering 48, 859–868 (2010)
10. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
11. http://www.youtube.com/watch?v=asfHLxQMhIw
12. http://www.youtube.com/watch?v=F1ef3Oey0qQ
13. http://www.youtube.com/watch?v=API5JC9_mDo
14. http://www.youtube.com/watch?v=dH8Onu9LIo8
15. http://www.youtube.com/watch?v=qbLS9ghoU3o
16. http://www.youtube.com/watch?v=kN1pVN8vNTg
Towards a Universal and Limited Visual Vocabulary
Jian Hou1, Zhan-Shen Feng1, Yong Yang2, and Nai-Ming Qi2
1 School of Computer Science and Technology, Xuchang University, China, 461000
2 School of Astronautics, Harbin Institute of Technology, Harbin, China, 150001
Abstract. Bag-of-visual-words is a popular image representation and attains wide application in the image processing community. While its potential has been explored in many aspects, its operation still follows a basic mode, namely, for a given dataset, using k-means-like clustering methods to train a vocabulary. The vocabulary obtained this way is data dependent, i.e., with a new dataset, we must train a new vocabulary. Based on previous research on determining the optimal vocabulary size, in this paper we investigate the possibility of building a universal and limited visual vocabulary with optimal performance. We analyze why such a vocabulary should exist and conduct extensive experiments on three challenging datasets to validate this hypothesis. As a consequence, we believe this work sheds new light on finally obtaining a universal visual vocabulary of limited size which can be used with any dataset to obtain the best or near-best performance.
1 Introduction
Representing an image with a bag-of-visual-words has become a popular paradigm and attained success in many image processing tasks such as object classification and image retrieval. In this approach, salient image regions (keypoints) in training images are detected and described with descriptors. These descriptors are then pooled together and clustered into a number of groups. By treating each group as a visual word, we can represent an image as a distribution over the set of visual words [1,2]. The basic bag-of-visual-words representation ignores the spatial relationships among keypoints, which have been shown to be useful in object recognition and classification tasks [3,4]. To encode spatial information in the representation, [5] proposes to partition an image in a pyramidal fashion and compute a histogram in each sub-region. This spatial pyramid matching method is shown to produce superior classification results on several image datasets. The problem has been addressed in [6,7] with different approaches. For a given dataset, some visual words in the vocabulary may be more informative than the others. This feature has been exploited to design various weighting schemes for visual words [8,9,4] and reduce the vocabulary size [10,11]. In order to improve recognition efficiency, [9] designs a vocabulary tree by hierarchical k-means clustering, which is shown to be well adapted to a very large vocabulary and dataset.
Although bag-of-visual-words has been studied in various aspects, as reviewed above, the determination of an appropriate vocabulary size for a given dataset is rarely touched upon. Most existing works select the vocabulary size empirically [1,12,5,13], and the adopted sizes range from hundreds to tens of thousands. There is still no guidance on selecting a vocabulary size to obtain the best performance. This further implies that the research is still implicitly based on the assumption that the optimal visual vocabulary is data dependent. This assumption explains why, with a given dataset, researchers choose to train a vocabulary from the dataset instead of using an existing universal vocabulary. Experiments in [4,14] indicate that the vocabulary size has a significant impact on the performance. For a given dataset, there seems to exist an optimal vocabulary size, and either a smaller or a larger size leads to a deviation from the best performance. The existence of an optimal vocabulary size actually implies that if two features are similar enough, they should be treated as one single visual word, and not as two separate visual words, to obtain the best performance. In previous work [15] we expressed this conclusion with a clustering threshold thc, where the cosine similarity of one visual word with all features it represents is above thc. From this research we found that one type of feature (defined by thc), instead of one single feature, should be treated as a visual word to obtain the best performance. This, in turn, indicates that with a given descriptor, all possible image patterns can be mapped to a limited vocabulary. This vocabulary can then be used as a universal, data-independent vocabulary in image processing tasks to obtain optimal performance. After we started the work in this paper, we noticed an independent work [16] that addresses the problem of deriving a universal vocabulary. We would like to highlight the difference between our work and the paper [16]. In [16] the authors wonder if one has to train a new vocabulary for a specific dataset instead of using an existing universal vocabulary. They then empirically found that when the amount of training images is large enough, the vocabulary trained from one dataset can be used on other datasets without apparently harming the performance. In our paper, by contrast, the existence of a universal vocabulary is a natural hypothesis derived from our previous research on the optimal vocabulary size [15], and we conduct experiments to validate the hypothesis. In [16] the vocabulary size is user-defined. This means that an inappropriate selection of vocabulary size may yield a vocabulary which is universal but performs moderately. In our work, the vocabulary size is automatically selected to be the optimal one. To sum up, the vocabulary obtained with our approach is universal, optimal and compact, in that it can be used on different datasets to obtain the optimal or sub-optimal performance with a small computational load (the vocabulary size is one to several thousands). This paper is structured as follows. In Section 2 we briefly review and improve the method of determining the optimal vocabulary size described in [15], which serves as the basis of this paper. Section 3 presents our work on exploring the possibility of obtaining a universal and limited visual vocabulary. Section 4 concludes the paper.
2 Optimal Vocabulary Size
In [4] the authors conclude through extensive experiments that for one dataset, there exists an optimal vocabulary size. Smaller sizes lead to a dramatic decrease in classification performance, while larger sizes level off or mildly decrease the performance. The observation that the optimal size is smaller than the largest possible size (i.e., the total number of training descriptors) implies that there exists some criterion for when a set of features should be mapped to the same visual word. [15] models this criterion by a clustering threshold thc and a similarity based clustering procedure.
2.1 New Clustering Procedure
Unlike k-means clustering, here the number of clusters is controlled by thc. The clustering procedure requires the cosine similarity of all features in a cluster with their center to be above thc. [15] presented a simple procedure to implement this similarity based clustering and used the resulting number of clusters as the optimal vocabulary size. However, just as the authors pointed out in the same paper, the clustering procedure in [15] is not stable, and different orders of descriptors lead to different numbers of clusters. The reason is that the constraint for the similarity based clustering procedure is not strict enough. Besides the requirement that the similarity of all features with their center be above thc, the number of resulting clusters should be minimized. Denoting by $f_{ij}$ the $j$-th feature mapped to cluster $i$, and by $f_{ic}$ the center of cluster $i$, the problem can be stated as
$$\min \; N_{cluster} \quad \text{s.t.} \quad S(f_{ij}, f_{ic}) > th_c, \quad i = 1, \ldots, N_{cluster}, \; j = 1, \ldots, N_i, \qquad (1)$$
where $N_{cluster}$ is the resulting number of clusters, $N_i$ is the number of features in cluster $i$, and $S(\cdot, \cdot)$ denotes the cosine similarity of two features. Based on this new constraint, we improve the clustering procedure to be as follows:
1. Label all training descriptors as ungrouped.
2. Label the first ungrouped descriptor as the center of one cluster.
3. Compare each ungrouped descriptor with the center, and add it into the cluster if the similarity is above thc.
4. Return to Step 2 until all descriptors are grouped.
5. Calculate the new center of each group, and use the number of descriptors in the group as its weight.
6. Sort the centers by weight in decreasing order.
7. Compare all descriptors with each center in order and add them to the corresponding cluster if the similarity is above thc.
8. If there are descriptors left ungrouped, repeat Steps 2 to 3 to cluster them into new groups.
9. Repeat Steps 5 to 8 a certain number of times.
Steps 1 to 4 describe the original clustering procedure presented in [15]. By adding the iterations of Steps 5 to 8, we enforce that the cluster centers are concentrated in high-density areas of the feature space and thus reduce the number of clusters. One may argue that this clustering procedure is not guaranteed to converge and minimize the number of clusters. However, in all our experiments the number of clusters tends to be stable after 5 iterations. Recalling that the visual-word performance is not very sensitive to small changes in vocabulary size, in this paper we use the result of 10 iterations as the optimal vocabulary size in all experiments. In [15] the authors derive the optimal clustering threshold as 0.8 empirically. As we use a new and improved version of similarity based clustering, in the first step we need to derive a new optimal clustering threshold and confirm that it really produces optimal vocabulary sizes for different datasets.
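A compact sketch of this similarity-based clustering (our illustration of the steps above, assuming L2-normalized descriptors so that cosine similarity reduces to a dot product):

```python
import numpy as np

def similarity_clustering(descriptors, th_c=0.75, iterations=10):
    """Greedy similarity-based clustering controlled by the threshold th_c.
    `descriptors` is an (N, d) array of L2-normalized features; the number of
    returned centers is used as the vocabulary size. Illustrative sketch only."""
    X = np.asarray(descriptors, dtype=float)

    def assign(centers):
        """Assign each descriptor to the first center with similarity > th_c
        (in the given order); grow new centers from leftover descriptors."""
        centers = list(centers)
        labels = np.full(len(X), -1)
        for k, c in enumerate(centers):
            sim = X @ c
            labels[(labels < 0) & (sim > th_c)] = k
        while (labels < 0).any():
            seed = X[np.argmax(labels < 0)]
            centers.append(seed)
            sim = X @ seed
            labels[(labels < 0) & (sim > th_c)] = len(centers) - 1
        return np.array(centers), labels

    centers, labels = assign([])            # Steps 1-4
    for _ in range(iterations):             # Steps 5-9
        means, weights = [], []
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):
                m = members.mean(axis=0)
                means.append(m / np.linalg.norm(m))
                weights.append(len(members))
        order = np.argsort(weights)[::-1]   # Step 6: heaviest centers first
        centers, labels = assign([means[k] for k in order])
    return centers
```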
2.2 Experiments with New Clustering Procedure
Unlike [15], which derives the optimal clustering threshold through straightforward local descriptor matching, here we adopt a more direct way. Firstly, we use several clustering thresholds to produce corresponding vocabulary sizes, and select the one that performs the best. In the second step we compare the selected size with other candidate sizes to see if this size is the optimal one. We would like to point out here that the similarity based clustering is only used to determine the vocabulary size. In vector quantization with all vocabulary sizes, the clustering is done with the k-means method. By doing so, we ensure that the performance difference is not due to different clustering methods. We use three diverse datasets in our SVM classification experiments. The first one is Caltech-101 [17], where we randomly select 30 images from each object class and split them into 15 training and 15 testing. The second dataset is Scene-15 [18,19,5], with images of 15 scene categories and 200 to 400 images in each category. Figure 1 shows some example images. We use 100 randomly selected images per class for training and all the others for testing. The Event-8 dataset [20] is adopted as our third dataset; it contains 8 sports event categories with 130 to 250 images in each category (see Figure 2 for sample images). In the experiments, 70 images per class are used for training and 60 other images for testing. For efficiency reasons, the images are all compressed in size. In all experiments we use SIFT keypoints and descriptors [21]. We build bag-of-visual-words histograms on the whole image (i.e., at spatial pyramid level 0). Since it is shown in [4] that inverse document frequency does not improve classification performance, here we use the simple binary (bi) and term-frequency (tf) weighting schemes to build linear kernels in multi-class SVM classification trained with the one-versus-all rule. In the experiments we use 3 training-testing splits and report the average percentage of images classified correctly. Note that in all our experiments we use visual words without spatial information or special kernels; therefore we do NOT expect to obtain superior classification performance comparable to the state-of-the-art. What really counts here is the trend of the recognition rates with respect to the vocabulary sizes.
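As an illustration of the two weighting schemes (our sketch, assigning each descriptor to its nearest visual word by Euclidean distance):

```python
import numpy as np

def bow_histograms(descriptors, vocabulary):
    """Binary (bi) and term-frequency (tf) bag-of-visual-words histograms of one
    image, given its local descriptors and a vocabulary of visual words."""
    D = np.asarray(descriptors, dtype=float)
    V = np.asarray(vocabulary, dtype=float)
    # nearest visual word for every descriptor
    dists = ((D[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    counts = np.bincount(words, minlength=len(V)).astype(float)
    tf = counts / max(counts.sum(), 1.0)      # term-frequency weighting
    bi = (counts > 0).astype(float)           # binary weighting
    return bi, tf
```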
Fig. 1. Sample images of Scene-15 dataset. Two images per category are displayed with five categories in one row. From left to right and top to bottom, the categories are bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office and store.
Fig. 2. Sample images of Event-8 dataset. Two images per category are displayed with four categories in one row. From left to right and top to bottom, the categories are badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snow boarding.
Firstly, we compare the performance of 4 candidate clustering thresholds: 0.7, 0.75, 0.8 and 0.85. The four sizes calculated with similarity based clustering are 544, 2323, 12593 and 88328 for Caltech-101, 455, 1790, 9208 and 59539 for Event-8, and 560, 2378, 13124 and 92735 for Scene-15. The classification rates with bi and tf are reported in Figure 3, where we use clustering thresholds instead of the specific vocabulary sizes to show the trend more evidently. We then compare the vocabulary sizes from the optimal clustering thresholds with other sizes (100, 1000, 10000, 50000 and 100000) to check whether they still perform the best. The results are shown in Table 1. As the optimal sizes corresponding to clustering thresholds 0.75 and 0.8 are different for different datasets, in the leftmost column we use thc = 0.75 and thc = 0.8 to represent their respective sizes.
[Figure 3: two plots of recognition rate versus clustering threshold (0.65 to 0.9) for Caltech-101, Event-8 and Scene-15; (a) bi weighting, (b) tf weighting.]
Fig. 3. Recognition rates of different clustering thresholds. With all three datasets and both weighting schemes, the clustering thresholds 0.75 and 0.8 produce the best or near-best performance.
Table 1. Classification rates of different vocabulary sizes with bi and tf weighting. The sizes corresponding to clustering thresholds 0.75 and 0.8 produce the best or near-best results in all cases.
Size        | Caltech-101 (bi / tf) | Event-8 (bi / tf) | Scene-15 (bi / tf)
100         | 8.4 / 17.6            | 31.3 / 49.0       | 30.7 / 53.0
1000        | 25.0 / 24.0           | 56.3 / 57.5       | 57.0 / 58.2
10000       | 25.4 / 24.5           | 58.1 / 59.2       | 59.4 / 58.8
50000       | 25.5 / 23.4           | 57.9 / 56.5       | 57.0 / 55.4
100000      | 25.1 / 23.5           | 55.6 / 55.0       | 56.2 / 54.8
thc = 0.75  | 26.1 / 26.3           | 55.8 / 57.9       | 57.8 / 57.8
thc = 0.8   | 26.0 / 25.1           | 60.0 / 59.8       | 59.4 / 58.1
It’s evident from Table 1 that the vocabulary sizes from thc = 0.8 or its closest neighbor 10000 performs the best or near-best among all sizes. This confirm that thc = 0.8 does produce the optimal vocabulary size. We also note that the performance of thc = 0.75 or its closest neighbor 1000 is rather similar to thc = 0.8. Taking into account the small performance difference and large size difference between 0.8 and 0.75, we recommend to select 0.75 as the optimal clustering in practical applications. Furthermore, it has been shown that the performance of a visual vocabulary is not very sensitive to its size, only if the size is not too small. Our experiments indicate that for a common dataset of about the size of Caltech-101, 1000 or 2000 might be a suitable vocabulary size. Adopting a larger size usually does not pay off.
3 Universal Visual Vocabulary
Traditionally we think that the optimal visual vocabulary is data dependent. However, the existence of an optimal vocabulary size smaller than the number of
training descriptors implies that when some descriptors are similar enough, they should be clustered into one group and represented by one visual word. Moreover, the optimal clustering threshold sets a criterion for descriptors to be mapped to the same visual word [15]. Let us say a descriptor represents an image pattern and a cluster determined by thc represents one type of image pattern. We know there is a myriad of image patterns; however, over all possible image patterns, the number of image pattern types is limited. This, in turn, means that a universal, limited and optimal vocabulary should exist. Theoretically, it is possible to enumerate all the possible visual words with the optimal clustering threshold thc. However, we are not sure if all these image patterns (corresponding to all visual words) will appear frequently in images. In other words, by enumerating all possible image pattern types we may obtain a visual vocabulary that is complete but of a very large size, even though many of these image patterns may rarely appear in real images. This would cause an unnecessary computational load. Therefore, in this paper we resort to empirical methods. It is out of the scope of this paper to produce such a universal vocabulary. Instead, we will show empirically that obtaining such a universal vocabulary is not only theoretically sound, but practically feasible. Recall that in the last section we computed the optimal vocabularies for three datasets, which we refer to as voc-caltech, voc-event and voc-scene, respectively. We interchange the roles of datasets and vocabularies to check whether different vocabularies produce a large difference in performance on the same datasets. Take voc-caltech for example: we use it on Event-8 and Scene-15 and see if it performs comparably to voc-event and voc-scene, respectively. The comparison is shown in Figure 4. Contrary to the traditional viewpoint that a good vocabulary is data dependent, we found from the comparison in Figure 4 that on each dataset, the vocabularies trained from the three datasets perform rather similarly. This seems to imply that the vocabularies trained from different datasets have a rather large portion of visual words in common. In order to validate this observation, we calculate the pairwise similarity between the three vocabularies. Specifically, for each visual word in one vocabulary, we compute its cosine similarity with its closest counterpart in the other vocabulary. For all 6 cases, Caltech-Event, Event-Caltech, Caltech-Scene, Scene-Caltech, Event-Scene and Scene-Event, almost all visual words have a similarity above 0.9 with their counterparts in the other vocabulary, and over 60% of the visual words have a similarity above 0.95. These results further confirm that the three vocabularies are very similar to each other. This is interesting, since almost identical vocabularies are obtained from three different datasets. This observation, together with the fact that all three datasets consist of various and diverse types of objects, leads us to believe that there does exist a universal visual vocabulary. The difference in the appearance of images is only caused by the different distributions of visual words from the vocabulary. In [16] the authors conclude that with a vocabulary size large enough, the vocabularies trained from different datasets are exchangeable without evidently harming the classification performance. Therefore a large vocabulary needs to be computed only once. It is, however, not clear when the vocabulary size is large enough.
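A small sketch of this pairwise vocabulary comparison (our illustration, assuming L2-normalized visual words; the variable names in the usage comment are hypothetical):

```python
import numpy as np

def vocabulary_similarity(voc_a, voc_b):
    """For each visual word of voc_a, the cosine similarity with its closest
    counterpart in voc_b (both given as arrays of normalized words)."""
    sims = np.asarray(voc_a) @ np.asarray(voc_b).T   # all pairwise cosine similarities
    return sims.max(axis=1)

# e.g. fraction of words of voc_caltech with a > 0.9 match in voc_event
# print((vocabulary_similarity(voc_caltech, voc_event) > 0.9).mean())
```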
[Figure 4: four bar charts of recognition rate versus testing dataset (Caltech-101, Event-8, Scene-15), with bars for voc-caltech, voc-event and voc-scene; (a) bi weighting, thc = 0.75; (b) tf weighting, thc = 0.75; (c) bi weighting, thc = 0.8; (d) tf weighting, thc = 0.8.]
Fig. 4. Recognition rates with vocabularies trained from different datasets. The x-axis represents the different testing datasets, and the different bars indicate vocabularies trained from different datasets. It is clear that the different vocabularies perform similarly on the same datasets. Our results should not be compared with the state of the art on these datasets, since that is not the aim of this paper.
In this paper we arrive at much stronger conclusions. When we say an optimal vocabulary is universal, we mean three things. First, the vocabulary can be used on other datasets and obtain performance comparable to their dataset-specific vocabularies. Second, the vocabulary is optimal in that it produces the best performance on any dataset. Third, our optimal vocabularies are of limited size (1000 to several thousand words). This not only means efficient classification, but also implies that a very large vocabulary is not necessary at all. To sum up, we provide an approach to produce a vocabulary that is universal, optimal and compact. Although we currently experiment on only three datasets, all three contain objects of diverse types with large variation and are thus rather representative. In the next step we will extend the experiments to more datasets, such as Caltech-256 [22], Oxford Flowers [23], NUS-WIDE [24] and Graz [25], in order to finally produce a universal visual vocabulary that can be used on a large number of datasets with the best or near-best performance.
4 Conclusion
Previous research on bag-of-visual-words has found that when features are similar enough, they should be represented by one visual word to obtain the best performance. This property is then modeled by an optimal clustering threshold and a similarity-based clustering method. This work implies that the number of optimal visual words is limited and that a universal visual vocabulary exists. In this paper we improve on the previous work and conduct extensive experiments on three challenging datasets to validate this hypothesis. Experimental results show that three vocabularies of limited size trained from the three datasets are very similar to each other, and any of them can be used to obtain the best or near-best performance on all three datasets. This encouraging result indicates that, with more datasets involved, it is feasible to obtain a universal and limited visual vocabulary that can be used on any dataset to obtain optimal performance. This work further narrows the gap between bag-of-visual-words and bag-of-words, its predecessor and counterpart in the text domain.
References 1. Sivic, J., Zisserman, A.: Video google: a text retrieval approach to object matching in videos. In: IEEE International Conference on Computer Vision, pp. 1470–1477 (2003) 2. Grauman, K., Darrell, T.: The pyramid match kernel: Discriminative classification with sets of image features. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1458–1465 (2005) 3. Lazebnik, S., Schmid, C., Ponce, J.: A maximum entropy framework for part-based texture and object recognition. In: IEEE International Conference on Computer Vision, pp. 832–838 (2005) 4. Yang, J., Jiang, Y., Hauptmann, A., Ngo, C.: Evaluating bag-of-visual-words representations in scene classification. In: International Workshop on Multimedia Information Retrieval, pp. 197–206 (2007) 5. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178 (2006) 6. Marszalek, M., Schmid, C.: Spatial weighting for bag-of-features. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2118–2125 (2006) 7. Viitaniemi, V., Laaksonen, J.: Spatial extensions to bag of visual words. In: ACM International Conference on Image and Video Retrieval (2009) 8. Cai, H., Yan, F., Mikolajczyk, K.: Learning weights for codebook in image classification and retrieval. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2320–2327 (2010) 9. Nister, D., Stewenius, H.: Scale recognition with a vocabulary tree. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2161–2168 (2006) 10. Li, T., Mei, T., Kweon, I.S.: Learning optimal compact codebook for efficient object categorization. In: IEEE 2008 Workshop on Applications of Computer Vision, pp. 1–6 (2008)
11. Mallapragada, P., Jin, R., Jain, A.: Online visual vocabulary pruning using pairwise constraints. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 3073–3080 (2010) 12. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: An in-depth study. Technical report, INRIA (2003) 13. Zhao, W., Jiang, Y., Ngo, C.: Keyframe retrieval by keypoints: Can point-to-point matching help? In: ACM International Conference on Image and Video Retrieval, pp. 72–81 (2006) 14. Deselaers, T., Pimenidis, L., Ney, H.: Bag-of-visual-words models for adult image lassification and filtering. In: International Conference on Pattern Recognition, pp. 1–4 (2008) 15. Hou, J., Kang, J., Qi, N.M.: On vocabulary size in bag-of-visual-words representation. In: The 2010 Pacific-Rim Conference on Multimedia, pp. 414–424 (2010) 16. Ries, C.X., Romberg, S., Lienhart, R.: Towards universal visual vocabularies. In: International Conference on Multimedia and Expo., pp. 1067–1072 (2010) 17. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: CVPR, Workshop on Generative-Model Based Vision, p. 178 (2004) 18. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42, 145–175 (2001) 19. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 524–531 (2005) 20. Jia, L.L., Fei-Fei, L.: What, where and who? classifying event by scene and object recognition. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007) 21. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 22. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical report 7694, Caltech (2007) 23. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: IEEE International Conference on Computer Vision, pp. 1447–1454 (2006) 24. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: Nus-wide: A real-world web image database from national university of singapore. In: ACM International Conference on Image and video retrieval, pp. 1–9 (2009) 25. Opelt, A., Fussenegger, M., Pinz, A., Auer, P.: Weak hypotheses and boosting for generic object detection and recognition. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 71–84. Springer, Heidelberg (2004)
Human Body Shape and Motion Tracking by Hierarchical Weighted ICP
Jia Chen (1), Xiaojun Wu (1), Michael Yu Wang (2), and Fuqin Deng (3)
(1) Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
(2) The Chinese University of Hong Kong, Shatin, NT, Hong Kong, China
(3) The University of Hong Kong, Pokfulam Road, Hong Kong, China
{chenjia,wuxj}@hitsz.edu.cn, [email protected]
Abstract. We present a new approach for tracking both the human body shape and the whole-body motion with the complete six DOF of each body limb, without imposing rotation or translation constraints. First, a surface mesh of highly improved quality is obtained by using our new silhouette-based visual hull reconstruction method for each frame of the multi-view videos. Then, a skinned mesh model is fitted to the data using a hierarchical weighted ICP (HWICP) algorithm, where an easy-to-adjust strategy for selecting the set of ICP registration points is given based on the weights of the skinned model, and the Approximate Nearest Neighbors (ANN) method is applied for fast nearest-neighbor search. By comparing HWICP with the general hierarchical ICP (Iterative Closest Point) method on synthetic data, we demonstrate the power of weighting corresponding point pairs in HWICP, especially when adjacent body segments of the target are nearly cylindrical in shape.
1 Introduction
3D tracking of the human body, traditionally known as motion capture, is applied in a variety of fields such as character animation, motion generation for humanoid robots, gesture-based human-machine interaction, biomechanical analysis, ergonomics and surveillance. Currently, marker-based (optical, inertial, mechanical and magnetic) motion capture technology is widely applied in a large number of commercial systems. However, these systems have several main drawbacks: they are expensive, obtrusive, and require a complex, tedious and time-consuming experimental setup. As an attractive non-invasive alternative, markerless motion capture has been a highly active research area over the last decade [17,12], since it does not require users to wear special markers, garments or gloves and is not restricted to the motion information associated with markers. In this paper, a new markerless motion capture technique based on a hierarchical weighted ICP algorithm is proposed that can track the complete six-DOF (degrees of freedom) movement of individual human body limbs from multiple video streams without imposing rotation or translation constraints. As shown in Fig. 1, the 3D body shape is tracked concurrently.
Fig. 1. Tracking pipeline of each frame: (a) Target (b) Silhouette (c) Visual hull (d) 3D human body shape tracking (e) Motion tracking
1.1 Related Works
Accurate and robust 3D human motion tracking from videos is a challenging task and has been a long-standing goal of computer vision and 3D computer graphics. Various vision-based systems have been proposed for tracking human motion over the past years, and there are some good reviews of these methods [17,12]. Existing markerless motion capture systems vary in the number and setup of cameras (single or multiple), the kinematic and shape models of the human body, the image descriptors used, the representation of the captured data, the type of tracking algorithm, and whether they apply to the whole body or a part of it. According to [12], the class of direct model-based multiple-view 3D pose estimation approaches has attracted great attention in the literature. Based on a kinematic (and sometimes also shape and appearance) representation of the human body, this class of methods estimates the pose at time t from the pose at time t - 1. The main differences among these model-based multiple-view algorithms are the choice of kinematic model, shape model, image descriptor, and optimization or filtering technique [17]. Several important works from the past five years are discussed below. For whole-body pose estimation with 24 DOF from multiple views, Kehl et al. introduce a stochastic search strategy during the optimization in [7], which helps the iteration avoid converging to local minima. Their kicking-capture experiment demonstrates that the stochastic search strategy improves the robustness of the system. However, they only use a coarse voxel model as the reconstruction model for fitting, which is not sufficient for accurate motion capture. Ogawara et al. use an articulated ICP method for motion estimation, where a robust estimator and a k-d tree search in pose and normal space enable the system to track dynamic motion robustly against noise [16]. However, it cannot obtain the complete six-DOF movement of individual joints of the human body, which is crucial for good tracking, as revealed in [5]. Also, their work does not describe how to select the set of registration points for each body segment; if done manually, finding the best option requires many trials. By contrast, based on the skinning weights, an easy-to-adjust strategy for selecting the points used in ICP registration is proposed in our contribution. Based on both silhouettes and optical flow, Ballan et al. [2] implement markerless motion capture of a skinned model in a four-camera set-up where the
generic Levenberg-Marquardt optimization method is used. Nevertheless, each element θ_j of the pose vector θ must be given constraints {θ_{j,min} ≤ θ_j ≤ θ_{j,max}}, which are not easy to obtain accurately. Gall et al. [6] introduce a multi-layer framework that combines stochastic optimization, filtering, and local optimization, and their experiments demonstrate the significant performance improvement of the multi-layer framework. However, this combination is more computationally expensive. When the number of cameras C_n ≥ 8 and the cameras are in the most favorable configuration [15], Corazza et al. [5] obtain six DOF of each body segment and accurate human motion measurements using articulated ICP with rotation bounds. In particular, their system can obtain a subject-specific model of each subject using an automatic model generation algorithm [4]. Compared with [5], we introduce a weighting strategy into articulated ICP so that good human motion tracking can be obtained without using rotation or translation bounds. Besides, we obtain complete 3D body shape tracking based on a skinned model. Furthermore, we give an easy-to-adjust strategy for selecting the set of registration points using the skinning weights. Corazza et al. [5] also show that the HumanEva dataset [18] is not suitable for evaluating this type of model-based multiple-view 3D pose estimation method due to its unfavorable camera configuration (number and position). For this reason, we design a flexible motion capture evaluation platform based on the powerful CG software Autodesk 3ds Max, where the camera configuration is easy to adjust and a wide range of human motions can be simulated. From various experiments, we find that the general hierarchical ICP method tends to fail when tracking adjacent body segments that are nearly cylindrical in shape. So in this paper, the HWICP algorithm is presented for robust 3D body shape and motion tracking regardless of whether such near-cylindrical configurations occur. In addition, similar to [13], WEOP is adopted to solve for the Euclidean transformation in our HWICP algorithm. The remainder of this paper is organized as follows. In Section 2, we use our new method for visual hull computation. Section 3 describes the skinned mesh model, which is based on the linear blend skinning technique. Section 4 presents the new HWICP algorithm for both 3D body shape and motion tracking. Section 5 shows the experimental results of the approach. Finally, Section 6 draws the conclusion.
2 Visual Hull Computation
A setup composed of eight calibrated cameras is simulated using the 3ds Max software. As shown in Fig. 2, eight video streams are captured simultaneously and foreground silhouettes are then obtained for each frame using a background subtraction method. As in most previous work [17], only silhouette information is used in this research. The visual hull [8] is used as the 3D feature for each frame of the synchronized multi-view videos. In particular, the quality of the visual hull is one of the main elements determining the quality of motion capture [5]. Here we use our new method
for computing the visual hull mesh [19] from the eight-view silhouettes, in which a simple and efficient voxel projection test strategy is proposed for visual hull octree construction to avoid ambiguity. From only eight silhouettes, the visual hull reconstructed using a general uniform grid combined with the marching cubes method is shown in Fig. 3(a); with the same 7-level voxel resolution of the bounding-box space, the result of our method is shown in Fig. 3(b) for comparison. The partial enlarged view in Fig. 3(c) demonstrates that the result of our method preserves local details and is smoother.
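For background, the sketch below illustrates plain silhouette-based voxel carving on a uniform grid; it is only a simplified stand-in for the octree-based projection test of [19], and the camera projection function is an assumed input:

```python
import numpy as np

def carve_visual_hull(voxels, silhouettes, project):
    """Keep voxels whose projection falls inside every silhouette (illustrative).

    voxels:      (N, 3) voxel centers in world coordinates
    silhouettes: list of binary foreground masks, one per calibrated camera
    project:     project(cam_index, points) -> (N, 2) pixel coordinates
    """
    inside = np.ones(len(voxels), dtype=bool)
    for c, sil in enumerate(silhouettes):
        uv = np.round(project(c, voxels)).astype(int)
        h, w = sil.shape
        valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[valid] = sil[uv[valid, 1], uv[valid, 0]] > 0
        inside &= hit                       # a voxel must project into all silhouettes
    return voxels[inside]                   # occupied voxels; a mesh can then be extracted (e.g. marching cubes)
```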
Fig. 2. Foreground segmentation
Fig. 3. Visual hull reconstruction
3 Human Kinematic and Shape Model
A human shape model with a hierarchical skeleton, used as prior information, is fitted to the reconstructed data for each frame during tracking. The methods described in the review [17] mostly represent the prior human body information with simple shape primitives (e.g., sticks, cylinders and ellipsoids [11,20,13]), which inevitably results in a mismatch between the prior shape model and the reconstruction. Here, a skinned mesh model [10] is used in our tracking approach; it is composed of an internal bone model that represents the kinematic structure and a skin model that represents the surface shape of the human body, as shown in Fig. 4(a). After fitting, the motion of each bone is obtained using the HWICP algorithm introduced below. In addition, the human shape is deformed to the new frame using the well-known skeleton-driven deformation method, Linear Blend Skinning (LBS) [10]. LBS allows the movement of surface vertices to be determined by more than one joint, as described in (1):

v_t = \Big( \sum_{i=1}^{k} w_i T_i \Big) v_{t-1} \qquad (1)
where k is the number of bones, T_i is the homogeneous transformation matrix of bone i, v_{t-1} is a skin vertex in its previous position, and v_t is the vertex after deformation. According to the corresponding vertex weights w_i, the movement of the bones forces the vertices to be transformed, which ultimately brings about the skin deformation. As shown in Fig. 4(b), LBS makes the deformation natural.
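As a concrete illustration of Eq. (1), the following minimal sketch (not from the paper; array shapes and names are assumptions) applies LBS to a set of skin vertices given per-vertex bone weights and 4x4 homogeneous bone transforms:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Deform skin vertices with Eq. (1): v_t = (sum_i w_i T_i) v_{t-1}.

    vertices:        (N, 3) skin vertices at time t-1
    weights:         (N, K) skinning weights, each row summing to 1
    bone_transforms: (K, 4, 4) homogeneous transforms of the K bones
    """
    n = vertices.shape[0]
    v_h = np.hstack([vertices, np.ones((n, 1))])           # (N, 4) homogeneous coordinates
    blended = np.einsum('nk,kij->nij', weights, bone_transforms)  # per-vertex blended transform
    v_out = np.einsum('nij,nj->ni', blended, v_h)           # apply it to each vertex
    return v_out[:, :3]
```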
Fig. 4. Articulated body model: (a) Surface model (shown semi-transparently in OpenGL) and internal skeleton (b) LBS deformation
Fig. 5. Two class areas distinguished in the weighting: the first class area, such as the armpit and crotch, and the second class area, such as the hand
4 The Proposed Tracking Algorithm
4.1 Formulation
Because the pose of the 3D visual hull reconstructed from the first frame of the eight views and that of the initial skin model are the same for our simulated human motion, we assume that the configuration (motion state) of the first-frame visual hull is zero; the accumulated motion transformation at time t relative to the first frame is then simply the motion state at time t. Since the visual hull of the target at any time t can be reconstructed using the method explained in Section 2, given the previous motion state S at time t - 1, the 3D body shape and motion tracking problem can be formulated as the nonlinear least squares problem (2):

E(C_i, T_i) = \min_{T_i} \sum_{j=1}^{m_i} \big\| T_i\, v_{j,t-1}^{\,SkinModel} - v_{j,t}^{\,VisualHull} \big\|^2 \quad \text{s.t.} \quad C_i = \begin{pmatrix} c_{11} & \cdots & c_{1 m_i} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{n m_i} \end{pmatrix}, \; T_i \in SE(3) \qquad (2)

where i is the bone index; m_i is the number of corresponding point pairs of the i-th bone; T_i is the motion of this bone; C_i is the matrix representation of the correspondence between vertices of the skin model and the visual hull (if the p-th point and the q-th point form a corresponding pair, then c_{pq} = 1, otherwise c_{pq} = 0); and v_{j,t-1}^{SkinModel} and v_{j,t}^{VisualHull} are a pair of corresponding points from the skin model and the visual hull, respectively.
4.2 Hierarchical ICP
When minimizing Eq. (2), if we try to estimate the motions of all bones simultaneously with the basic Iterative Closest Point (ICP) method [3], the terminal bones in the tree structure tend to fall into local minima, which prevents the other bones from being aligned correctly.
Because of the naturally hierarchical, articulated structure of the human body, hierarchical ICP has recently been used: the skinned root bone is registered with the visual hull first, and the other skinned bones are then registered with the visual hull hierarchically, from the root down to its descendants. At each step, ICP computes the rigid transformation T of the current limb that best fits the visual hull.
4.3 Hierarchical Weighted ICP
We are not fitting two ideal smooth triangular meshes: although our new visual hull computation method yields a surface mesh of greatly improved quality, it is still a rather rough shape in which some mesh vertices are of good quality while others are very noisy, so we cannot treat them equally in the registration. We therefore construct another problem formulation (3) by introducing a weight term w_j, and present the hierarchical weighted ICP (HWICP) algorithm described in Algorithm 1 to solve it:

E(C_i, T_i) = \min_{T_i} \sum_{j=1}^{m_i} w_j \big\| T_i\, v_{j,t-1}^{\,SkinModel} - v_{j,t}^{\,VisualHull} \big\|^2 \quad \text{s.t.} \quad C_i = \begin{pmatrix} c_{11} & \cdots & c_{1 m_i} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{n m_i} \end{pmatrix}, \; T_i \in SE(3) \qquad (3)

In HWICP (see Algorithm 1), the corresponding-point search works from the skinned model to the visual hull data, computing the closest point on the body segments for each vertex whose skinning weight exceeds a weight threshold. Compared with [5], which needs a fixed and manual selection of the set of registration points for each body limb, this method (step 2 of Algorithm 1) is an easy-to-adjust strategy obtained by adopting different thresholds; in our experiments the best tracking is obtained when the threshold equals 0.93. Unlike [16], which uses the standard k-d tree method, we optimize the nearest-neighbor search among 3D points with the ANN algorithm [14], since we found that ICP spends a significant part of its runtime on nearest-point search. As described in step 5 of Algorithm 1, corresponding vertex pairs are in general weighted by normal compatibility (the normal dot product), except for vertices in the two special classes of areas shown in Fig. 5. A 3D character model from a scanner or image-based reconstruction normally has some unavoidable defects, such as the armpit and crotch shown in Fig. 5, which we call the first class area; their weights are set to 0 to eliminate their bad influence on the ICP registration. In the second class area, such as the hand shown in Fig. 5, the normals change rapidly, so these vertices are given a large weight (set to 10 in our HWICP algorithm) to enhance their effect during ICP registration. When this weighting strategy is introduced, the experimental results in the next section demonstrate that the system can robustly track the complete six-DOF movement of each body limb.
Algorithm 1. HWICP (hierarchical weighted ICP) algorithm
Input: Skinned model with motion state at time t - 1; visual hull mesh of the target at time t.
Output: Skinned model with motion state at time t.
1:  for each body limb of the skinned model (traverse the tree structure using preorder traversal) do
2:    for each vertex in this limb with skinned weight >= weight threshold do
3:      if current iterate count <= max iterate threshold then
4:        (a) Search the nearest vertex in the visual hull mesh using the ANN algorithm [14];
5:        (b) Weight the corresponding pair of skin model vertex and visual hull vertex:
6:          if normal dot product <= 0 or Euclidean distance >= distance threshold then
7:            the weight of this vertex is set to 0;
8:          else if the vertex is in the first class area then
9:            the weight of this vertex is set to 0;
10:         else if the vertex is in the second class area then
11:           the weight is set to a big value;
12:         else
13:           the weight is set to the normal dot product;
14:         end if
15:       (c) Apply the WEOP algorithm in [1] to solve for the transformation T;
16:       (d) Apply the transformation to the limb and its descendants;
17:     end if
18:   end for
19: end for
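The weighting rule of step 5 can be written compactly as below. This is an illustrative sketch, not the authors' code: the distance threshold value is an assumption, the masks for the two special area classes are assumed to be precomputed on the skin model, and the big weight follows the value 10 quoted in the text.

```python
import numpy as np

def hwicp_pair_weights(src_normals, tgt_normals, distances,
                       first_class_mask, second_class_mask,
                       dist_threshold=0.05, big_weight=10.0):
    """Weights for corresponding point pairs, following step 5 of Algorithm 1."""
    ndot = np.einsum('ij,ij->i', src_normals, tgt_normals)   # normal compatibility
    w = ndot.copy()                                          # default: normal dot product
    w[second_class_mask] = big_weight                        # rapidly changing normals (e.g. hands)
    w[first_class_mask] = 0.0                                # defect areas (armpit, crotch)
    w[(ndot <= 0.0) | (distances >= dist_threshold)] = 0.0   # incompatible or distant pairs
    return w
```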
5 Experimental Results
We set up 8 calibrated cameras on the ceiling of the scene in 3ds Max and render 8 video streams as the input of our human body shape and motion tracking. As we intend to make the motion capture evaluation platform flexible for human motion over a large range of space, we choose 3.5 mm lenses for all cameras so that the average height of the human seen by each camera is about 1/3 of the entire image height. Based on a background subtraction method, we obtain 8 silhouettes for each frame and use the method detailed in Section 2 for visual hull mesh reconstruction from the eight-view silhouettes. Fig. 2 shows the effectively improved quality of the visual hull obtained with our method. The prior skinned mesh model consists of a surface triangular mesh with about 30000 vertices and 14 bones with six DOF each. Several types of human motion are tested, and hierarchical articulated ICP, where motion parameters are estimated hierarchically, is better than the basic ICP, in which motion parameters are estimated simultaneously. However, the hierarchical ICP method tends to fail when tracking adjacent body segments that are nearly cylindrical in shape, as shown in Fig. 6. The first row represents frames 16, 30, 40, 50 and 63 of the original video recorded by one of the
Fig. 6. Tracking results: (a) five captured frames from one of the eight video cameras; (b) 3D body shape tracking using hierarchical ICP; (c) 3D body shape tracking using hierarchical weighted ICP; (d) estimated motion when using the HWICP algorithm (front view); (e) estimated motion when using the HWICP algorithm (side view). Note that although the 3D shape tracking in (c) suffers local distortion from the inherent limitations of LBS, the final estimated human motion shown in (d) and (e) still remains good enough.
cameras of the motion capture system. The second row shows the 3D body shape tracking results using hierarchical ICP, where the tracking of the right arm failed when the lower arm and upper arm were nearly cylindrical during the motion. On one hand, although our new visual hull computation method yields a surface mesh of greatly improved quality, it is still a rather rough shape, and the 3D character shape model we use has some defects, as shown in Fig. 5; on the other hand, the surface shape difference between the lower arm and the upper arm is not as significant as that between thigh and calf. So, owing to the local-optimization nature of the ICP algorithm, the tracking failed. The third row shows the clearly improved tracking results using our hierarchical weighted ICP algorithm. The fourth row shows the corresponding bone poses obtained with our HWICP algorithm in the front view, and the fifth row shows the bone poses in the side view. We can see that although the 3D shape tracking in (c) suffers local distortion from the inherent limitations of LBS [9] (known as the 'collapsing elbow' problem, which results directly from the fact that the deformation is restricted to the indicated skeleton subspace), the final estimated human motion shown in (d) and (e) still remains good enough.
6 Conclusion and Future Work
In this paper, a markerless motion capture algorithm for tracking both the human body shape and the whole-body motion with the complete six-DOF movement of each body limb is presented; unlike previous work, it does not impose rotation or translation constraints. The proposed approach features the new visual hull mesh reconstruction, the skinned model with an easy-to-adjust strategy for selecting the set of ICP registration points, the ANN method for faster nearest-neighbor search, and the hierarchical weighted ICP algorithm, which is shown to benefit from the weighting strategy. For future research, we plan to incorporate 2D features such as optical flow to refine the tracking result and to use the GPU to increase the tracking speed. We will then build the hardware system for real human motion capture and use real data to test our approach. Acknowledgments. This project is partially supported by the Natural Science Foundation of China (NSFC No. 50805031 and No. 61063019), the Science & Technology Basic Research Projects of Shenzhen (No. JC200903120184A, JC201005260161A), and the Foundation of the State Key Lab of Digital Manufacturing Equipment & Technology (No. DMETKF2009013). We thank Stefano Corazza from Stanford University for providing the 3D model of subject S4.
References 1. Akca, D.: Generalized procrustes analysis and its applications in photogrammetry. Tech. rep., ETHz (2004) 2. Ballan, L., Cortelazzo, G.: Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes. In: Proceedings of 3D Data Processing, Visualization and Transmission (3DPVT 2008), pp. 36–43 (2008)
3. Besl, P., McKay, H.: A method for registration of 3-d shapes. IEEE Trans. on PAMI 14(2), 239–256 (1992) 4. Corazza, S., Gambaretto, E., Andriacchi, T.: Automatic generation of a subjectspecific model for accurate markerless motion capture and biomechanical applications. IEEE Trans. on Biomedical Engineering 57(4), 806–812 (2009) 5. Corazza, S., M¨ undermann, L., Andriacchi, T.P.: Markerless motion capture through visual hull, articulated icp and subject specific model generation. International Journal of Computer Vision (IJCV) 87, 156–169 (2010) 6. Gall, J., Rosenhahn, B., Brox, T., Seidel, H.: Optimization and filtering for human motion capture. Int. Journal of Computer Vision (IJCV) 87, 75–92 (2010) 7. Kehl, R., Bray, M., Van Gool, L.: Full body tracking from multiple views using stochastic sampling. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 129–136 (2005) 8. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. on PAMI 16, 150–162 (1994) 9. Lewis, J.P., Cordner, M., Fong, N.: Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In: Proceedings of the 27th SIGGRAPH, pp. 165–172 (2000) 10. Magnenat-Thalmann, N., Laperrire, R., Thalmann, D.: Joint-dependent local deformations for hand animation and object grasping. In: Proceedings of Graphics Interface 1988, pp. 26–33 (1988) 11. M´enier, C., Boyer, E., Raffin, B.: 3d skeleton-based body pose recovery. In: Proceedings of Third International Symposium on 3DPVT, pp. 389–396 (2007) 12. Moeslund, T.B., Hilton, A., Kr¨ uger, V.: A survey of advances in vision-based human motion capture and analysis. CVIU 104, 90–126 (2006) 13. Moschini, D., Fusiello, A.: Tracking human motion with multiple cameras using an articulated model. In: Gagalowicz, A., Philips, W. (eds.) MIRAGE 2009. LNCS, vol. 5496, pp. 1–12. Springer, Heidelberg (2009) 14. Mount, D.M., Arya, S.: Ann programming manual, version 1.1 (2010), http://www.cs.umd.edu/~ mount/ANN/ 15. Mundermann, L., Corazza, S., Chaudhari, A.M., Andriacchi, T.P.: Most favorable camera configuration for a shape-from-silhouette markerless motion capture system for biomechanical analysis, vol. 5665, pp. 278–287. SPIE, San Jose (2005) 16. Ogawara, K., Li, X.L., Ikeuchi, K.: Marker-less human motion estimation using articulated deformable model. In: Proceedings of the IEEE ICRA, pp. 46–51 (2007) 17. Poppe, R.: Vision-based human motion analysis: An overview. Computer Vision and Image Understanding 108, 4–18 (2007) 18. Sigal, L., Black, M.J.: Humaneva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Tech. rep., Brown University (2006) 19. Song, P., Wu, X., Wang, M.Y.: A robust and accurate method for visual hull computation. In: Proc. of the IEEE ICIA, pp. 784–789 (2009) 20. Takahashi, K., Hashimoto, M.: Remarks on markerless human motion capture from voxel reconstruction with simple human model. In: Proc. of the IEEE/RSJ Int. Conference on IROS, pp. 755–760 (2008)
Multi-view Head Detection and Tracking with Long Range Capability for Social Navigation Planning
Razali Tomari, Yoshinori Kobayashi, and Yoshinori Kuno
Graduate School of Science & Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570, Japan
{mdrazali,yosinori,kuno}@cv.ics.saitama-u.ac.jp
Abstract. Head pose is one of the important human cues in social navigation planning for robots that coexist with humans. Inferring such information for distant targets from a mobile platform is a challenging task. This paper tackles this issue and proposes a method for detecting and tracking head pose under these constraints using an RGBD camera (Kinect, Microsoft). Initially, possible human regions are segmented out and then validated using depth and Hu moment features. Next, plausible head regions within the segmented areas are estimated by employing Haar-like features with an AdaBoost classifier. Finally, the obtained head regions are post-validated by means of their dimensions and their probability of containing skin, before the pose is estimated and tracked by a boosted-based particle filter. Experimental results demonstrate the feasibility of the proposed approach for detecting and tracking head pose of far-range targets under spot-light and natural illumination conditions.
Keywords: Depth segmentation, Head detection, Tracking, RGBD camera.
1 Introduction
Human-robot interaction during navigation has received much attention in recent years, since robots will coexist with humans in the near future. With this capability, robots can consider the social aspects of interaction with people when planning actions. For instance, when a robot encounters a human from the human's left side, conventional planning methods based on free-space availability may suggest movement in either direction. Considering socially acceptable rules of encounter, however, the direction to the right is more appropriate, since humans feel more comfortable when robots pass behind them rather than directly crossing their frontal space. To realize such interaction, robots must be capable of sensing humans in their proximity and subsequently predicting their position, orientation and intention. Some basic human cues suitable for this purpose are heads, legs and whole bodies. Among the available features, we opt for the head, because it provides instantaneous information about a human's intention and is extremely useful to analyze how aware the person is of the robot's existence in the scene. Research on head detection can be divided into the frontal-view case [1-5] and the multi-view case [6-9]. The latest survey of this field can be found in [10].
For our purpose, we focus on the multi-view case, since during maneuvering the robot may encounter humans from any direction and hence human heads may be observed in arbitrary poses. Multi-view head detection needs to solve two sequential tasks: 1) discriminating between face and non-face regions, and 2) identifying the face poses. Despite many important research efforts devoted to this problem, algorithm development is still an open research issue in the human-robot interaction field. Conventional algorithms can be divided into two major classes, namely the feature-based approach and the image-based approach. The feature-based approach is generally low in computational demand; however, the bottleneck is that contour information is hard to exploit on highly cluttered backgrounds and when the head's shape is non-uniform, such as when a person wears a hat or in occlusion cases. Therefore we have adopted the other approach. The image-based approach exhaustively searches an entire image using the sliding-window principle and validates each sub-window for faces using a linear or nonlinear filter. Current state-of-the-art methods are based on the Viola and Jones framework [5], which has been proven to work in real time with high accuracy. To achieve the multi-view requirement, each cascade classifier is trained with features extracted from single face poses or mixed poses arranged in parallel, pyramid or decision-tree structures [10]. All of these methods benefit from Haar-like features extracted rapidly with the help of integral images. In [8], J. Meynet et al. propose to fuse Gaussian features with Haar-like features to estimate the pose more accurately, based on the fact that a simple linear filter is fast but, unlike the nonlinear Gaussian filter, unable to discriminate poses well. M. Chen et al. [9] attempt to perform profile head detection in the gradient space (edge and contour); they conclude that the gradient image works well to differentiate multi-view head images, with performance similar to execution on grayscale images, but unfortunately false alarms increase. When depth information is available, it can be helpful for a preliminary guess of the areas most likely to contain faces. S.-H. Cho et al. [7] infer close-range human positions using 2D spatial-depth histogram features obtained from a depth image, then estimate human poses using four 2D elliptical filters with specific orientations; the detected areas are verified as humans using head, shape, contour and color cues. M. Dixon et al. [3] use depth information to filter out spurious face detections. The method was later extended by W. Burgin et al. [4]; they apply the sliding-window method to the entire image and reject any sub-window that does not satisfy realistic face geometric constraints (size, range and texture), and only a sub-window that survives this stage is evaluated by the face detector. We build our system motivated by [7] and tackle the issues from a different perspective to make it applicable to long-range targets. In general, our work is closely related to [3] and [4] in the sense that depth information is used for reducing false positive errors. However, our method differs from these works in that we do not apply the multi-scale sliding-window approach to the entire image; instead, we predict early hypotheses of areas that most likely contain humans using certain feature constraints obtained from the depth map.
Obviously this step generates multiple regions of interest (ROIs) that represent portions of the whole image. The sliding-window principle is then applied only to the ROIs, which significantly reduces the computational cost of the head detector.
2 System Setup
The proposed system was implemented on a robotic wheelchair (TT-Joy, Matsunaga Corporation) mounted with a forward-looking RGBD camera (Kinect, Microsoft), as shown in Fig. 1 (left). The Kinect camera projects an infrared pattern onto the surroundings and obtains distance data based on triangulation for each projected feature. In our implementation, the camera is located 1.3 meters above the ground and can supply both RGB and depth images in real time using its onboard processor.
[Fig. 1 block diagram modules: RGBD Camera; Calibration Parameters (RGB, Depth); Depth Assisted Object Segmentation; Region Validation; Depth to Grayscale Conversion; Head Detection; Head's Region Validation; Head's Pose Tracking]
Fig. 1. Outline of the system: (left) Hardware setup. (right) System block diagram
In our work we use the standard Viola-Jones algorithm [5] for detecting multi-view faces. To satisfy the long-range requirement we use images of resolution 640x480, since in our initial testing at low resolution (320x240) only heads lying within 4 meters are detected, i.e., the head dimension beyond that range is less than 20x20 pixels (the suggested size for training the cascade classifier). With this resolution the processing time inevitably increases (average 3 fps) and so do false alarms. However, by using the proposed model in Fig. 1 (right), both the computational cost and the false alarms can be reduced. From the object segmentation module, hypotheses of probable human regions (more specifically heads) are generated in the x-z plane (z: depth direction) using connected component analysis with a human size constraint. The acquired regions are then mapped onto the x-y plane of the RGB and depth images, respectively. Once the initial image regions of interest (ROIs) are obtained, we validate them as human/non-human using upper-body silhouette (torso and head) features. Only the validated areas are further evaluated by the head detector and subsequently examined for valid dimension and skin availability. Finally, the head poses are estimated and tracked by the boosted-based particle filter. In the following sections, details of the computational methods for each module are described.
2.1 Camera Calibration and 8-Bit Depth Image Conversion
To allow reasoning about RGB pixel placement in the 3D world coordinate system, we make use of the depth data supplied by the RGBD camera. Since the focal points of both
color and range cameras are located on different axes, we need calibration beforehand to rectify the cameras' parallax. Details about the calibration process can be found in [11]. In short, the process estimates the intrinsic parameters of both the RGB and IR cameras and determines the transformation matrix that maps pixels between them according to equation (1), where K is the camera intrinsic matrix and H is the extrinsic matrix for the projection from the depth image onto the RGB image:

X_{RGB} = K_{RGB}^{-1} H K_{Depth} X_{Depth} \qquad (1)
Kinect provides depth images with 11 bits per pixel, and we convert them into 8-bit grayscale images for convenience in image processing. However, if we simply normalized each value, detailed depth information for distant objects would be lost. From our initial tests, the depth data d_{x,y} have high resolution if objects are located within 0.5 to 4 meters and low resolution beyond that. Based on this fact, we use the adaptive normalization of equation (2) to obtain a grayscale image G(x,y) that retains important depth information for far objects.
G(x,y) = \begin{cases} \dfrac{d_{x,y} - 490}{510} \times 200 & \text{if } 490 \le d_{x,y} \le 1000 \\ 200 + (d_{x,y} - 1000) & \text{if } 1000 < d_{x,y} < 1050 \\ 255 & \text{elsewhere} \end{cases} \qquad (2)
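A minimal sketch of this adaptive conversion (illustrative; it assumes the raw depth d_{x,y} is given in the same units as the thresholds of Eq. (2)):

```python
import numpy as np

def depth_to_gray(d):
    """Adaptive 8-bit conversion of a raw depth image following Eq. (2)."""
    d = d.astype(np.float32)
    g = np.full(d.shape, 255.0)                      # default: elsewhere -> 255
    near = (d >= 490) & (d <= 1000)
    g[near] = (d[near] - 490.0) / 510.0 * 200.0      # high-resolution near range
    mid = (d > 1000) & (d < 1050)
    g[mid] = 200.0 + (d[mid] - 1000.0)               # compressed far range
    return g.astype(np.uint8)
```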
2.2 Depth Assisted Object Segmentation
To infer object existence in the scene, we use a segmentation method based on [12]. Initially, a plan view map of the camera's surroundings is constructed from the depth data on the x-z plane. To prune out floor and ceiling information, we use object height: objects are considered valid if their height lies between two predefined thresholds h_min and h_max. We carefully tune these values to ensure that the predicted map does not cut away too many objects on the ground while simultaneously removing the ceiling plane. In our implementation, h_min and h_max are 40 cm and 180 cm, respectively. Once unnecessary pixels are removed, a binary map is generated by applying grey-level thresholding to the plan view map. Noise resulting from the elimination and thresholding steps is removed by a combination of erosion and dilation operations. An example of the obtained binary map is shown in Fig. 2(a). From the binary map, connected component analysis is performed and any object entities that are too small or larger than a normal human dimension are filtered out, leaving possible regions that most likely are humans (indicated by red boxes in Fig. 2(a)). Next, the information about these regions, O_tv = [x_tv, y_tv, w_tv, h_tv, d_tv] (x-position, y-position, width, height and depth, respectively), is projected onto the depth image, O_g = [x_g, y_g, w_g, h_g, d_g], and the RGB image, O_r = [x_r, y_r, w_r, h_r, d_r], using the calibration parameters discussed in Section 2.1. A sample output of this process is shown in Fig. 2.
(c)
Fig. 2. (a) Plan view map (x-z space) with possible location of human’s (red box). (b) The correspond location in depth image (x-y space). (c) In RGB image (x-y space).
2.3 Region Validation The segmented regions may contain human/non-human areas. Since running the head detector on all regions is costly, we propose a simple and fast filtering method for rejecting distinct non-human areas. To do so, we convert each candidate region to a binary image by using the region’s distance information. The value ‘1’ is assigned if the depth value is lower than the distance (dg) and ‘0’ otherwise. Region candidates are then confirmed as valid human areas via a linear filter with Hu moment [13]. Hu features have been successfully used in [14] for shape recognition. Its reputation of achievement has promoted it to be a popular technique for classification. One advantage suggested by the moment is that it can easily be invariant in 2D transformation such as translation and scaling, which is very convenient and suitable for our purpose. In this work, Hu values are standardized according to their power unit length due to the fact that the original values are too small. We collected samples of 350 human silhouettes from different poses as shown in Fig. 3(left), and computed seven Hu features of these data to examine their distribution. From this process, we have found that only the Hu_1 values produce consistent data during mapping, and therefore we construct our filter plane by performing the least square fitting method on this data.
Fig. 3. (left) Samples used for constructing the linear filter. (right) Hu moments filtering result, red rectangles denote possible human areas supplied to the next head detection process.
Segmented regions where Hu_1 values lie far from the constructed plane are considered unreliable and eliminated from the scene. A sample outcome of this process can be seen in Fig. 3 (right), which shows that numbers of erroneous
Multi-view Head Detection and Tracking with Long Range Capability
423
segmented areas are correctly removed (compare to Fig. 2 (c)) and that the remaining one still retains the human’s region. Since the output of this process still contains small amount of non-human regions, we cannot simply use the silhouette information for locating head positions. For this reason, we localize head positions on the validated RGB image regions by using Viola & Jones framework [5]. 2.4 Head Detection We extract a set of validated regions containing human heads by using the AdaBoostbased cascade classifier trained to recognize multi-view faces. This classifier works by constructing a strong classifier (positive images) as linear combination of a large pool of weak classifiers (negative images). In the detection process, a series of classifiers are applied to every image sub-window. Regions are considered valid if they pass through all the classifier stages while apparently most of the regions are normally rejected in early stages. However, relying on this detector alone is not enough; still numbers of non-face regions are often detected. To overcome this, each detected region is further refined by examining its dimensionality and probability of containing skin color. In this work, we combine skin color detection from [15] with the gray world assumption [16], which increases the capability to tolerate with some degree of low illumination. The region is considered valid if its dimension is within the range of normal human size and the skin probability rate is higher than a predefined threshold value. Fig. 4 (left) demonstrate an example of head detection result by the proposed method. As can be seen, our method can locate the head region accurately and at the same time effectively remove false alarms.
Fig. 4. Head detection result by the proposed method (left) and the result obtained by directly applying the head detector to the same image data (right)
To show the feasibility of our proposed method, we conducted another experiment by running the head detector directly on the whole image sequence. The result is shown in Fig. 4 (right). It indicates that, even though the head region is correctly detected, numerous false alarm regions exist. In contrast, our method does not exhibit such behavior; we can efficiently handle the false alarm problem with the help of the segmentation and validation procedures. In both examples the target is located about 6 meters from the camera. It is worth mentioning that we use the same detector in both examples.
2.5 Head Tracking
The detection step gives possible head regions. To continuously predict the head poses, the regions must be tracked over time, and we therefore adopt a method based on the particle filter framework [16]. In our work, we define an independent particle filter for each head region. A region is assigned as a newly tracked object if its Euclidean distance to all currently tracked regions exceeds a minimum requirement. On the other hand, tracks are deleted if their stable counts fall below a predefined threshold. The state vector of each region is denoted by a bounding box h_t = [x_t, y_t, w_t, θ_t], where the parameters are the center x, center y, width, and head angle. Particles at time t are propagated using the previous state and the average of eight-point optical flows within the defined bounding box. Random Gaussian vectors are added to each particle in order to provide the system with a diversity of hypotheses. The measurement model evaluates each particle's confidence level by computing its weight. We use two evaluation methods, based on the image contour obtained from a Sobel edge detector and on seven pre-trained cascade classifiers for frontal, left 45°, right 45°, left 90°, right 90°, left-back and right-back faces. The overall particle weight is computed by combining the likelihoods from the image contour and the classifiers. Afterwards, the current state of each target is estimated from the weighted average of all particles. Fig. 5 shows a sample result at frame 215 produced by the tracker using the proposed method (left) and, for comparison, by the detector alone described in Section 2.4 (right). Comparing the two shows that our method can track the head region precisely and eliminate most of the false alarms that greatly affect the tracking results when head detection is performed directly on the image sequence.
Fig. 5. Sample head tracking results at frame 215 in a sequence of 280 frames. By the proposed method (left) and by the method running the detector directly on each incoming image frame (right).
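For illustration, one predict/update/resample cycle of such a per-head particle filter might look as follows (a simplified sketch; the state layout follows the text, while the noise scales and the measurement callback are assumptions):

```python
import numpy as np

def particle_filter_step(particles, flow_mean, measure_fn, sigma=(2.0, 2.0, 1.0, 5.0)):
    """One cycle for a head state [x, y, w, theta].

    particles:  (P, 4) array of hypotheses
    flow_mean:  (2,) mean optical-flow displacement inside the current bounding box
    measure_fn: callable returning a likelihood for one state (e.g. contour x cascade score)
    """
    # Predict: shift by the average optical flow and add Gaussian diffusion
    particles[:, 0:2] += flow_mean
    particles += np.random.normal(0.0, sigma, particles.shape)
    # Update: weight particles by the combined measurement likelihood
    weights = np.array([measure_fn(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # Estimate the current state and resample
    estimate = (weights[:, None] * particles).sum(axis=0)
    idx = np.random.choice(len(particles), len(particles), p=weights)
    return particles[idx], estimate
```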
3 Experimental Results
In this section we present a number of experiments conducted to assess the system performance. The system runs on a 2.4 GHz i5-450M processor. We measured the performance in five different real scenarios, labeled lab 1, lab 2, hallway 1, hallway 2, and hallway 3, based on two criteria: 1) correctly locating head regions, and 2)
accurately tracking head poses based on seven classes, as illustrated in Fig. 6 (front (R1), left-front (R2), left (R3), left-back (R4), right-back (R5), right (R6), right-front (R7)). To simplify the evaluation process, we grouped the poses into three main categories: Front (R1, R2 and R7), Left (R3 and R4) and Right (R5 and R6). In the lab environments the video was captured from a static base under spot-light illumination, while in the hallways the video was captured by the robotic wheelchair navigating at constant velocity under natural light exposure (hallway 1 and hallway 2) and spot-light exposure (hallway 3). Targets in the experiments moved randomly within a range of 1 to 8 meters from the camera.
Fig. 6. Head pose classification for evaluation
Table 1 summarizes our results. It can be seen that under good lighting conditions our proposed system achieves high performance in terms of head detection (average of 90%) and pose tracking (average of 82%). Meanwhile, under natural light exposure we obtain fair performance, with an average of 70% for head detection and 60% for pose tracking. This is because low light does not render head texture well and hence prevents the system from accomplishing the tasks accurately. However, our proposed system exhibits low false positive (false alarm) error rates in all situations, thanks to the segmentation process and the area validation procedure. Sample results of the proposed method for all testing environments are shown in Fig. 7.

Table 1. Performance of the proposed system for handling five different situations

Video      Total Frames   Head Detection                  Correctly tracked poses
                          True Positive   False Positive
Lab 1      280            92%             0%              77%
Lab 2      1170           92%             6%              85%
Hallway 1  1350           80%             1.2%            50%
Hallway 2  2130           60%             2.5%            70%
Hallway 3  1440           85%             1.8%            85%
To prove the feasibility of our method, we compare its performance with that of the method running head detection directly on each incoming frame; the result is given in Fig. 8. From this figure we conclude that our method achieves high accuracy in locating head regions and a low false detection rate in all given situations.
Fig. 7. Experimental results of the proposed system
Fig. 8. Performance comparison of the proposed method with the one that runs the head detector directly on each incoming frame
4 Conclusion and Future Work
We have proposed a method for multi-view head detection and tracking with long-distance capability from a mobile platform. It reduces most false alarm errors and at the same time attains high accuracy in tracking the pose information. In good illumination conditions, we obtained an average 90% detection rate and 82% pose tracking rate with less than 3% false alarms. Under natural light exposure, the average performance was around 70% for detection and 60% for pose tracking. For our purpose this performance is acceptable, since during navigation we only make use of head pose information to plan more socially acceptable movement. Even if the system fails to supply an accurate head pose, the chosen path can still be a safe route, though it may create an awkward situation for the human.
Acknowledgments. This work was supported in part by JSPS KAKENHI (22243037).
References 1. Bohme, M., Haker, M., Riemer, K., Martinez, T., Barth, E.: Face Detection Using a Timeof-Flight Camera. In. Proc of the DAGM 2009, pp. 167-176 (2009) 2. Fisher, J., Seitz, D., Verl, A.: Face Detection using 3-D Time-of-Flight and Color cameras. In: 41st Intl. Symp. on Robotics and ROBOTIK, pp. 112–116 (2010) 3. Dixon, M., Heckel, F., Pless, R., Smart, W.D.: Faster and More Accurate Face Detection on Mobile Robots using Geometrical Constraints. In: Proc. IROS 2007, pp. 1041–1046 (2007) 4. Burgin, W., Pantofaru, C., Smart, W.D.: Using Depth Information to Improve Face Detection. In: Proc. HRI 2011 (2011) 5. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features In: Proc. of Int. Conf. on Comp. Vision and Pattern Recognition, pp. 511-518 (2001) 6. Kruppa, H., Santana, M.C., Schiele, B.: Fast and Robust Face Finding via Local Context. In. Proc. Joint IEEE Intl’ Workshop on VS-PETS 7. Cho, S.-H., Kim, T., Kim, D.: Pose Robust Human Detection in Depth Images Using Multiply-oriented 2D Elliptical Filters. Intl. Jnl. of Patt. Recog. 24(5), 691–717 (2010) 8. Meynet, J., Arsan, T., Mota, J.C., Thiran, J.-P.: Fast Multiview face Tracking with Pose Estimation. In: Proc. of the 16th European Signl. Processing Conf., pp. 1–12 (2008) 9. Chen, M., Ma, G., Kee, S.: Multi-view Human head Detection in Static Images. In: Proc. IAPR Conf. on Machine Vision Applications, pp. 100-103 (2005) 10. Zhang, C., Zhang, Z.: A Survey on recent Advances in Face Detection, Technical Report, Microsoft Research (2010) 11. http://www.ros.org/wiki/kinect_calibration/technical 12. Huang, Y., Fu, S., Thompson, C.: Stereovision-Based Object Segmentation for Automotive Applications. Proc. EURASIP Jnl. on App. Signl. 14, 2322–2329 (2005) 13. Hu, M.K.: Visual Pattern Recognition by Moment Invariants. IEEE Trans. On Information Theory 8, 179–187 (1962) 14. Lou, T., Kramer, K., Goldgof, D., Hall, L., Sampson, S., Remsen, A., Hopkins, T.: Learning to recognize plankton. In: IEEE Intl. Conf. on Systems, Man & Cybernetics, pp. 888–893 (2003) 15. Chai, D., Ngan, K.: Face Segmentation using Skin-Color Map in Video Phone Applications. IEEE Trans. Circt. and Syst. for Video Technology 9(4), 551–564 (1999) 16. Buchsbaum, G.: A Spatial Processor Model for Object Colour Perception. J. Franklin Institute 11(9), 1–26 (1980) 17. Kobayashi, Y., Sugimura, D., Sato, Y., Hisawa, H., Suzuki, N., Kage, H., Sugimoto, A.: 3D Head Tracking using The Particle Filter with Cascade Classifiers. In: Proc. BMVC, pp. 37–46 (2006)
A Fast Video Stabilization System Based on Speeded-up Robust Features
Minqi Zhou1 and Vijayan K. Asari2
1 Old Dominion University, Norfolk, Virginia, USA; 2 University of Dayton, Dayton, Ohio, USA
[email protected], [email protected]
Abstract. A fast and efficient video stabilization method based on speeded-up robust features (SURF) is presented in this paper. The SURF features are extracted and tracked in each frame and then refined through Random Sample Consensus (RANSAC) to estimate the affine motion parameters. The intentional camera motions are filtered out through Adaptive Motion Vector Integration (AMVI). Experiments performed on several video streams illustrate superior performance of the SURF based video stabilization in terms of accuracy and speed when compared with the Scale Invariant Feature Transform (SIFT) based stabilization method. Keywords: Video Stabilization, Feature Extraction, Motion Vector Integration, SURF.
1 Introduction
Video streams recorded by portable video cameras commonly suffer from unexpected shaky motion of varying degrees. From the viewer's perspective, it is hard to focus on the region of interest, since the video contains undesired shaky vibrations. Video stabilization is a process that rearranges the video sequence and removes the undesired motion. Since we define the shaky motion as high frequency components, a stabilized video is a video without these undesired high frequency motion components. Numerous video stabilization methods have been presented in the literature, such as block matching [1-3], FFT based methods [4], optical flow [5-6], phase correlation [13] and feature matching [7-8]. However, these methods lose their validity in some cases. For example, the block matching method is sensitive to illumination, noise and motion blur [3]. The phase correlation method is immune to white noise, but it can only estimate translation parameters, and the estimation result is poor if rotation or scale changes occur. The FFT based method can determine the translation, rotation and scaling differences between two images, but the numerical conversion from Cartesian to log-polar coordinates brings significant re-sampling error [4], which severely interferes with the resulting transformation parameters. Optical flow methods cannot handle large displacements well without multi-scale approaches; in addition, their performance also suffers if the image has little or no texture [6].
Feature based methods extract stable features from each frame and estimate the inter-frame motion from these features. SIFT based video stabilization was introduced in 2007 [8]. This method is invariant to translation and rotation and partially invariant to illumination changes and 3D viewpoint [7]. The SIFT feature descriptor performs better than other feature descriptors in most cases. However, due to its high dimensionality and computational complexity, the SIFT descriptor is not suitable for real-time applications. The Speeded-Up Robust Feature (SURF) was introduced in 2006. SURF is a fast and robust feature detector which is widely used in different computer vision applications such as object recognition and 3D reconstruction [9]. It has been shown to be more efficient than other feature descriptors in terms of repeatability, distinctiveness and robustness [10].
Fig. 1. Diagram of SURF based video stabilization
Figure 1 shows the framework of the SURF based video stabilization method presented in this paper. First, we extract Speeded-Up Robust Features from each frame. These features are matched, refined, and then used in motion estimation to approximate the vibration parameters. We adopt Motion Vector Integration (MVI) to separate the intentional motion from undesired vibrations, and finally we compensate the undesired vibrations to stabilize the video. The paper is organized as follows: Section 2 introduces the details of the SURF implementation, including feature extraction and matching. Section 3 presents the motion estimation and filtering processes. Experimental results and analysis are presented in Section 4, and conclusions are summarized in Section 5.
2 SURF Implementation Details
SURF is a translation, rotation and scale invariant feature detector, which is based on the Hessian matrix because of its good performance in accuracy. Compared with SIFT, which uses the Difference of Gaussians (DoG) to approximate the Laplacian of Gaussian (LoG), SURF pushes the approximation even further. It approximates the Laplacian of Gaussian
by using a box filter to represent the corresponding kernel. The kernel approximation is computationally efficient because of the use of integral images, and hence the time consumption is independent of the filter size. Unlike SIFT, which repeatedly smooths the image with a Gaussian filter and halves it to build an image pyramid, SURF can directly apply a box filter of any size on the original image, which improves its computational efficiency. After building the image pyramid, the process continues by traversing the pyramid to remove points with low contrast and then searching for extrema across neighboring scale images. Finally, the points are localized to sub-pixel accuracy through scale space interpolation. The SURF descriptor uses Haar wavelet responses in the x and y directions to compute a reproducible orientation. To achieve rotation invariance, a square descriptor region is constructed along the dominant orientation, divided into 4×4 sub-regions, and the descriptor is extracted from it. In addition, SURF computes the Haar wavelet responses through integral images, which decreases the computational complexity; each wavelet requires only six operations. Since SURF divides all feature points into two types by the sign of the Laplacian, we can boost the matching speed by comparing the sign of the Laplacian first, as sketched below. In addition, we drop unreliable matches by comparing the ratio of the distance to the closest neighbor over the distance to the next closest neighbor against a predetermined threshold.
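As an illustration of this matching strategy (a minimal Python/NumPy sketch with hypothetical inputs, not the authors' implementation): descriptors are compared only when their Laplacian signs agree, and a pair is accepted only if it passes the distance-ratio test.

import numpy as np

def match_surf(desc1, sign1, desc2, sign2, ratio=0.7):
    # desc1, desc2: (N, 64) SURF descriptors; sign1, sign2: signs of the Laplacian.
    matches = []
    for i, (d, s) in enumerate(zip(desc1, sign1)):
        # Compare only features with the same sign of the Laplacian.
        candidates = np.where(sign2 == s)[0]
        if len(candidates) < 2:
            continue
        dists = np.linalg.norm(desc2[candidates] - d, axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        # Ratio test: keep the match only if the closest neighbor is clearly
        # better than the second closest one.
        if best < ratio * second:
            matches.append((i, int(candidates[order[0]])))
    return matches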
3 Motion Estimation
The previously extracted and matched features are used to approximate the global motion vector. First, we introduce the motion model adopted in the following section.

3.1 Motion Model
The real camera motion between frames is a 3D motion. As a trade-off between complexity and efficiency, we adopt a 2D affine model to describe the motion between frames:
\begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x' \\ y' \end{pmatrix} + \begin{pmatrix} T_x \\ T_y \end{pmatrix}    (1)
This model describes the pixel displacement between two frames, where x and y represent the pixel position in the current frame, and x' and y' represent the pixel position in the next frame. It includes four parameters: θ is the rotation angle, λ is the zoom factor, and Tx and Ty are the shifts in the x and y directions. To estimate these parameters, we need at least two pairs of matching features. After we extract the SURF features from two consecutive frames, we can put these pairs of features into the affine model and solve the resulting equations with the least-squares estimation method. Although we have roughly eliminated unreliable matches by comparing the ratio of distances with a preset threshold, the local motion vectors still contain some mismatched features. The local motion vectors may also contain matched features belonging to ego-moving objects, which do not reflect the camera motion. Since
the least squares method is sensitive to outliers, it would introduce estimation error if we estimated the motion parameters directly. To solve this problem and obtain accurate motion parameters, we adopt Random Sample Consensus (RANSAC) [11] to refine the matched features. The idea is to iteratively estimate the model parameters using minimal subsets of points randomly drawn from the input features; a sketch of this estimation step is given below. Figure 2 illustrates the comparison between the original matched feature set and the refined feature set. There are 106 pairs of matched features in the left image, including mismatched features. In the right image, all the mismatched features and some of the matched features are removed, and the number of matched features is reduced to 85.
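The estimation step could look roughly as follows (an illustrative Python/NumPy sketch, not the authors' code; the function names, number of iterations and inlier threshold are assumptions). It fits the four-parameter model of Eq. (1) by least squares and refines the matches with a simple RANSAC loop.

import numpy as np

def fit_similarity(src, dst):
    # Least-squares fit of Eq. (1): dst = lambda*R(theta)*src + T,
    # parameterized as a = lambda*cos(theta), b = lambda*sin(theta).
    A, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, -y, 1, 0]); rhs.append(u)
        A.append([y, x, 0, 1]); rhs.append(v)
    a, b, tx, ty = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)[0]
    return np.array([[a, -b], [b, a]]), np.array([tx, ty])

def ransac_similarity(src, dst, iters=200, thresh=2.0):
    # Repeatedly fit the model to minimal subsets (2 point pairs), keep the
    # largest consensus set, then refit on the inliers only.
    rng = np.random.default_rng(0)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=2, replace=False)
        M, T = fit_similarity(src[idx], dst[idx])
        err = np.linalg.norm(src @ M.T + T - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    M, T = fit_similarity(src[best], dst[best])
    return M, T, best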
3.2 Motion Filter
The motion vectors between frames can be divided into two parts: undesired jitter and intentional camera motion. Directly compensating the motion with the original motion vectors would cause errors, since only the undesired jitter needs to be compensated. In addition, for real-time application we need high-speed performance: we cannot simply store frames and wait until a certain number of frames has accumulated before processing them together. We therefore need a real-time motion separation method. Motion Vector Integration (MVI) with an adaptive damping coefficient [2] is a simple and fast method which not only filters the cumulative motion curve but also changes the damping extent according to the two most recent global motion vectors. The cumulative motion vector at frame n is the sum of the previous n-1 global motion vectors plus the global motion vector at frame n. In MVI, the cumulative motion vector at frame n-1 is multiplied by a damping coefficient δ which depends on the values of the latest two global motion vectors. The integrated motion vector at frame n can be represented as:

IMV(n) = \delta \cdot IMV(n-1) + GMV(n)    (2)

where GMV(n) is the global motion vector between frame n-1 and frame n. If the last two global motion vectors are small, δ is set to a high value close to 1; in this case the integrated motion vector at frame n strongly stabilizes the video. Correspondingly, if the last two global motion vectors are large, δ is set to a relatively low value to compensate the undesired small jitter while preserving the major camera trajectory.
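A minimal sketch of this adaptive integration of Eq. (2), applied per motion component (the damping values and the threshold that decides whether the recent motion is "small" are illustrative assumptions, not the settings of the paper):

def integrate_motion(prev_imv, gmv, prev_gmv, small=1.0, d_high=0.95, d_low=0.6):
    # Strong damping (delta close to 1) when the last two global motion
    # vectors are small, weaker damping when the camera is clearly panning.
    delta = d_high if max(abs(gmv), abs(prev_gmv)) < small else d_low
    return delta * prev_imv + gmv   # Eq. (2): IMV(n) = delta*IMV(n-1) + GMV(n)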
4 Experimental Results
We evaluated the performance of the proposed method with several video sequences covering different types of scenes, observing the processing speed and the number of features per frame. The frame size of all input sequences was fixed at 240×320. The experiments were carried out with Visual Studio 2008 on a Windows Vista system with an Intel Core 2 Duo 2.4 GHz CPU. We adopted the Inter-frame Transformation Fidelity (ITF) [2] to evaluate the video stabilization performance. ITF is computed as:
ITF = \frac{1}{N_{frame}-1} \sum_{n=1}^{N_{frame}-1} PSNR(n)    (3)

where N_{frame} is the number of frames in the video and PSNR(n) is the corresponding Peak Signal-to-Noise Ratio (PSNR) between frame n-1 and frame n, which is defined as:

PSNR(n) = 10 \log_{10} \frac{I_{MAX}^2}{MSE(n)}    (4)
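For reference, a small Python/NumPy sketch of these two measures, assuming 8-bit grayscale frames (this is not the authors' code):

import numpy as np

def psnr(frame_a, frame_b, i_max=255.0):
    # Peak Signal-to-Noise Ratio between two consecutive frames, Eq. (4).
    mse = np.mean((frame_a.astype(np.float64) - frame_b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(i_max ** 2 / mse)

def itf(frames):
    # Inter-frame Transformation Fidelity, Eq. (3): mean PSNR over all
    # consecutive frame pairs of the sequence.
    values = [psnr(frames[n - 1], frames[n]) for n in range(1, len(frames))]
    return sum(values) / (len(frames) - 1)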
ITF is, in fact, the average Peak Signal-to-Noise Ratio (PSNR) of the entire video stream. If motion occurs between frames, the ITF is low, so a video processed by a stabilization system has a relatively high ITF, and a higher value is desired. Figure 2 (i) shows a set of video frames illustrating the effectiveness of our stabilization method. The 10th, 20th and 35th frames of the video are shown in Figure 2 (i): the top row shows the original video frames and the bottom row shows the stabilized video frames. An additional coordinate frame (marked in red) is added to locate the relative position of objects in the video sequence. In the stabilized video sequence, the scene remains static and the undesired motion is completely removed. Figure 2 (ii) shows the PSNR comparison between the original sequence and the stabilized sequence. As can be seen, the PSNR value for each pair of frames in the stabilized sequence is relatively high, which shows that the SURF based video stabilization system has better performance.
Fig. 2. Result of static scene video stabilization. (i) (a)-(c): original input video sequence; (d)-(f): stabilized video sequence. (ii) PSNR comparison between the original video and the stabilized video.
The next test evaluates the system's capability to process video captured by a static camera with moving objects. Figure 3 (i) shows the result. Although moving objects are included in the video, RANSAC eliminates the feature points extracted from them, which is why a moving object does not influence the video stabilization performance. Figure 3 (ii) shows the corresponding performance curve. As anticipated, the stabilized video has a higher average PSNR value than the original video.
Fig. 3. Result of static scene with moving object video stabilization. (i) (a)-(c): original input video sequence; (d)-(f): stabilized video sequence. (ii) PSNR comparison between the original video and the stabilized video.
Fig. 4. Result of moving scene with moving object video stabilization. (i) (a)-(c): original input video sequence; (d)-(f): stabilized video sequence. (ii) PSNR comparison between the original video and the stabilized video.
However, unlike Figure 2, the PSNR difference between the original video and the stabilized video decreases after frame 140. This effect results from the moving object in the video: since PSNR measures the similarity between two frames, even though the full-frame motion is completely compensated, the baby's movement remains, which reduces the PSNR value; before frame 140 the baby has not yet appeared, so the PSNR is higher. The last test is carried out on a video with moving objects captured by an ego-moving camera, and Figure 4 shows the result. Since the camera moves intentionally, the improvement cannot be observed directly in the images; it is reflected in Figure 4 (ii). The stabilized video (blue curve) has a higher PSNR than the original video (red curve), and the average PSNR of the stabilized video is much higher than that of the original one. In addition, we can infer the intentional motion of the camera from the graph: the PSNR values between frames 250 and 550 are lower, which means the camera has intentional motion in this interval. Although the undesired vibration has been completely removed, MVI preserved the
intentional camera motion that causes the scene changes, which greatly degrades the PSNR and creates a trough in the curve as a result. Table 1 gives the ITF values for the original and the SIFT- and SURF-stabilized sequences. SIFT and SURF yield the same ITF for the stabilized videos. However, SURF took only about 150 ms on average to extract features from each frame, whereas the SIFT based technique took about 2 seconds on average per frame. Static scene video stabilization shows the best performance, with a 6.7 dB improvement; the other two cases show 5.7 dB ITF improvements. The moving objects introduced a greater Mean Square Error between consecutive frames in the last two cases.

Table 1. ITF on the original and stabilized sequences

Sequence                             Original ITF (dB)   SIFT Stabilized ITF (dB)   SURF Stabilized ITF (dB)
Static Scene                         21.1                27.8                       27.8
Static Scene + Moving Object         31.0                36.7                       36.7
Moving Object + Ego-moving Camera    22.0                27.7                       27.7

5 Conclusions
In this paper, we proposed an efficient approach for video stabilization. We adopted speeded-up robust features as the feature descriptor. The features are extracted and tracked in each frame and matched by comparing the ratio of the distance to the closest neighbor over the distance to the next closest neighbor. After that, we further refine the matches through RANSAC, estimate the motion parameters with the least-squares method, and compute the integrated motion vector through MVI. Finally, we compensate the undesired jitter with the pre-computed motion parameters. Since the cost of the box-filter convolution in SURF is independent of the filter size, the SURF based video stabilization is significantly faster than the SIFT based method. If box filters of any size are applied directly on the original image in parallel with the help of additional hardware, the speed of this video stabilization method can be further improved, which would make real-time video stabilization possible for larger video frames as well.
References 1. Battiato, S., Puglisi, G., Bruna, A.R.: A Robust Video Stabilization System By Adaptive Motion Vectors Filtering. In: IEEE International Conference, pp. 373–376 (2008) 2. Auberger, S., Miro, C.: Digital Video Stabilization Architecture for Low Cost Devices. In: 4th International Symposium on Image and Signal Processing and Analysis, pp. 474–479 (2005) 3. Chen, T.: Video Stabilization Algorithm Using a Block-Based Parametric Motion Model. Stanford University, EE392J Project Report winter (2000)
4. Srinivasa Reddy, B., Chatterji, B.N.: An FFT-Based Technique for Translation, Rotation, and Scale-Invariant Image Registration. IEEE Transaction on Image Processing 5(8), 1266–1271 (1996) 5. Chang, J.-Y., Hu, W.-F., Cheng, M.-H., Chang, B.-S.: Digital Image Translational And Rotational Motion Stabilization Using Optical Flow Technique. IEEE Transactions on Consumer Electronics 48(1), 108–115 (2002) 6. Denman, S., Fookes, C., Sridharan, S.: Improved Simultaneous Computation of Motion Detection and Optical Flow for Object Tracking. Digital Image Computing: Techniques and Applications, 175–182 (2009) 7. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision (2004) 8. Battiato, S., Gallo, G., Puglisi, G., Scellato, S.: SIFT Features Tracking for Video Stabilization. In: 14th International Conference on Image Analysis and Processing, ICIAP 2007, pp. 825–830 (2007) 9. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU) 110(3), 346–359 (2008) 10. Ramisa, A., Vasudevan, S., Aldavert, D.: Evaluation of the SIFT Object Recognition Method in Mobile Robots. In: Proceedings of the 12th International Conference of the Catalan, pp. 9–18 (2009) 11. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communication of ACM 4(6), 381–395 (1981) 12. Juan, L., Gwun, O.: A Comparison of SIFT, PCA-SIFT and SURF. International Journal of Image Processing (IJIP) 3(4), 143–152 13. Kwon, O., Shin, J., Paik, J.: Video Stabilization Using Kalman Filter and Phase Correlation Matching. LNCS, pp. 141–148 (2005)
Detection of Defect in Textile Fabrics Using Optimal Gabor Wavelet Network and Two-Dimensional PCA
A. Srikaew1, K. Attakitmongcol1, P. Kumsawat2, and W. Kidsang1
1 School of Electrical Engineering, 2 School of Telecommunication Engineering, Institute of Engineering, Suranaree University of Technology, 111 University Avenue, Muang District, Nakhon Ratchasima, Thailand
{ra,kitti,prayoth}@sut.ac.th, [email protected]
Abstract. The aim of production line enhancement in any industry is to improve quality and reduce operating costs by applying various kinds of advanced technology. In order to become more competitive, many sensing, monitoring, and control approaches have been investigated in the textile industry. Automated visual inspection is one area of improvement where real cost savings can be realized over traditional inspection techniques. Manual visual inspection of textile products is expensive and error-prone because of the difficult working environment near the weaving machine. Automated visual detection of fabric defects is particularly challenging due to the large variety of fabric defects and their various degrees of vagueness and ambiguity. This work presents a hybrid application of Gabor filter and two-dimensional principal component analysis (2DPCA) for automatic defect detection of texture fabric images. An optimal filter design method for Gabor Wavelet Network (GWN) is applied to extract texture features from textile fabric images. The optimal network parameters are achieved by using Genetic Algorithm (GA) based on the non-defect fabric images. The resulting GWN can be deployed to segment and identify defect within the fabric image. By using 2DPCA, improvement of defect detection can significantly be obtained. Experimental results indicate that the applied Gabor filters efficiently provide a straight-forward and effective method for defect detection by using a small number of training images but still can generally handle fabric images with complex textile pattern background. By integrating with 2DPCA, desirable results have been simply and competently achieved with 98% of accuracy.
1 Introduction
Nowadays, the textile and garment industry is one of the most competitive industries in both marketing and production technology. Rising wages have become the main reason for manufacturers to develop technology for reducing operating costs and increasing product quality.
Especially for textile material, any defective appearance can reduce its price significantly. For product quality control, the inspection process is usually done manually by a human operator, which is exhausting and error-prone. An automatic defect detection system can therefore deliver a faster production line and better product quality for the textile industry. Over the last decade, many automatic defect detection systems for fabric images have been developed. Initially, such systems were deployed for defect detection of solid color fabrics with threshold techniques [1]. Later on, systems to detect more complicated fabric patterns, such as slanted pattern fabric and jean, were introduced. Many methods have been proposed for fabric defect detection, including statistical [2], spectral [3], and structural [4] methods. One of the most popular spectral-based tools is the Gabor filter, which has been widely used for defect detection of fabric images [2,5,6]. Gabor filter banks with various scales and orientations can be tuned to a desired pattern, and these filter banks have a direct influence on classification and recognition performance [7]. On the other hand, an optimal Gabor filter can be used without being limited to fixed scale and orientation information; the remaining issue is then to determine the optimal parameters of the Gabor filters. This work presents the use of a genetic algorithm (GA) to optimize the parameters of a Gabor wavelet network and the application of two-dimensional principal component analysis (2DPCA) to improve defect detection in fabric images. The overall system diagram is shown in Fig. 1.
Fig. 1. Overall fabric defect detection system
2 Gabor Filters
The Gabor filter is a bandpass filter whose impulse response is obtained by modulating a Gaussian function with a sinusoidal function.
Equations (1) and (2) show the 2D Gabor function, which consists of both odd and even parts, where (x', y') are the coordinates (x, y) rotated by θ, ω is the central frequency of the Gabor function modulation, and σ_x and σ_y are the Gaussian standard deviations along the x and y axes, respectively. Fig. 2 shows the odd and even functions of the Gabor filter in the time domain. The odd function can be used for edge detection, while the even function is widely used for stain and defect detection [8,9].
g(x', y') = \frac{1}{2\pi\sigma_x\sigma_y} \, e^{-\frac{1}{2}\left[\left(\frac{x'}{\sigma_x}\right)^2 + \left(\frac{y'}{\sigma_y}\right)^2\right]} \, e^{2\pi j \omega x'}    (1)

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}    (2)

Fig. 2. Gabor filter in time domain (a) Even function (real part) (b) Odd function (imaginary part)
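For illustration, the even and odd parts of Eqs. (1)-(2) can be sampled on a small grid as follows (a hedged Python/NumPy sketch; the parameter values in the example call are placeholders, not values used in the paper):

import numpy as np

def gabor_kernel(size, omega, theta, sigma_x, sigma_y):
    # Sample the 2D Gabor function of Eqs. (1)-(2) on a size x size grid and
    # return its even (real) and odd (imaginary) parts.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate the coordinate frame by theta, Eq. (2).
    xr = x * np.cos(theta) - y * np.sin(theta)
    yr = x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    envelope /= 2.0 * np.pi * sigma_x * sigma_y
    g = envelope * np.exp(2j * np.pi * omega * xr)   # modulation along x'
    return g.real, g.imag

even, odd = gabor_kernel(5, omega=0.25, theta=0.0, sigma_x=2.0, sigma_y=2.0)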
3 Gabor Wavelet Network
The Gabor Wavelet Network (GWN) has been proposed to solve two-dimensional pattern recognition problems [10]. The transfer function of the hidden layer, represented by the imaginary (odd) part of the Gabor function, is given in Equation (3), where the w_i are the synaptic weights of the network. The vector input [x y]^T of the network is the spatial position of each pixel of the input image, and the output of the network is the gray-level value of that pixel.

f(x, y) = \sum_{i=1}^{n} w_i \, g_o^i(x, y)    (3)
Fig. 3. Gabor wavelet network architecture

Fig. 3 shows the GWN architecture, which is a feed-forward network. Each Gabor wavelet g_o^i, represented by Equation (4), is defined by the parameters: translations (t_x^i and t_y^i), orientation (θ^i), Gaussian standard deviations (σ_x^i and σ_y^i), modulation central frequency (ω^i), and the network weight (w_i). Equation (5) is the objective function for training the network to reconstruct the input image I_M (a non-defect fabric image).

g_o^i(x, y) = \exp\left( -\frac{\left[(x - t_x^i)\cos\theta^i - (y - t_y^i)\sin\theta^i\right]^2}{2(\sigma_x^i)^2} - \frac{\left[(x - t_x^i)\sin\theta^i + (y - t_y^i)\cos\theta^i\right]^2}{2(\sigma_y^i)^2} \right) \times \sin\left( 2\pi\omega^i \left[(x - t_x^i)\cos\theta^i - (y - t_y^i)\sin\theta^i\right] \right)    (4)

Err = \min \left\| I_M - \sum_i w_i \, g_o^i \right\|_2^2    (5)
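A hedged Python/NumPy sketch of Eqs. (4)-(5): one odd Gabor wavelet is evaluated on the pixel grid and the reconstruction error of the network is computed (the data layout and function names are assumptions made for illustration):

import numpy as np

def gabor_wavelet(x, y, tx, ty, theta, sx, sy, omega):
    # Odd Gabor wavelet of Eq. (4) evaluated on coordinate grids x, y.
    u = (x - tx) * np.cos(theta) - (y - ty) * np.sin(theta)
    v = (x - tx) * np.sin(theta) + (y - ty) * np.cos(theta)
    return np.exp(-u ** 2 / (2 * sx ** 2) - v ** 2 / (2 * sy ** 2)) * np.sin(2 * np.pi * omega * u)

def gwn_reconstruction_error(image, params, weights):
    # Eq. (5): squared L2 norm between the prototype image and the weighted
    # sum of wavelets; 'params' holds one (tx, ty, theta, sx, sy, omega)
    # tuple per hidden unit. A GA would minimize this value.
    h, w = image.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    recon = sum(wi * gabor_wavelet(x, y, *p) for wi, p in zip(weights, params))
    return np.sum((image - recon) ** 2)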
4 Optimal Gabor Wavelet Network
The parameters of the odd Gabor function (t_x, t_y, θ, σ_x, σ_y, ω, and w_i) are optimized using a genetic algorithm (GA) with the objective function from Equation (5) for training the network. In this work, 70 parameters from 10 Gabor wavelets are searched and used for image reconstruction with minimal error relative to the prototype image. Fig. 4 shows an example of the prototype image and the image reconstructed with the optimal GWN. The size of the filter mask is 5×5 pixels. The optimal Gabor filter is chosen from these 10 Gabor wavelets as the one with the maximum cost-function value [11], as shown in Table 1. Fig. 5-(a) shows the test fabric image convolved with the 5×5 Gabor filter mask. The resulting images convolved with the even function and with both the even and odd functions are displayed in Fig. 5-(b) and 5-(c). The convolution with both the even and odd functions yields a more prominent defect area within the image. More suitable Gabor parameters result in better background suppression and defect saliency, which is the main goal of a textile defect detection system.
Table 1. Optimal parameters from GA searching

t_x   t_y   θ         σ_x      σ_y       ω         w_i
0     63    -3.0309   8.9421   39.2850   18.0926   0.7856

Fig. 4. (a) Prototype image (b) Reconstructed image from the optimal GWN
Fig. 5. (a) Test fabric image convolved with a 5×5-pixel Gabor filter mask (b) Test image convolved with the even function (c) Test image convolved with both the even and odd functions
5 2-D Principal Component Analysis
Two-dimensional principal component analysis (2DPCA) is applied in this work to obtain a two-dimensional representation of the fabric image samples. This reduced-dimension version of the fabric image helps improve the efficiency of defect detection [12]. The best 2DPCA features (Y_i^j, where i = 1, ..., d and j = 1, ..., M), created from M samples of non-defect fabric images (I_j^P, where j = 1, ..., M), are compared with the 2DPCA features (Y_i^B, where i = 1, ..., d) of an input image (I^B) in order to detect defects in the input image. To compare the non-defect image prototype with the input image, the Euclidean distance between the 2DPCA features of the two images (I^P) and (I^B) is determined using Equation (6). Examples of both defect and non-defect fabric images convolved with the optimal Gabor filter are displayed in Table 2.
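A minimal sketch of 2DPCA and the distance of Eq. (6) (Python/NumPy, written for illustration only; the function names are assumptions and the images are assumed to be Gabor-filtered gray-level matrices of equal size):

import numpy as np

def fit_2dpca(images, d):
    # Image covariance matrix G = (1/M) * sum((A - mean)^T (A - mean));
    # the top-d eigenvectors are the 2DPCA projection axes.
    mean = np.mean(images, axis=0)
    G = sum((a - mean).T @ (a - mean) for a in images) / len(images)
    vals, vecs = np.linalg.eigh(G)
    return vecs[:, np.argsort(vals)[::-1][:d]]

def project(image, axes):
    # Feature matrix Y = A * X; column Y_i corresponds to projection axis i.
    return image @ axes

def distance(test, prototype, axes):
    # Eq. (6): sum of Euclidean distances between corresponding feature
    # columns; a threshold (0.3 in the paper) then separates defect from
    # defect-free samples.
    yb, yp = project(test, axes), project(prototype, axes)
    return sum(np.linalg.norm(yb[:, i] - yp[:, i]) for i in range(yb.shape[1]))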
Table 2. Example of defect detection results using 2DPCA. For each sample (Gabor image, segmented image and detection result shown in the original figure), the Euclidean distance to the non-defect prototype is listed.

Sample                Euclidean Distance
Non-defect fabric     0.0987
Dirty yarn            0.6811
Slack end             0.3573
Thick bar             0.5414
Mispick               0.6810
Wrong draw            0.3517
Tear defect           0.4063
Thin bar              0.3700
Netting multiple      0.4016
Fig. 6. Defect and non-defect fabric image classification results
The first row is an example of a non-defect fabric sample. The rest of the table demonstrates 8 types of defect fabric samples: dirty yarn, slack end, thick bar, mispick, wrong draw, tear defect, thin bar, and netting multiple. From the data analysis, a Euclidean distance threshold of 0.3 was empirically derived to separate defect from defect-free fabric images. The details and discussion of the results are presented in the next section.

dist(I^B, I_j^P) = \sum_{i=1}^{d} \left\| Y_i^B - Y_i^j \right\|_2    (6)

6 Results and Discussion
The proposed defect detection system for fabric images has been tested on 256×256-pixel images from the database of Central Textiles Limited, Hong Kong [13,14]. There are 18 non-defect fabric images and 32 defect images of various types. Fig. 6 displays the classification results for both defect and non-defect fabric images, including the pass/fail threshold value of 0.3. The arrow in Fig. 6 identifies the single misclassification of the system, in which a non-defect image is incorrectly classified as a defective sample (see Fig. 7). This misclassified image would likely be considered ambiguous even by the human eye. The results of the system have also been evaluated using the popular receiver operating characteristic (ROC) graph [15], both with and without 2DPCA, in order to confirm the efficiency of using 2DPCA for improving the classification accuracy.
Fig. 7. (a) Correct detection of a non-defect image (Euclidean distance = 0.1710) (b) Incorrect detection of a non-defect image as defective (Euclidean distance = 0.8306)

Table 3. Defect detection results from ROC graph

Method                  Input Type    Found   Not Found
Gabor Filter            Defect        26      6
                        Non-defect    4       14
Gabor Filter + 2DPCA    Defect        32      0
                        Non-defect    1       17
Total number of test images: 32 (defect), 18 (non-defect)

Table 4. Percent efficiency of the proposed system, where TPR (True Positive Rate) refers to correct detection of defect images and FPR (False Positive Rate) refers to detection of a non-defect image as a defect image

Detection Details    % Efficiency w/o 2DPCA    % Efficiency w/ 2DPCA
Accuracy             80                        98
TPR                  81.3                      100
FPR                  22.22                     5.56
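As a quick check on the numbers in Tables 3 and 4, the accuracy, TPR and FPR can be recomputed from the counts (a small illustrative Python snippet, not part of the original paper):

def rates(found_defect, missed_defect, false_alarm, correct_clean):
    # Accuracy, TPR and FPR from the counts of Table 3.
    total = found_defect + missed_defect + false_alarm + correct_clean
    accuracy = (found_defect + correct_clean) / total
    tpr = found_defect / (found_defect + missed_defect)
    fpr = false_alarm / (false_alarm + correct_clean)
    return accuracy, tpr, fpr

print(rates(32, 0, 1, 17))   # with 2DPCA: approximately (0.98, 1.0, 0.056)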
The ROC test results are shown in Table 3 and Fig. 8. The results clearly show that using 2DPCA with the Gabor filter provides a significant improvement in defect classification of fabric images. The percent efficiency of the system, with 98% accuracy, is displayed in Table 4. The detection results show that the system is able to detect various kinds of fabric defects at any position within the image and against a complex fabric pattern background. Applying 2DPCA also provides a meaningful improvement in detecting fabric defects, especially the thin bar, thick bar and wrong draw defect types. The supervised training of the GWN with only a small number of samples is also very attractive: the generalization of the trained GWN allows the system to handle new input images flawlessly.
Fig. 8. ROC Efficiency of the proposed system
7 Conclusions and Future Work
This work presents the application of Gabor filters for automatic defect detection of textured fabrics. An optimal filter design method for the Gabor Wavelet Network (GWN) is proposed to extract texture features from textile fabric images. The optimal Gabor filter is obtained using a Genetic Algorithm (GA) based on the extracted features. The resulting filtered images are then segmented and labeled to identify defective fabric images using 2DPCA. Experimental results indicate that the applied Gabor filters provide a straightforward and effective method for texture feature extraction, and 2DPCA gives a significant improvement for detecting defects while providing a detection accuracy of 98%. The remaining misclassification, however, needs to be taken into account, and extending the test image data could further improve the system performance. Furthermore, the system can efficiently detect various types of fabric defects at any position within images having a typical complex fabric pattern background using only a small number of training samples. While the system is capable of effectively identifying fabric defects, it lacks the capability of classifying the type of defect. Future work will investigate improvements in detection accuracy and reduction of false positives. Acknowledgement. The financial support from Suranaree University of Technology is greatly acknowledged.
References 1. Wang, J., Campbell, R., Harwood, R.: Automated inspection of carpets. In: Proceedings of SPIE, vol. 2345, pp. 180–191 (1995) 2. Kumar, A., Pang, G.: Defect detection in textured materials using optimized filters. IEEE Transaction on Systems, Man, and Cybernetics: Part B 32, 553–570 (2002) 3. Gonzalez, R., Woods, R.: Digital Image Processing, 2nd edn. Addison-Wesley Publishing Company, Reading (2002) 4. Allen, R., Mills, D.: Signal Analysis: Time, Frequency, Scale, and Structure. Wiley Interscience, Hoboken (2004) 5. Escofet, J., Navarro, R., Millan, M., Pladelloreans, J.: Detection of local defects in textiles webs using gabor filters. Optical Engineering 37, 2297–2300 (1998) 6. Bodnarova, A., Bennamoun, M., Latham, S.: Optimal gabor filters for textile flaw detection. Pattern Recognition 35, 2973–2991 (2002) 7. Mak, K., Peng, P.: Detecting defects in textile fabrics with optimal gabor filters. Transactions on Engineering, Computer and Technology 13, 75–80 (2006) 8. Mehrotra, R., Namuduri, K., Ranganathan, N.: Gabor filter-based edge detection. Pattern Recognition 25, 1479–1494 (1992) 9. Cesacent, D., Smokelin, J.: Neural net design of gabor wavelet filters for distortioninvariant object detection in cluster. Optical Engineering 33, 2264–2271 (1994) 10. Krueger, V., Sommer, G.: Gabor wavelet network for object representation. In: DAGM Symposium, Germany, pp. 13–15 (2000) 11. Liu, H.: Defect detection in textiles using optimal gabor wavelet filter. In: IEEE Proceedings of the 6th World Congress on Intelligent Control and Automation, Dalian, China, pp. 10005–10007 (2006) 12. Yang, J., Zhang, D., Frangi, A.: Two-dimensional pca: A new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 131–137 (2004) 13. Lee, T.c.: Fabric defect detection by wavelet transform and neural network. Master’s thesis, University of Hong Kong (2004) 14. http://www.centraltextiles.com/ 15. Tom, F.: ROC Graph: Notes and Practical Considerations for Researchers. Kluwer Academic Publishers, Dordrecht (2004)
Introducing Confidence Maps to Increase the Performance of Person Detectors Andreas Zweng and Martin Kampel Vienna University of Technology, Favoritenstr. 9/183, A-1040 Vienna, Austria [email protected], [email protected] http://www.caa.tuwien.ac.at/cvl/
Abstract. This paper deals with the problem of computational performance of person detection using the histogram of oriented gradients feature (HOG). Our approach increases the performance for implementations of person detection using a sliding window by learning the relationship of sizes of search windows and the position within the input image. In an offline training stage, confidence maps are computed at each scale of the search window and analyzed for a reduction of the number of used scales in the detection stage. Confidence maps are also computed during detection in order to make the classification more robust and to further increase the computational performance of the algorithm. Our approach shows a significant improvement of computational performance, while using only one core of the CPU and without using a graphics card in order to allow a low-cost solution of person detection using a sliding window approach.
1 Introduction
A sliding window approach is used to scan an image for a trained model within the search window. The histogram of oriented gradients (HOG) feature is used to train a model (e.g., of people) and to find this model in the detection stage using a sliding window [2]. Several modified implementations extend the work of Dalal et al. [4,7]. When the model appears at different scales within the image, the search window has to be slid across the image at different scales. The number of scales and the values of the scaling factors can be defined manually if the range of sizes within the image is known; otherwise, all possible scales of the search window have to be processed, which requires a large amount of computation. For the PETS 2009 dataset, the model, which is of size 128 by 64 pixels, requires 31 different scales to cover all possible sizes within the input image of size 768 by 576 pixels. Our approach aims to reduce the number of scales so that each position in the image is scanned with a limited number of scales dynamically computed from the confidence maps (CM). For performance enhancement, in [8] a special multi-core processor is used to split the computation and enhance the computational performance. Another parallel algorithm for multi-core systems is introduced in [9], where AdaBoost is used for person detection. GPU-based improvements
have been developed in [1] and [6], where GPUs are well suited to the computation of the histogram of oriented gradients due to their fast floating point arithmetic. Our main contribution in this paper is an enhancement of the computational performance of person detection algorithms using a sliding window approach on low cost hardware, which is not possible with the above-mentioned approaches. Our algorithm can additionally be combined with GPU-enhanced implementations of person detection algorithms as well as multi-core enhanced implementations, since the performance increase is achieved with a greedy algorithm and without expensive hardware; the computational performance of person detection can therefore be increased further with better hardware. The paper is structured as follows: Section 2 describes the methodology of our algorithm, Section 3 shows results on computational performance as well as detection performance, and Section 4 concludes the paper.
2 Methodology
Our approach consists of an offline training stage, in which CM are computed in order to minimize the number of sliding window scales and thus the number of computations, and a detection stage, in which CM are computed in order to improve the robustness of the classifier and to further improve the speed of the algorithm by using temporal information, namely the positive responses of the classifier described in Section 2.3.

2.1 Confidence Maps
The output of the classifier using a sliding window approach is the confidence of a positive match between that particular position and the pre-trained model at a particular scale. A confidence map is a matrix which represents the matching confidences in the image for each position of the sliding window. The number of CM is equal to the number of scales of the sliding window. Depending on the stage of the algorithm (offline training or detection), the CM represent different confidences, which are defined in the following sections.

2.2 Offline Training
During offline training, sequences of images are analyzed in order to minimize the number of scales and thus the computational effort. For training, a CM is computed for each sliding window scale. Each element in each trained CM is the sum of all positive confidences over the input frames of the training sequence (see Equation 1):

CT_{(x,y)s} = \sum_{f=0}^{n} C_{(x,y)s}, \quad \forall \, C_{(x,y)s} > 0    (1)

In Equation 1, C_{(x,y)s} and CT_{(x,y)s} represent the current CM and the final trained CM for scale s, respectively, where x is the index of the horizontal resolution and y is the index of the vertical resolution of the CM at scale s; f is the index of the current frame of the training sequence and n is the total number of frames.
Fig. 1. 30 confidence maps (10 by 3) computed from training
Figure 1 illustrates the trained CM of a sequence from the PETS 2009 dataset. The figure displays the first 30 CM (out of 31) in a 10 by 3 matrix. The sizes of the maps decrease with increasing size of the search window scale, since the window fits into the image at fewer positions at larger scales. The CM have been resized to a uniform size for further processing. The values of the elements in the CM are proportional to the brightness in the figure, where dark areas denote low confidences and bright areas denote high confidences relative to each other. A sample image of the sequence from which the CM in Figure 1 have been computed is shown in Figure 2(a).
(a) Low camera view
(b) High camera view
Fig. 2. Sample images of the input sequence
The next step in the offline training stage is to find the most probable sliding window scale for each position in the input image. Therefore, all trained CM are resized to the size of the biggest map, and for each element the index of the scale with the highest value is stored in a new matrix which we will call the index map (see Equation 2).

\forall x, y: \; IM_{x,y} = \arg\max_{s=1,\dots,n} CR_{(x,y)s}    (2)
IM_{x,y} denotes the element at horizontal position x and vertical position y of the index map, and CR_{(x,y)s} is the resized CM at scale s and position (x, y). The resulting index map is illustrated in Figure 3(a).
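A hedged Python/NumPy sketch of how the trained confidence maps of Eq. (1) could be turned into the index map of Eq. (2) and into the per-scale binary lookup maps described below (variable names are assumptions):

import numpy as np

def index_map(trained_cms):
    # Eq. (2): for every position, keep the index of the scale whose trained
    # confidence map has the highest value.
    stack = np.stack(trained_cms, axis=0)        # shape (n_scales, H, W)
    return np.argmax(stack, axis=0)

def binary_confidence_maps(im, n_scales):
    # One binary lookup map per scale: 1 where the detector should be run
    # with that window scale, 0 where that scale can be skipped.
    return [(im == s).astype(np.uint8) for s in range(n_scales)]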
Fig. 3. (a) Index map computed from a PETS 2009 sequence (b) Interpolated index map using thin-plate smoothing splines
The values in this map are the indices of the CM with the highest value among all CM at a particular position (the contrast of the image has been enhanced for illustration). Depending on the camera position, the indices decrease in a certain direction with a certain magnitude, which is caused by the point of view of the camera. High values in the index map correspond to bigger persons than low values. In Figure 3(a), the sizes of the people in the corresponding image increase as they come closer to the camera, which is reflected by high values of the index map at the bottom of the map. This map serves as a lookup table for the chosen search window scale. As a final step, binary confidence maps (BCM) are computed from the index map IM_s, which represent a lookup table for each sliding window scale s (see Figure 4), where ones (white pixels in the maps) indicate that the HOG feature at that position in IM_s is compared with the pre-trained model, and zeros (black pixels in the maps) indicate that the computation at that position in IM_s is skipped.
Fig. 4. Binary confidence maps serve as lookup tables for the sliding window approach
The image illustrates the first 30 BCM in a 10 by 3 matrix. The BCM are intended to serve as lookup tables, but at this stage of the algorithm they cannot yet be used for that purpose, since the borders and other positions are zero, which means no person has been found at those positions in the training stage. The BCM should have a value at each position, indicating the size at which a person appears at that position of the image. This problem can be solved by an interpolation of the index map
using the RANSAC algorithm for example [5]. Figure 5 illustrates a column of the index map from the top to the bottom of the index map.
Fig. 5. A column of the index map
In Figure 5 the points (rectangles) with the darker border denote the indices in the index map column where a person has been found and the points with the bright border denote the indices in the index map column where no person has been found in the training sequence (zero values). The interpolation using the RANSAC algorithm (which is similar to a least-squares approximation) results in the binary index maps shown in Figure 6.
Fig. 6. Binary confidence maps computed from the interpolated index map using a least-squares approximation
While the least-squares approximation interpolates a flat plane through the image, the curve in Figure 5 shows a nonlinear increase. Therefore, we use thin-plate smoothing splines instead of the least-squares approximation to interpolate the final index map, which is illustrated in Figure 3(b). Zero values are also considered in the algorithm, which distorts the resulting index map. Another problem is that the thin-plate smoothing splines adapt to misclassifications in the training stage, which cause the higher values at the top of the index map in this sequence. A solution to both problems is to use a very long training sequence in which every position in the search space of the sliding windows is found at least once and persons are detected more than once at each position, so that outliers like those at the top of the index map are suppressed and the correct sliding window scale is more likely to be found at each position in the image. The corresponding BCM computed from the index map shown in Figure 3(b) are illustrated in Figure 7.
Fig. 7. Binary confidence maps computed from the interpolated index map using thin-plate smoothing splines
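One possible way to realize this interpolation is SciPy's radial basis function interpolator with a thin-plate kernel (a sketch under that assumption; the smoothing value is arbitrary and this is not necessarily how the authors implemented it):

import numpy as np
from scipy.interpolate import Rbf

def interpolate_index_map(im):
    # Fit a thin-plate smoothing spline only to positions where a detection
    # occurred during training (non-zero entries), then evaluate it densely.
    ys, xs = np.nonzero(im)
    spline = Rbf(xs, ys, im[ys, xs], function='thin_plate', smooth=1.0)
    gy, gx = np.mgrid[0:im.shape[0], 0:im.shape[1]]
    filled = spline(gx.ravel(), gy.ravel()).reshape(im.shape)
    return np.clip(np.rint(filled), 0, None).astype(int)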
2.3 Computational Performance Optimization
The index map can be used to optimize the computational performance of the sliding window approach. It is therefore used to compute binary maps which are filled with ones in areas where the computation for a matching should be done and filled with zeros in areas where the search window of that particular scale should skip the computation. The increase of the computational performance is shown in the evaluation section. An additional performance enhancement has been implemented using the information of previous frames. Each element in each CM retrieves a probability value for the next frame. A probability of 1.0 means that the actual element at the actual map will be processed for the next frame, a value of 0.5 means that the element will be processed by a chance of 50% in the next frame and a value of 0 means that the element will be processed by a chance of pt in the next frame, where pt is a threshold which is set to 0.1 for our sequences and denotes the lower boundary for the probability of the actual element to be processed. The probability is computed as follows. For each element in each CM an element is stored in another map which we call a delay map. The delay map is initialized with zeros and increases an element by the value of inc if the corresponding element in the CM is 0 or lower, which means that no person has been found using the HOG feature at that particular position, otherwise (if a person has been found at that position and scale) the delay map is set to zero at the actual position and at each position of the 4-neighborhood. inc is a parameter which denotes the learning speed of the algorithm. It is dependent on the frequency of persons moving around in the image. In our case for the PETS 2009 image sequences we have set inc = 0.005. An element of the delay map reaches a maximum at 1.0 − pt which is 0.9 in our case and means that there is a chance of 90% that the actual element will be skipped for computation. The value inc = 0.005 means that an element in a particular CM can be 0 or less (which is the value of the output of the classifier) for 180 frames until it reaches the maximum of 0.9. The advantage of the delay map is that it learns where persons are moving around in the sequence and therefore adapts the computation by skipping elements in the CM where no person has been detected in the previous frames. The delay map is additionally used to improve the detection rate of the person detector using the HOG feature. For further steps in visual surveillance it is
452
A. Zweng and M. Kampel
beneficial to detect a person in each frame. Person detectors may miss a person in a certain position whose appearance deviates strongly from the trained model. This drawback can be mitigated with tracking algorithms. However, since the delay maps are already computed as part of the performance increase, they can also be used to improve the detection rate. The improvement is applied to detections which return confidences between -1 and 0, as follows: if the current confidence cc is higher than -1.0 and (cc * dm(x, y) > -1.0) || (cc * dm(x, y) < 1.0 && cc * dm(x, y) > 0.0), then a person is detected, where dm(x, y) is the delay map at position (x, y).
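A hedged Python/NumPy sketch of the delay-map bookkeeping for one scale, the probabilistic skipping and the detection-boost rule (border handling of the 4-neighborhood reset is simplified, and the data layout is an assumption):

import numpy as np

def update_delay_map(delay, confidences, inc=0.005, pt=0.1):
    # Grow the delay value where nothing was found; reset it (and its
    # 4-neighborhood, ignoring wrap-around effects at the borders) to zero
    # wherever the classifier fired.
    hit = confidences > 0
    delay = np.where(hit, 0.0, np.minimum(delay + inc, 1.0 - pt))
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        delay[np.roll(hit, (dy, dx), axis=(0, 1))] = 0.0
    return delay

def should_process(delay, rng):
    # Skip an element with probability equal to its delay value, so a value
    # of 0.9 means a 90% chance of being skipped.
    return rng.random(delay.shape) >= delay

def boost_detection(cc, dm_val):
    # Detection-boost rule from the text for confidences between -1 and 0.
    p = cc * dm_val
    return cc > -1.0 and (p > -1.0 or (0.0 < p < 1.0))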
3 Evaluation
Evaluation has been done for computational as well as classification performance. Since the algorithm can also be applied to performance-enhanced implementations using the GPU or multi-core processors, the evaluation was carried out using the person classifier implementation from the OpenCV framework and a modified version of that implementation using our improvements. Different sequences were used for detection and training. The computational performance depends on the number of used CM, which is the reason why we chose the following sequences.

3.1 Low Camera View
A low camera view results in a wide range of detection window sizes: objects close to the camera are much bigger than those in the background. It is therefore necessary to keep more CM for detection than for a small range of detection window sizes. An example image of the used sequence is illustrated in Figure 2(a). The computational performance is shown in Figure 8. The computational performance of the standard implementation (std impl) is around 1 frame per second for this image sequence. Our implementation using the CM (conf maps) runs at around 5 frames per second within a margin of ±0.2 frames per second, while the implementation with the additional performance enhancement using the delay maps (local impr) increases its performance over time. After approximately 300 frames, the performance converged to 5.8 frames per second within a margin of ±0.5 frames per second. The detection rates are shown in Table 1. The detection rate using the implementation with the CM is 0.37839 and therefore worse than that of the standard implementation. This is due to the fact that the search space has been reduced, so that each position in the image is only processed with 1 to 3 search windows (depending on the overlapping region of the binary maps). However, the false positive rate is 0.03171 and therefore much lower than the false positive rate of the standard implementation. The implementation using the delay maps achieves a slightly higher detection performance than the standard implementation but also has a higher false positive rate. This can be explained by the fact that people are
Fig. 8. Computational performance (frames per second over frame number) of the initial implementation (std impl) and the performance enhanced implementations (conf maps, local impr)

Table 1. Classification performance using the histogram of oriented gradients and modifications

Method       Detection rate   False positive rate
std impl     0.46350          0.06518
conf maps    0.37839          0.03171
local impr   0.47001          0.07109
walking and the delay maps are set to zero at the current position of a positive match; the delay maps are then shifted relative to the actual position of the person.

3.2 High Camera View
A high camera view results in a narrow range of detection window sizes, since people are far away from the camera. Compared to a low camera view, fewer CM have to be computed for detection. An example image of the camera view of the used sequence is illustrated in Figure 2(b). The computational performance is shown in Figure 9. The computational performance of the standard implementation (std impl) is around 1 frame per second for this image sequence, the same as for the first image sequence. The performance-enhanced implementation using the CM (conf maps) runs at around 6.8 frames per second within a margin of ±0.25 frames per second, while the implementation with the additional performance enhancement using the delay maps (local impr) increases its performance over time, similar to the first image sequence. After a period of approximately 250 frames, the performance converged to 8.5 frames per second within a margin of ±0.6 frames per second. The performance boost is higher than in the first sequence because, for high views, the range of people sizes decreases and therefore also the number
Fig. 9. Computational performance (frames per second over frame number) of the initial implementation (std impl) and the performance enhanced implementations (conf maps, local impr)

Table 2. Classification performance using the histogram of oriented gradients and modifications

Method       Detection rate   False positive rate
std impl     0.44210          0.03532
conf maps    0.39729          0.02394
local impr   0.43115          0.05182
of used CM. The algorithm can skip the preprocessing steps of the unused CM, such as resizing the image, which is part of the detection process when multiple scales of the search window are used. The detection rates of the initial OpenCV implementation and the modified versions for the second image sequence are shown in Table 2. The relative detection rates of the different modifications of the algorithm are similar to the first sequence. In the second image sequence, the detection performance of the implementation using the delay maps is slightly worse than the initial implementation, while the false positive rate is again higher compared to the initial implementation.
4 Conclusion
We introduced a novel approach to improve the computational performance of detectors using a sliding window. The approach yields worse classification results when using the CM only, and slightly better classification results when using the additional improvement with the delay maps. Computational performance is increased in all cases, since the algorithm narrows the search space for the sliding windows. Additionally, manual parameterization is no longer necessary, since the CM include the spatial information on detection window sizes. Performance-
increasing implementations using the graphics card, multi-core systems, or other algorithmic enhancements such as integral histograms [3] can additionally be combined with our algorithm in order to further increase the performance. However, our goal was to develop a people detection algorithm for low cost hardware, for example for use in a smart camera. Future work includes a solution to the problem of the increased false positive rate when using the delay maps: the position of the person in the next frame has to be predicted in order to handle the shifting of the delay maps, which is already a part of tracking.
References 1. Bauer, S., Kohler, S., Doll, K., Brunsmann, U.: FPGA-GPU architecture for kernel SVM pedestrian detection. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 61–68 (2010) 2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 886–893 (2005) 3. Porikli, F.: Integral Histogram: A Fast Way To Extract Histograms in Cartesian Spaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 829–836 (2005) 4. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8 (2008) 5. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, technical report, AI Center, SRI International (1980) 6. Prisacariu, V., Reid, I.: fastHOG - a real-time GPU implementation of HOG, technical report, Department of Engineering Science, Oxford University (2009) 7. Wang, X., Han, T., Yan, S.: An HOG-LBP human detector with partial occlusion handling. In: 2009 IEEE 12th International Conference on Computer Vision (ICCV 2009), pp. 32–39 (2009) 8. Wilson, T., Glatz, M., Hoedlmoser, M.: Pedestrian Detection Implemented on a Fixed-Point Parallel Architecture. In: Proc. of the ISCE 2009, Tokyo, Japan, pp. 47–51 (2009) 9. Chen, Y.-K., Li, W., Tong, X.: Parallelization of AdaBoost algorithm on multi-core processors. In: 2008 IEEE Workshop on Signal Processing Systems (SiPS 2008), pp. 275–280 (2008)
Monocular Online Learning for Road Region Labeling and Object Detection from a Moving Platform Chung-Ching Lin and Marilyn Wolf School of Electrical and Computer Engineering Georgia Institute of Technology, Atlanta, GA 30332 [email protected], [email protected]
Abstract. An online learning method is proposed for detecting the road region and objects on the road by analyzing videos captured by a monocular camera on a moving platform. Most existing methods for moving-camera detection impose serious constraints or require offline learning. In our approach, the feature points of the road region are learned based on the feature points detected and matched between adjacent frames, without using camera intrinsic parameters or camera motion parameters. The road region is labeled by using the classified feature points. Finally, the feature points on the labeled road region are used to detect the objects on the road. Experimental results show that the method achieves significant object detection performance without further restrictions, and performs effectively in complex detection environments.
1 Introduction
Object detection has been a focus in visual surveillance. Many sophisticated methods for detecting objects have been developed for static cameras (e.g. [1]). But it is difficult to generally apply existing methods to detect objects from videos captured by a gray-level monocular camera on a moving platform. Yamaguchi et al. [2] propose a road region detection method by estimating the 3D position of feature points on the road. Then, feature points and epipolar lines are utilized to detect moving objects. This method assumes that there is no moving obstacle in the initial frame and that the road region in the initial frame is decided according to the height of the camera, measured when the vehicle is stationary. However, when these assumptions are violated, the application of this method is restricted due to the presence of moving obstacles in the initial frame or a change of camera height. Kang et al. [3] use multiview geometric constraints to detect objects. However, the approach is non-causal, since future information is required. Ess et al. [4] develop a robust algorithm for detecting and tracking pedestrians from a mobile platform. However, this algorithm is developed for a stereo rig, and calibration of the stereo rig is required in order to use depth information. Wojek et al. [5] propose a method to perform 3D scene modeling and inference by using a monocular camera in a car. This method uses trained features to label the road and sky, and to detect objects in the scene. However, the features in this method need to be trained offline. One of the main disadvantages of offline training methods is the need to collect and train on data in advance for a specific application.
In order to overcome such problems and generate effective results without the above-mentioned restrictions, a new approach is developed in this paper. We propose an online learning method for detecting the road region and objects on the road without using any camera intrinsic parameters or camera motion parameters. In particular, the online learning method can adapt to various environments, and the method does not require knowledge of the camera parameters. The combination of these strengths enables the proposed method to be generally applied to detect objects from videos captured by a camera on a moving platform. In the following, the algorithm of the proposed method and the experimental results are presented.
2 Overview of Proposed Method
The process flow of the proposed method is shown in Figure 1. The proposed method contains four parts: key feature point learning, feature point classification, road region labeling, and object detection. An online learning method is proposed to detect the road region and the objects on the road. After the key feature points of the road are learned, the features of the key feature points are used to classify the rest of the feature points as either "road" or "non-road". Based on the classification results, the road region is labeled. Then, the objects on the road are detected using the labeled results. First, feature point detection and matching are performed in adjacent frames. A probability model based on the Bayesian rule is proposed to learn the key feature points. The key feature points are then used to classify the rest of the feature points by applying a conditional probability model. Then, the road region boundaries are defined using the classified feature points. Based on the detected road region, the objects on the road are detected by exploiting the outliers among the feature points on the road.
Fig. 1. Flow of proposed method
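As a concrete illustration of the feature point detection and matching step, the following Python sketch detects and matches SURF features between two adjacent frames and returns the matched point pairs together with the motion-vector angles used later by the angle-regularity potential. It is only an illustrative sketch of this step, not the authors' code; SURF is available through the opencv-contrib package, and the Hessian threshold and the brute-force matching strategy are our assumptions.

import cv2
import numpy as np

def match_adjacent_frames(prev_gray, cur_gray, hessian_threshold=400):
    # SURF lives in opencv-contrib; another detector (e.g. ORB) could be substituted
    # if SURF is unavailable in a given OpenCV build.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(cur_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(des1, des2)
    # Matched point pairs (x1, y1, x2, y2) in the two frames.
    pairs = np.array([(kp1[m.queryIdx].pt[0], kp1[m.queryIdx].pt[1],
                       kp2[m.trainIdx].pt[0], kp2[m.trainIdx].pt[1]) for m in matches])
    # Motion-vector angles of the matches, used by the angle-regularity potential.
    angles = np.arctan2(pairs[:, 3] - pairs[:, 1], pairs[:, 2] - pairs[:, 0])
    return pairs, angles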
3 Key Feature Points Learning
In order to develop a probability model to perform the key feature point learning, the characteristics of the road region are considered. In general, fewer feature points can be detected in the road region because the road region is flat and has little texture. Also, fewer feature points are matched in the road region. In other words, the self-similarity of the road region causes a higher rate of mismatched feature points there. As a result, the matched feature motion vectors have less angle regularity in the road region. Therefore, we use the density of matched feature points and the angle regularity of matched feature points to learn the characteristics of the feature points on the road.
Based on the Bayesian rule, the posterior distribution for the scene state $X$ given image evidence $\eta$, in terms of a prior $P(X|\varsigma)$ and an observation model $P(\eta|X,\varsigma)$, is defined as
$$P(X|\eta,\varsigma) \propto P(X|\varsigma) \cdot P(\eta|X,\varsigma), \qquad (1)$$
where $\varsigma$ is the pixel position. The scene state $X$ consists of the states of the road region. The goal of this work is to infer the state $X$ from video captured by a monocular, forward-facing camera in a car. The camera is uncalibrated, and the camera motion parameters are unknown. Meanwhile, we avoid estimating the background structure of the scene. Without knowing any intrinsic or extrinsic parameters, the algorithm is developed using the characteristics of the feature points. Because the camera is forward-facing, the probability of the road region $P(X|\varsigma)$ can be assumed to follow a normal distribution with mean at the bottom of the image:
$$P(X|\varsigma) \propto N(V; \mu_V, \sigma_V), \qquad (2)$$
where $V$ is the vertical position of $\varsigma$. The observation model $P(\eta|X,\varsigma)$ fuses the feature density and angle regularity properties of the matched feature points:
$$P(\eta|X,\varsigma) = \psi(d|\varsigma) \cdot \psi(\omega|\varsigma). \qquad (3)$$
The feature density potential $\psi(d|\varsigma)$ models the density of matched feature points given the pixel position $\varsigma$. It is defined as
$$\psi(d|\varsigma) = e^{\bar{\kappa}_\varsigma}, \qquad (4)$$
where $\bar{\kappa}_\varsigma$ is the number of matched feature points within the window $W_\varsigma$ with size $w_s$ and position $\varsigma$. The angle regularity potential $\psi(\omega|\varsigma)$ describes how well the matched feature points around pixel position $\varsigma$ satisfy the angle regularity. It is defined as
$$\psi(\omega|\varsigma) = e^{-\Delta\theta} = e^{-\sum_{n_i \in W_\varsigma} |\theta_{n_i} - \bar{\theta}_\varsigma|}, \qquad (5)$$
where $\bar{\theta}_\varsigma$ is the average angle of the feature motion vectors within the window $W_\varsigma$. The inference probability can be defined as
$$\tilde{P}(X|\eta,\varsigma) = P(X|\varsigma) \cdot P(\eta|X,\varsigma). \qquad (6)$$
$\hat{P}(X|\eta,\varsigma)$ is the normalized form of $\log(\tilde{P}(X|\eta,\varsigma))$ and is used to learn the key feature points:
$$\hat{P}(X|\eta,\varsigma) = \frac{\log(\tilde{P}(X|\eta,\varsigma)) - \min(\log(\tilde{P}(X|\eta,\varsigma)))}{\max(\log(\tilde{P}(X|\eta,\varsigma))) - \min(\log(\tilde{P}(X|\eta,\varsigma)))}. \qquad (7)$$
The key feature points $\tau_i$ are defined as the matched feature points with $\hat{P}(X|\eta,\varsigma)$ smaller than the threshold $T_k$:
$$\tau = \{ n_j : \hat{P}(X|\eta,n_j) < T_k, \ \forall j \}, \qquad (8)$$
where $n_j$ is the $j$-th detected feature point.
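The key feature point learning of Eqs. (1)–(8) can be sketched in a few lines of Python. The sketch below is our reading of the model, not the authors' implementation: the spread of the vertical prior ($\sigma_V$) is an assumed value, and image coordinates are taken with $y$ increasing downwards, so the bottom of the image corresponds to $y$ near the image height.

import numpy as np

def learn_key_points(points, angles, img_height, ws=60, tk=0.5, sigma_v=None):
    # points: (M, 2) positions (x, y) of the matched feature points in the current frame.
    # angles: (M,) motion-vector angles of the matches.
    sigma_v = sigma_v if sigma_v is not None else img_height / 3.0  # assumed prior spread
    xs, ys = points[:, 0], points[:, 1]
    log_p = np.empty(len(points))
    for i, (x, y) in enumerate(points):
        # Matched points inside the ws x ws window W centered at the point.
        in_win = (np.abs(xs - x) <= ws / 2) & (np.abs(ys - y) <= ws / 2)
        kappa = np.count_nonzero(in_win)                                  # density, Eq. (4)
        dtheta = np.sum(np.abs(angles[in_win] - angles[in_win].mean()))   # regularity, Eq. (5)
        log_prior = -0.5 * ((y - img_height) / sigma_v) ** 2              # vertical prior, Eq. (2)
        log_p[i] = log_prior + kappa - dtheta                             # log of Eq. (6)
    p_hat = (log_p - log_p.min()) / (log_p.max() - log_p.min() + 1e-12)   # Eq. (7)
    return points[p_hat < tk]                                             # Eq. (8)

The threshold tk = 0.5 and the window size ws = 60 follow the values reported in the experiments (Section 7).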
4 Feature Point Classification
After the key feature points $\tau$ are learned, the characteristics of the key feature points are exploited to classify the rest of the feature points. In this paper, a cascade framework is adopted to classify the feature points, as in [6]. A cascade classifier can increase detection performance and radically reduce computational time. In offline training methods, classifiers are trained with annotated data by using techniques such as SVM or AdaBoost. Those techniques are not appropriate in our case, because we do not have annotated data. We classify feature points using a particle filter: every feature of the key feature points is treated as an equally weighted particle in the probability model. Our cascade classifier is shown in Figure 2. Two classifiers are cascaded: one uses the coefficients of the Walsh-Hadamard transform, and the other uses the coefficients of the Haar wavelet transform. The popular HOG feature is not used in our classifier because the road region does not have rich texture. The coefficients of the Walsh-Hadamard transform (WHT) [7], and the diagonal, horizontal and vertical coefficients of the Haar wavelet transform (HWT) [8], are computed as the features for classification. For classification purposes, conditional inference probability models are applied to infer the likelihood between feature points.
Fig. 2. Cascade Classifier
The logarithm of the conditional inference probability for the WHT feature is defined as
$$-\log(P_{WH}(r|n_j,\{\tau_i\})) = \sum_i \| f_{WH}(\tau_i) - f_{WH}(n_j) \|_1, \qquad (9)$$
where $f_{WH}(\tau_i)$ is the WHT feature at the position $\tau_i$ and $\|\cdot\|_1$ is the 1-norm. The logarithm of the conditional inference probability model for the HWT feature is defined as
$$-\log(P_{HW}(r|n_j,\{\tau_i\})) = \sum_i \| f_{HW}(\tau_i) - f_{HW}(n_j) \|_1, \qquad (10)$$
where $f_{HW}(\tau_i)$ is the HWT feature at the position $\tau_i$. In the first classifier, the logarithms of the conditional inference probabilities are used to classify the detected feature points. $\gamma_{WH}$ is the set of feature points that are classified as feature points on the road using WHT features:
$$\gamma_{WH} = \{ n_j : -\log(P_{WH}(r|n_j,\{\tau_i\})) < T_{WH}, \ \forall j \}. \qquad (11)$$
In the second classifier, the outputs of the first classifier are further classified using HWT features:
$$\gamma = \{ \gamma_{WH}^j : -\log(P_{HW}(r|\gamma_{WH}^j,\{\tau_i\})) < T_{HW}, \ \forall j \}, \qquad (12)$$
where $\gamma$ is the set of feature points that are classified as feature points on the road.
WHT features are computed from the first 16 coefficients of the Walsh-Hadamard transform. This transform is a discrete approximation of the cosine transform and can be computed efficiently. Before the WHT features are calculated, the input image is normalized to zero mean and unit variance. Haar wavelets were introduced by Papageorgiou and Poggio [8] for people detection. The diagonal, horizontal and vertical coefficients of the Haar wavelet transform are used as the HWT features; they are computed from the absolute responses of the horizontal, vertical and diagonal wavelet types.
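The cascade of Eqs. (9)–(12) can be illustrated as follows. This Python sketch is an approximation under stated assumptions: the patch side is assumed to be a power of two so that SciPy's Hadamard matrix applies, the coefficient ordering of the WHT and the exact Haar decomposition may differ from the authors' implementation, the candidate patches are assumed to be extracted around each detected feature point, and the thresholds reported in Section 7 presumably apply after a feature normalization not detailed in the text.

import numpy as np
from scipy.linalg import hadamard

def wht_feature(patch, n_coeffs=16):
    # Zero-mean, unit-variance normalization as described in the text.
    patch = (patch - patch.mean()) / (patch.std() + 1e-8)
    h = hadamard(patch.shape[0])
    coeffs = h @ patch @ h.T
    return coeffs.flatten()[:n_coeffs]

def hwt_feature(patch):
    # One-level Haar-like decomposition on 2x2 blocks; absolute horizontal,
    # vertical and diagonal detail responses are kept, as described in the text.
    a = patch[0::2, 0::2]; b = patch[0::2, 1::2]
    c = patch[1::2, 0::2]; d = patch[1::2, 1::2]
    lh = np.abs(a - b + c - d).ravel()
    hl = np.abs(a + b - c - d).ravel()
    hh = np.abs(a - b - c + d).ravel()
    return np.concatenate([lh, hl, hh]) / 4.0

def cascade_classify(candidates, key_wht, key_hwt, t_wh, t_hw):
    # candidates: list of (point, patch); key_wht/key_hwt: features of the key points tau.
    road = []
    for pt, patch in candidates:
        if sum(np.abs(k - wht_feature(patch)).sum() for k in key_wht) >= t_wh:   # Eqs. (9), (11)
            continue
        if sum(np.abs(k - hwt_feature(patch)).sum() for k in key_hwt) >= t_hw:   # Eqs. (10), (12)
            continue
        road.append(pt)
    return road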
5 Road Region Decision
After the feature points on the road $\gamma$ are classified, they are used to define the boundaries of the road region. Then, a road-labeled map can be generated from the boundaries of the road region. In this paper, we focus on the application of front-facing cameras in a car. The road region on the image plane is non-increasing from bottom to top. An algorithm is developed to define the boundaries of the road region. The car is moving forward; therefore, the region closer to the bottom of the image plane has a higher probability of being road. First, the boundaries of the road region are decided from bottom to top, in order. During the first $k$ steps, the boundaries are decided purely by the region of feature points on the road. After the first $k$ steps, the boundaries of the road region are decided with consideration of the previous boundaries. Objects will affect the decision of the road boundaries if they are on the road. The following procedure is designed to prevent the feature points on the objects from affecting the boundary decision. If the boundaries shrink too much, we search for feature points in $\gamma_{HW}$ within the previous boundary plus a margin. If there is no feature point within that region, the boundary is set by a portion of the previous shrinking rate. In Algorithm 1, $LB_j$ and $RB_j$ represent the left boundary and the right boundary at the $j$-th step, $m$ is a margin, and $\alpha$ is a positive scalar smaller than one. $\gamma_{HW,i}^{(x)}$ is the horizontal position of the feature point $\gamma_{HW,i}$. $\{\gamma_i\}_j$ is defined as
$$\{\gamma_i\}_j = \{ \gamma_i : \gamma_i^{(x)} \in [LB_{j-1} - m,\ RB_{j-1} + m],\ \forall i \}. \qquad (13)$$
After all boundaries are defined based on Algorithm 1, the boundaries are smoothed.
6 Object Detection
After the road region is defined, the objects on the road can be identified by the outliers among the feature points in the road region. However, the feature points on the road stripes are outliers as well. A filtering method can be used to separate the outliers on objects from those on the road stripes. We apply a 2D rectangle filter to the outlier map, the inlier map, and the road region map. The output of the filtering is used to filter out the outlier feature points on the road stripes. The filtered feature point outliers are then clustered. Clusters smaller than the threshold $T_s$ are discarded. The hierarchical clustering method in [9] is then used to group the feature points on the objects.
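The following Python sketch illustrates one possible realization of this filtering-and-clustering step. The rectangle filter is implemented with a uniform (box) filter, and the rule used to suppress outliers on road stripes (a neighborhood that lies on the road and is not dominated by inliers) is our own heuristic stand-in for the authors' filter-output combination, which is not fully specified in the text; the box size and the cut distance are assumed values.

import numpy as np
from scipy.ndimage import uniform_filter
from scipy.cluster.hierarchy import fcluster, linkage

def detect_objects(outlier_pts, outlier_map, inlier_map, road_map,
                   box=15, ts=3, cut_dist=40.0):
    # Box (rectangle) filtering of the binary outlier, inlier and road-region maps.
    o = uniform_filter(outlier_map.astype(float), box)
    i = uniform_filter(inlier_map.astype(float), box)
    r = uniform_filter(road_map.astype(float), box)
    keep = []
    for x, y in outlier_pts.astype(int):
        # Heuristic: keep an outlier whose neighborhood is on the road and is not
        # dominated by inliers (isolated outliers on road stripes are discarded).
        if r[y, x] > 0.5 and o[y, x] > i[y, x]:
            keep.append((x, y))
    keep = np.array(keep)
    if len(keep) < ts:
        return []
    # Hierarchical (agglomerative) clustering of the remaining outliers;
    # clusters smaller than Ts are discarded.
    labels = fcluster(linkage(keep, method='single'), t=cut_dist, criterion='distance')
    clusters = [keep[labels == lab] for lab in np.unique(labels)]
    return [c for c in clusters if len(c) >= ts]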
Algorithm 1. Road Boundary Decision
first k steps: road region is decided by the classified feature points {γ}
for j = k + 1 to height/step do
  search {γ_i}_j
  LB_j = min({γ_i^(x)}), RB_j = max({γ_i^(x)})
  if LB_j − LB_{j−1} > th_b then
    if search γ_{HW,i}^(x) from LB_{j−1} − m to RB_{j−1} + m then
      LB_j = γ_{HW,i}^(x)
    else
      LB_j = LB_{j−1} + α · (LB_{j−1} − LB_{j−2})
    end if
  end if
  if RB_{j−1} − RB_j > th_b then
    if search γ_{HW,i}^(x) from RB_{j−1} + m to RB_{j−1} − m then
      RB_j = γ_{HW,i}^(x)
    else
      RB_j = RB_{j−1} − α · (LB_{j−2} − LB_{j−1})
    end if
  end if
  if no γ_i is found within [LB_{j−1} − m, RB_{j−1} + m] then
    LB_j = LB_{j−1}, RB_j = RB_{j−1}, break
  end if
end for
7 Experiments
This section presents the experimental results obtained with the proposed method. The video streams were captured by a hand-held camera in a forward-moving car. The car speed is about 10 to 35 MPH. The videos are recorded at a frame rate of 10 Hz and a resolution of 640x480 pixels. Because the road is uneven and the human hand is unstable, the captured video streams contain many sudden irregular movements. The relative movements between objects and the camera are complex and change rapidly. In these experiments, $T_k$ is 0.5, $T_{WH}$ is 0.75, $T_{HW}$ is 2, $m$ is 60, step is 20, $w_s$ is 60, and $\alpha$ is 0.7. The feature points are detected and matched by the SURF algorithm [10]. Figure 3(a) shows the matched feature points. As one can see, the road region has fewer matched feature points and a higher mismatching rate, because the road region is flat and has less texture. This figure demonstrates the characteristics of the feature points on the road which we utilize to develop the learning algorithm. These matched feature points are then used to calculate the inference $\hat{P}(X|\eta,\varsigma)$. Figure 3(b) shows the distribution of $\hat{P}(X|\eta,\varsigma)$. In the video of experiment 1, a car in front of the camera is moving forward. In the video of experiment 2, two cars in front of the camera are moving forward. In the video of experiment 3, a car is moving forward and another car is moving toward the camera. Figures 4, 5 and 6 show the results of experiments 1, 2 and 3, respectively. The original images are shown in Figures 4(a), 5(a), and 6(a). Figures 4(b), 5(b), and 6(b) show the
Fig. 3. Experiment: (a) matched feature points, (b) the distribution of $\hat{P}(X|\eta,\varsigma)$
Fig. 4. Experiment 1 (a) original image, (b) matched feature points, (c) key feature points, (d) feature points on the road region, (e) road region, (f) detection
matched feature points. The black-starred feature points are the feature points detected and matched using the SURF algorithm. The black-starred feature points in Figures 4(c), 5(c), and 6(c) show the learned key feature points $\tau$. Our learning process provides more reliable and representative feature points for classification. Therefore, as one can
Fig. 5. Experiment 2 (a) original image, (b) matched feature points, (c) key feature points, (d) feature points on the road region, (e) road region, (f) detection
see, the numbers of learned key feature points are relatively small in comparison with the number of feature points in the road region. The learned key feature points $\tau$ are used to classify the rest of the detected feature points $n_i$. Figures 4(d), 5(d), and 6(d) show the classified feature points. The black-starred feature points are the feature points classified as points on the road $\gamma$. In these figures, most feature points on the road are classified correctly. Feature points on the objects and some feature points on the road markings are classified as outliers. Figures 4(e), 5(e), and 6(e) show the results of the detected road region. The road region is marked with black dots and is defined by the classified feature points. As one can see, although some classified feature points are not on the road, the road region can still be decided correctly. Detected objects are shown in Figures 4(f), 5(f), and 6(f). These figures show that the feature points on the road markings are filtered out successfully, and the objects are detected. As the experimental results show, the proposed method can successfully detect single or multiple objects on the road. In addition, whether the objects are moving forward or
Fig. 6. Experiment 3 (a) original image, (b) matched feature points, (c) key feature points, (d) feature points on the road region, (e) road region, (f) detection
moving toward the camera, the proposed method is able to deliver significant detection results. After the objects on the road are detected, they can be accurately tracked by using the method proposed in [11].
8 Conclusions and Discussion
In this paper, we have proposed a novel method to effectively detect objects on the road from videos captured by a camera on a moving platform. The road region can be detected without using any camera intrinsic or motion parameters. Experimental results show that the proposed method has significant detection performance. There is no need to impose initial assumptions or to use future frame information in the detection algorithm. Moreover, the online learning method can adapt to various environments. Thus, the proposed method could be generally applied to detect objects under irregular camera movement and in complex environments. Future research is aimed at integrating object detection and tracking systems for moving cameras.
References 1. Li, L., Huang, W., Gu, I., Tian, Q.: Foreground object detection from videos containing complex background. In: Proceedings of the ACM International Conference on Multimedia, pp. 2–10 (2003) 2. Yamaguchi, K., Kato, T., Ninomiya, Y.: Vehicle ego-motion estimation and moving object detection using a monocular camera. In: IEEE International Conference on Pattern Recognition, vol. 4 (2006) 3. Kang, J., Cohen, I., Medioni, G., Yuan, C.: Detection and tracking of moving objects from a moving platform in presence of strong parallax. In: IEEE International Conference on Computer Vision (2005) 4. Ess, A., Leibe, B., Schindler, K., Van Gool, L.: Robust Multi-Person Tracking from a Mobile Platform. Pattern Analysis and Machine Intelligence 31, 1831–1846 (2009) 5. Wojek, C., Roth, S., Schindler, K., Schiele, B.: Monocular 3D scene modeling and inference: Understanding multi-object traffic scenes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 467–481. Springer, Heidelberg (2010) 6. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Computer Vision 57, 137–154 (2004) 7. Alon, Y., Ferencz, A., Shashua, A.: Off-road Path Following using Region Classification and Geometric Projection Constraints. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, (IEEE) pp. 689–696 (2006) 8. Papageorgiou, C., Poggio, T.: A trainable system for object detection. International Journal of Computer Vision 38, 15–33 (2000) 9. Lin, C., Wolf, M.: Belief Propagation for Detecting Moving Objects from a Moving Platform. In: International Conference on Image Processing, Computer Vision, and Pattern Recognition (2010) 10. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. Computer Vision and Image Understanding 110, 346–359 (2008) 11. Lin, C., Wolf, W.: MCMC-based Feature-guided Particle Filtering for Tracking Moving Objects from a Moving Platform. In: IEEE International Conference on Computer Vision Workshop (2009)
Detection and Tracking Faces in Unconstrained Color Video Streams Cornélia Janayna P. Passarinho, Evandro Ottoni T. Salles, and Mário Sarcinelli-Filho Universidade Federal do Espírito Santo, Campus de Goiabeiras, Avenida Fernando Ferrari, s/n, 29075-910, Vitória, ES, Brasil {janayna,evandro,mario.sarcinelli}@ele.ufes.br
Abstract. This paper proposes a method combining local SVM classifiers and a Kalman filter to track faces in color video sequences, which is referred to as the Dynamic Local Support Vector Tracker (DLSVT). The adjacent locations of the target point are predicted in a search window, reducing the number of image regions that are candidates to be faces. Thus, the method can predict the object motion more accurately. The architecture presented good results for both indoor and outdoor unconstrained videos, considering multi-view scenes containing partial occlusion and bad illumination. Moreover, the reduction of the image area in which the face is searched for results in a method that is faster, besides being precise.
1 Introduction
Human-face detection and tracking plays an important role in many applications, such as video surveillance, face recognition, and face identification [1]. Previous works consider mainly the detection and tracking of frontal faces [2],[3],[4],[5]. Such a restriction may limit their practical use, because faces in images can occur in various poses, like in-plane or out-of-plane rotations, or under various conditions, such as different lighting, facial expressions and occlusions. So, the visual appearance and features of faces can vary enormously depending on the environment in which the image is captured. For instance, Viola and Jones [5] use a scheme in which the computation time is reduced, with the disadvantage that it is extremely difficult to get good performance when the face is not in frontal view. Another restriction in the works available in the literature is related to the target to be detected. Several researchers have detected faces by combining color-based methods to obtain high performance and high speed [6]. The advantages are that such methods are fast and have a high detection ratio, although they are limited in the presence of varying lighting and of objects having a color similar to that of the target (the face to be detected). Many papers present feature-based methods to detect faces [7],[8]. Specifically, feature-based face detection demands a huge computational effort, resulting in low-speed operation. In those cases, the problem of detecting faces has been replaced by the problem of detecting multiple, similarly complex and deformable, parts of a face [8]. Such methods are useful for facial analysis and feature correspondence
in face identification, because detection and alignment of facial features demand images of relatively high spatial resolution. However, in dynamic scenes, face detection often needs to be achieved at a much lower resolution. Occlusions caused by changes in the viewpoint are the main problem of local feature-based approaches, because correspondences between certain features do not exist under occlusion. In this paper, a face detection and tracking algorithm, the Dynamic Local Support Vector Tracker (DLSVT), is proposed to detect human faces in color images under poor lighting conditions and different views. This approach does not use a face color model or deformable face parts to find faces in an unconstrained video. Instead, the face image itself is the feature considered for SVM (Support Vector Machine) training. Several papers in the literature use SVM to detect faces in a video sequence. However, they use gray-level videos, disregarding the constraints of the real world. Such methods do not address the ill-posed problem of illumination change, for instance. These methods also perform face detection in a sequence of images, but do not consider the displacement of individuals in the video. Differently, DLSVT deals with the problem of partial occlusion along with face tracking in a video. In the first frame, as a previous estimate of the face position is not available, the face is searched for only in the image regions of skin color. The estimate of the face location for the next frame is then obtained by a Kalman filter. In order to decrease the computational effort, a reduced search window method is also proposed. In such a case, once the algorithm finds a face, the next search will use the reduced window. The reduced window decreases the search region for faces, because the search for face pixels is performed only in the reduced skin-pixel image window. The prediction function of the Kalman filter estimates the face location within the skin-pixel search window, thus increasing the tracking rate and also enhancing the tracking performance. Lighting compensation is also used to improve the performance of the framework. The result is a method that is effective under facial changes, such as eye-closing and glass-wearing, for faces having distinct profiles and under brightness variation. Finally, to validate the proposed architecture, tracking results obtained with the proposed method applied to unconstrained outdoor and indoor video sequences are presented (it is also worth emphasizing that the tests presented here were performed on poor-resolution color video sequences). The proposed method, the DLSVT, not only deals with the problem of partial occlusion, but also tracks the face of interest, thus being more complete than the method proposed in [9], for instance. That method uses SVM and a particle filter to detect and track faces, restricted to gray-level image sequences. Moreover, it is evaluated only under partial occlusion. Situations like faces in profile, tilted faces or faces at different scales are not considered. The preprocessing step of DLSVT is simpler than, and as effective as, in terms of bad lighting compensation, the one presented in [10]. That proposal uses a combination of GMM (Gaussian Mixture Models), a background subtraction approach, frame subsampling, and skin color detection along with mathematical morphology operations in the YCbCr color space. The two-step DLSVT image pre-processing is accomplished by using the RGB color space only, as stressed in Section 2.
Thus, it is not necessary to transform the RGB color space into any other one. DLSVT uses a combination of face detection through local SVM and a Kalman filter to track the faces of interest. The assumption of uniform displacement of individuals in the videos is
enough to obtain satisfactory results. Thus, it is not necessary to apply a particle filter, whose computational cost is higher than that of the Kalman filter. The paper is hereinafter split into a few sections to address the above-mentioned topics. In Section 2, face detection is discussed. In the sequel, Section 3 briefly describes SVM, whereas the complete face detector and tracker is presented in Section 4. Finally, experimental results and conclusions are presented in Sections 5 and 6, respectively.
2 Face Detection
In this work a pre-processing method using only the RGB color space is proposed to detect skin regions in the image being analyzed. Afterwards, such regions are analyzed to find a face or faces. First of all, considering that the light reflected by the objects or persons in a scene varies with the illuminant, the method proposed in [11] is applied to each frame in the video sequence under analysis, to achieve a less unstable object color perception. Such method is based on the equation
$$S = \frac{C_{std}}{C_{avg}}, \qquad (1)$$
where $S$ is a scale factor for one specific color channel (R, G or B), and $C_{std}$ and $C_{avg}$ are, respectively, the standard mean gray value and the mean value of the specific channel. After such a pre-processing step, the algorithm found in [12], proposed to detect the skin region in a color image, uses thresholds on the RGB values of each pixel in the image to identify the skin regions. The thresholds there proposed, here applied to each image pixel, are 95 for R, 40 for G and 20 for B (R, G and B represent the value of the pixel in the respective RGB color channel, and their values range from 0 to 255). Next, if the absolute difference between the R and G values is higher than 15 and the R value is higher than both the G and the B values, the pixel under consideration is classified as skin. However, it is worth mentioning that this method does not present good performance without the previous step of light compensation. Another aspect deserving mention is that in this paper not only Caucasian skin tones are considered, as is the case in [13]. Next, a detection algorithm is used over the areas detected as skin-color regions, to improve the performance of the tracking step. After the image preprocessing stage, a Gabor filter bank is used to extract features (Gabor features are effective in 2D object detection and recognition, according to [14]). The outputs of the Gabor filters are presented to global and local SVM kernels to detect the faces. The Gabor features are defined by
$$\psi(\mathbf{x}) = \frac{1}{2\pi\sigma^2} \exp\left( \frac{-\|\mathbf{x}\|^2}{2\sigma^2} \right) \exp\left[ j 2\pi (\mathbf{w}^T \mathbf{x} + \phi) \right], \qquad (2)$$
where $\mathbf{x} = (x, y)^T$, $\phi = \mu\pi/4$, $\mathbf{w} = (U, V)^T$, $j = \sqrt{-1}$ and $\mu = 0, \ldots, 3$.
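A minimal Python sketch of the pre-processing stage described above is given below; it follows Eq. (1) and the RGB skin rule quoted from [12]. The standard mean gray value $C_{std}$ is not specified numerically in the text, so the value 128 used here is an assumption.

import numpy as np

def compensate_lighting(img_rgb, c_std=128.0):
    # Scale every RGB channel by S = C_std / C_avg, as in Eq. (1).
    img = img_rgb.astype(np.float32)
    for c in range(3):
        img[:, :, c] *= c_std / (img[:, :, c].mean() + 1e-6)
    return np.clip(img, 0, 255).astype(np.uint8)

def skin_mask(img_rgb):
    # Skin rule from [12]: R > 95, G > 40, B > 20, |R - G| > 15, R > G and R > B.
    r = img_rgb[:, :, 0].astype(int)
    g = img_rgb[:, :, 1].astype(int)
    b = img_rgb[:, :, 2].astype(int)
    return ((r > 95) & (g > 40) & (b > 20) &
            (np.abs(r - g) > 15) & (r > g) & (r > b))

As noted in the text, the skin rule performs well only after the lighting compensation step, e.g. skin_mask(compensate_lighting(frame)).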
In [9] it is reported that Gabor features of only one frequency level lead to good performance in face recognition. Therefore, in the experiments reported here, Gabor filters with four different orientations (µ = 0,…,3) and one frequency level, w, are used, for the sake of speeding up the recognition task. The size of the Gabor filters was set to 31×41×4 pixels, where 31×41 is the dimension of the face images considered (initial experiments used 15×15-pixel training images, in which the faces were more tightly cropped, but obtained slightly worse results). The positions x that give large Gabor outputs differ depending on the orientation parameter φ of the Gabor filter. Thus, Gabor properties are suitable for enhancing the recognition of different target poses in a video sequence.
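The Gabor filter bank of Eq. (2) can be approximated with OpenCV's built-in Gabor kernels, as in the sketch below. The mapping of the paper's parameters $(U, V)$ and $\phi$ onto OpenCV's wavelength, orientation and phase parameters, the values of sigma and wavelength, and the width/height orientation of the 31×41 crops are assumptions; the paper fixes only the four orientations and a single frequency level.

import cv2
import numpy as np

def gabor_bank(ksize=(31, 41), sigma=4.0, wavelength=8.0):
    # Four orientations (mu = 0..3), one frequency level.
    kernels = []
    for mu in range(4):
        theta = mu * np.pi / 4.0
        # Positional arguments: ksize, sigma, theta, lambda, gamma, psi.
        kernels.append(cv2.getGaborKernel(ksize, sigma, theta, wavelength, 1.0, 0.0))
    return kernels

def gabor_features(face_gray, kernels):
    # Responses of the four filters stacked into one feature volume per face crop.
    face = cv2.resize(face_gray, (31, 41)).astype(np.float32)
    return np.stack([cv2.filter2D(face, cv2.CV_32F, k) for k in kernels], axis=-1)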
3 Support Vector Machine
In this paper, SVM is used to separate face and non-face samples, due to its well-known higher performance, compared to ANN (Artificial Neural Network) methods, regarding binary classification [15]. SVM determines the optimal hyperplane that maximizes the distance between the hyperplane and the nearest sample, called the margin [14]. When the training set (labeled samples) is denoted as $S = ((\mathbf{z}_1, y_1), \ldots, (\mathbf{z}_L, y_L))$, the optimal hyperplane is defined by
$$f(\mathbf{z}) = \sum_{i \in SV} \alpha_i y_i K(\mathbf{z}_i, \mathbf{z}) + b, \qquad (3)$$
where $SV$ is the set of support vectors, $b$ is the threshold and $\alpha_i$ is the solution of a quadratic programming problem. The training samples with non-zero $\alpha_i$ are called support vectors. $K(\mathbf{z}_i, \mathbf{z})$ is the inner product $\Phi(\mathbf{z}_i)^T \Phi(\mathbf{z})$ between the support vector $\mathbf{z}_i$ and the input vector $\mathbf{z}$ in a high-dimensional space. In our implementation, the normalized linear kernel is adopted as the kernel function, which is defined as
$$K(\mathbf{z}, \mathbf{y}) = \frac{\mathbf{z}^T \mathbf{y}}{\|\mathbf{z}\| \, \|\mathbf{y}\|}. \qquad (4)$$
In order to use local kernels in SVM, a kernel value $K(\mathbf{z}_i, \mathbf{z})$ is computed from local kernels $K_p(\mathbf{z}_i(p), \mathbf{z}(p))$ arranged at all positions of target recognition. We consider the local summation kernel to be better than the local product kernel, because in the local product kernel, if some local kernels give low values, the product kernel value becomes low. This means that the product kernel is more influenced by noise or occlusion. On the other hand, the local summation kernel is not strongly influenced when some local kernels give low values, which means that it is more robust to occlusion. Therefore, the local summation kernel is selected for use in this paper; its dimension is 9×9×4, where 9×9 is a patch of the face image, evaluated over the 4 orientations of the Gabor filter. Both the local summation kernel and the global linear kernel are considered for use in the DLSVT. The decision function of SVM with the local summation kernel is defined by
$$f(\mathbf{z}) = \sum_{i \in SV} \alpha_i y_i \frac{1}{N} \sum_{p} K_p(\mathbf{z}_i(p), \mathbf{z}(p)) + b, \qquad (5)$$
where $N$ is the number of local kernels. From equation (5), one can see that the mean of the local kernels is used as the kernel value. Finally, to implement the SVM classifier, the well-known SVMlight library [16] was chosen.
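A sketch of the local summation kernel and the resulting decision function is given below. The use of non-overlapping 9×9 patches is an assumption (the text fixes the patch size but not the patch arrangement), and the normalized linear kernel of Eq. (4) is applied to each local patch.

import numpy as np

def local_summation_kernel(z_i, z, patch=9):
    # z_i, z: Gabor feature volumes (H x W x 4) of a support vector and an input sample.
    h, w = z.shape[:2]
    vals = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            a = z_i[y:y + patch, x:x + patch].ravel()
            b = z[y:y + patch, x:x + patch].ravel()
            # Normalized linear kernel of Eq. (4) on the local patch.
            vals.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.mean(vals))                       # mean of local kernels, Eq. (5)

def decision_function(z, support_vectors, alphas, labels, b):
    # f(z) of Eq. (5): weighted sum of local summation kernels over the support vectors.
    return sum(a * y * local_summation_kernel(sv, z)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b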
4 Tracking a Detected Face
The Kalman filter [17] is the filter most commonly used to solve problems of optimum estimation. By using the Kalman filter, the posterior location of the face in the next frame is predicted based on the current position information. This step avoids the need to search for the face in the entire image. At each time instant it is assumed that the face is moving with a constant velocity, which does not represent a problem in most cases of face tracking, since usually people do not move abruptly. Nevertheless, in this work face tracking tests with abrupt motion also present satisfactory results. In this paper, the face detector and the face tracker are used simultaneously to implement the Dynamic Local Support Vector Tracker, which is described by the following 4 steps:
Step 1. In the first frame, the face is detected automatically. Another characteristic of the tracking method is that the face is not searched for over the whole frame. As a previous estimate of the face position is not available, because this is the first frame of the sequence, the face is searched for only in the image regions of skin color. This assumption decreases the computational effort. The face thus detected becomes the current observation for the Kalman filter, and it is obtained by applying the SVM in each skin color region. Therefore, the skin color surrounding the face comprises all the regions of skin in the frame;
Step 2. The estimate of the face location for the next frame is then obtained by the Kalman filter;
Step 3. A new observation is acquired at the point estimated in the previous step. If this new observation is not obtained, a search is performed in a window in the skin color vicinity, centered at the estimated position, using the SVM again (such a search vicinity is bounded by a window of 80×60 pixels);
Step 4. If the target is detected in the region of interest, the algorithm returns to Step 2. Otherwise, the algorithm returns to Step 1, to get a new initial observation.
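The constant-velocity Kalman filter assumed in this section can be written compactly as below. The process and measurement noise magnitudes are assumed values, not taken from the paper; the state is the face center and its velocity, and predict() provides the center of the 80×60 search window used in Step 3.

import numpy as np

class ConstantVelocityKalman:
    # State [x, y, vx, vy] with a constant-velocity motion model.
    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q     # process noise (assumed magnitude)
        self.R = np.eye(2) * r     # measurement noise (assumed magnitude)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]          # predicted face center for the next frame

    def correct(self, zx, zy):
        z = np.array([zx, zy])     # face center observed by the SVM detector
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]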
5 Results
Face detection has two measures for evaluation: the false positive rate (FPR) and the true positive rate (TPR). A false positive means that a non-face sample is misclassified as belonging to the face class. A true positive means that a face sample is correctly classified. To evaluate these two measures simultaneously, a Receiver Operating Characteristic (ROC) curve is used. Therefore, the performance of a classifier becomes a curve in
Fig. 1. ROC curve for (a) the global kernel, and (b) the local linear kernel
the FPR-TPR plane. Here the SVM with the global kernel and the SVM with the summation of local kernels are evaluated (Figure 1 shows the ROC curves for these two cases). The horizontal axis shows FPR and the vertical axis shows TPR. A high TPR and a low FPR mean good performance; therefore, the upper-left curve corresponds to the best one. The ROC curve in Figure 1(a) shows that the SVM with the global kernel outperforms the one based on the summation of local kernels. From the tests reported in the sequel, it can be seen that the use of the global features gives the best accuracy under view, illumination and scale changes. In other words, the effectiveness of the proposed DLSVT method is verified. The size of the test images used is 240×320 pixels. Two of the test video sequences used were captured with a common camera in an indoor and an outdoor environment (see Figures 2(c) and 2(a), respectively), and the third one is the HONDA video sequence [18] (see Figure 2(b)). Such video sequences were chosen because they present complex face movement, scale variation, partial occlusion and face view changes. For training the classifier, face and non-face images of 31×41 pixels, taken from videos and some face databases [6], are used. The face regions of these images are cropped by using the position of the nose. In the sequel, four Gabor features are obtained from each image. Next, we prepare the face and non-face images for training the SVM. In this experiment, Gabor features are used, and global and local SVMs are applied to each one of the outputs of the Gabor filters. In spite of all the image changes along the video sequences used in the test of the proposed face tracker, due to body movements, light intensity changes and even partial occlusion, it was able to effectively track the face of a person. The results are also satisfactory for the sequence in Figure 2(a), which corresponds to an outdoor environment, where light conditions are quite variable and shadows constantly appear in the scene, making it more difficult to detect the face. The objective of this test is to check the robustness of the proposed tracker in real situations. In this case, the man in the video sequence moves away from the camera, the background of the scene presents several different textures and the illumination is frequently changing due to the shadows surrounding the man. Finally, it is worth mentioning that the snapshots from this video sequence present scale variation and the camera was not fixed while capturing the image frames. The sequence shown in Figure 2(b) is the
Fig. 2. Snapshots of the three face tracking test videos. (a) Outdoor video sequence, (b) HONDA video sequence and (c) indoor video sequence with partial occlusion.
HONDA data video. The woman in the snapshots is seated in front of the camera in an office. The background here is quite complex, including some windows in the room. This means that the environment receives natural and artificial illumination at the same time. In this test, the proposed method detects and tracks the face even under changing brightness. Finally, the detection and tracking are successful even when the person looks at some point on the wall and moves the forehead to look up. The effectiveness of the proposed method under face rotation and partial occlusion is also checked, using the third test video, an indoor sequence of frames presenting such situations (see Figure 2(c)). In spite of such problems, the target face is correctly tracked through the frames, as exemplified by the snapshots shown in the figure. Furthermore, an assessment considering the real target trajectory in the video and the face position estimated by the proposed method is presented. First, the nose position of each individual is regarded as the real face position. In Figure 3, it is possible to observe the tracking results for the first video sequence used in this work. It should be noticed that the estimated positions (triangles) in the picture are shifted relative
Fig. 3. Estimated (triangles) and real (stars) face trajectories in the outdoor video sequence (left) and in the HONDA one (right)
to the real face position (stars), which is not a drawback for DLSVT. According to the figures, the faces are found in a search window centered at the position provided by the Kalman filter, and the faces are correctly detected. As an important remark, it should be mentioned that the proposed method reached 99% of correct face tracking for the tested videos. The face detector proposed by Viola and Jones [5] was also applied to the same videos used to test our method, and the result is that it failed in every frame in which the person showed a partial profile. It also failed in the frames in which the person looks up or down. The Haar features used there are low-cost and effective for frontal face detection, but are not indicated for faces at arbitrary poses. In contrast, the Gabor features used here increase the computational complexity, although still being efficient, but meaningfully improve the performance of the face detector, as the results reported here show. Thus, compared to the cascade detector in [5] (with 32 layers and 4297 features), our method is more effective for detecting multi-view faces. Actually, there are several works in the literature proposing face detection using SVM classifiers. In the work of Heisele [19], the face detector reaches 90% of correct face detection. The authors use PCA and Haar features to represent gray-level face images. In [20], a hierarchy of SVM classifiers with different resolutions is used in order to speed up the overall system, and the method presented 80% of correct face detection. In the work of Osuna [21], an index of 97% of correct face detection is reported. However, it was tested only with frontal faces in gray-level images. In a more recent work [22], 3 SVM classifiers are trained to detect faces in multiple views. An ensemble mechanism (SVM regression) is introduced to combine the decisions obtained from the view-specific SVM classifiers and make the final decision. The authors report 91% of correct face detection. Wang and Ji [23] remarked that in the real world the face poses may vary greatly and many SVMs are needed. They proposed an approach combining cascade and bagging for multi-view face detection. Namely, a cascade of SVMs is first trained through bootstrapping. The remaining positive and negative examples are then randomly partitioned to train a set of SVMs, whose outputs are combined through majority voting. The method achieved 93% of correct face detection.
For the DLSVT, the tracking speed is 2 frames per second on a standard PC with a Dual Core CPU, over a Matlab platform. This frame rate includes all processing tasks, such as image reading, skin and face detection, next-target-position estimation, in-frame result assignment, and plotting the rectangles for the detected face and for the position foreseen for the next frame, respectively. As a result of all the tests performed, the Dynamic Local Support Vector Tracker performs multi-view face detection and tracking in both indoor and outdoor video sequences, with 99% of correct face tracking, thus exhibiting higher performance when compared to other methods available in the literature.
6 Conclusions
An efficient face detection and tracking method, the Dynamic Local Support Vector Tracker, is proposed in this paper, which has shown good results on poor-resolution videos, even when the image is affected by realistic effects such as scale, rotation, light changes, partial occlusion, and so on. The skin color region detection has shown to be effective in detecting regions of the image that could be faces. The contribution of the work to improving arbitrary-pose face tracking is the association of face detection using local SVM with next-face-position estimation based on a Kalman filter. A comparative study with well-known face detection methods has also been performed, validating the proposed approach. As future work, a study on the use of different SVM kernels is under development, and the code is being exported to an executable one, to better analyze the computational efficiency of the proposed method.
References 1. Gong, S., McKenna, S., Psarrou, A.: Dynamic Vision from Images to Face Recognition, 1st edn. Imperial College Press, Clarendon (2000) 2. Fröba, B., Ernst, A.: Fast Frontal-View Face Detection Using a Multi-Path Decision Tree. In: Proc. Of Audio and Video based Biometric Person Authentication, Guildford, Uk (June 2003) 3. Liu, C.: A Bayesian Discriminating Features Method for Face Detection. IEEE Trans. on PAMI 25(6), 725–740 (2003) 4. Louis, W., Plataniotis, K.: Frontal Face Detection for Surveillance Purposes using Dual Local Binary Patterns Features. In: Proc. of IEEE International Conference on Image Processing (ICIP), Hong Kong, pp. 3809–3812, September 26-29 (2010) 5. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: Proc. of CVPR, Crete, Greece, December 8-14 (2001) 6. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting Faces in Images: A Survey. IEEE Transaction on Pattern Analysis and Machine Intelligence 24(1), 34–58 (2002) 7. Castañeda, B., Luzanov, Y., Cockburn, J.C.: Implementation of a Modular Real-Time Feature-Based Architecture Applied to Visual Face Tracking. In: Proc. of the 17th International Conference on Pattern Recognition, Cambridge, UK, August 23-26, pp. 167–170 (2004)
8. Ruan, J., Yin, J.: Face Detection Based on Facial Features and Linear Support Vector Machines. In: Proc. of the International Conference on Communication Software and Networks, pp. 371–375, February 20-22 (2009) 9. Hotta, K.: Adaptive Weighting of Local Classifiers by Particle Filters for Robust Tracking. Pattern Recognition 42(5), 619–628 (2009) 10. Yun, J.-U., Lee, H.-J., Paul, A.K., Baek, J.-H.: Face Detection for Video Summary Using Illumination-Compensation and Morphological Processing. Pattern Recognition Letters 30(9), 856–860 (2009) 11. Pai, Y.T., Ruan, S.J., Shie, M.C., Liu, Y.C.: A Simple and Accurate Color Face Detection Algorithm in Complex Background. In: IEEE International Conference on Multimedia and Expo, Toronto Canadá, July 9-12, pp. 1545–1548 (2006) 12. Gayathri. Face: A Skin Color Matlab Code. Software (2001), http://www.mathworks.com/matlabcentral/fileexchange/ 24851-illumumination-compensation-in-rgbspace?controller=file_infos&download=true 13. Kovac, P., Peer, P., Solina, F.: Human skin colour clustering for face detection. In: EUROCON (2003) 14. Li, S.Z., Jain, A.K.: Handbook of Face Recognition. Springer, Heidelberg (2005) 15. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. submitted to Data Mining and Knowledge Discovery (1998), http://svm.research.bell-labs.com/SVMdoc.html 16. Joachims, T.: Making large-Scale SVM Learning Practical. Advances in Kernel Methods Support Vector Learning. In: Schölkopf, B., Burges, C., Smola, A. (eds.), MIT-Press, Redmond (1999), Software (1999), http://svmlight.joachims.org 17. Bishop, C.: Pattern Recognition and Machine Learning, 1st edn., p. 740. Springer, Heidelberg (2006) 18. HONDA database video, http://vision.ucsd.edu/~leekc/HondaUCSDVideoDatabase/ HondaUCSD.html 19. Heisele, B., Serre, T., Prentice, S., Poggio, T.: Hierarchical Classification and Feature Reduction for Fast face detection with support vector machines. Pattern Recognition 36, 2007–2017 (2003) 20. Romdhani, S., Torr, P., Scholkopf, B., Blake, A.: Computationally Efficient Face Detection. In: Proc. of ICCV, Vancouver, July 7-14, pp. 695–700 (2001) 21. Osuna, E., Freund, R., Girosi, F.: Training Support Vector Machines: An Application to Face Detection. In: Proc. of CVPR, San Juan, Puerto Rico, July 17-19, pp. 130–136 (1997) 22. Yan, J., Li, S., Zhu, S., Zhang, H.: Ensemble SVM Regression Based Multi-View Face Detection System. Technical report, Microsoft Research, MSR-TR-2001-09 (2001) 23. Wang, P., Ji, Q.: Multi-view Face Detection under Complex Scene Based on Combined SVMs. In: Proc. of ICPR, Cambridge, UK, August 23-26 (2004)
Model-Based Chart Image Classification
Ales Mishchenko¹ and Natalia Vassilieva²
¹ CEA, Centre de Grenoble, 17 Martyrs str., 38054 Grenoble Cedex 9, France
² HP Labs, 1 Artillerijskaya str., 191104, St. Petersburg, Russia
Abstract. Recognition and classification of charts is an important part of analysis of scientific and financial documents. This paper presents a novel model-based method for classifying images of charts. Particularly designed chart edge models reflect typical shapes and spatial layouts of chart elements for different chart types. The classification process consists of two stages. First, chart location and size are predicted based on the analysis of color distribution in the input image. Second, a set of image edges is extracted and matched with the chart edge models in order to find the best match. The proposed approach was extensively tested against the state-of-the-art supervised learning methods and showed high accuracy, comparable to that of the best supervised approaches. The proposed model-based approach has several advantages: it doesn’t require supervised learning and it uses the high-level features, which are necessary for further steps of data extraction and semantic interpretation of chart images.
1 Introduction
Chart images in digital documents are an important source of valuable information that is largely under-utilized for data indexing and information extraction purposes. Classification of images by chart type is an important step in chart image understanding, as it drives the subsequent steps of data extraction and semantic interpretation. The major challenge in chart image classification is dealing with the variability of the structure, visual appearance and context of charts belonging to the same type. Structural variability can be illustrated by 2D and 3D pie charts with different shapes and numbers of sectors: these charts differ significantly in their structure, but are perceived as pie charts by the human eye. Appearance variability corresponds to the variety of color palettes, shadings and fill effects used for the same chart type. Context variability includes variability of chart surroundings, such as annotations, legends, axes, grids, etc. To overcome this challenge, we perform a general color and spatial analysis of the input image as a first step of our method, and estimate the location and size of the chart elements. Based on the obtained estimates, we are able to extract features invariant to chart size, location and orientation. As a second step, we use a model-based approach to classify a given chart into one of the predefined types. Currently we support five commonly used chart types: column, bar, pie,
HP Labs contractor during the work on this paper.
line and area charts. We have designed an edge model for every chart type from the above-mentioned list. The designed models are invariant to inter-class variability of charts and support the data extraction and interpretation step, which follows the classification in the developed system of chart recognition and understanding. The discussion of data extraction and interpretation is out of the scope of the given paper. The classification is performed by looking for the best match between a given chart image and one of the designed chart models. The rest of the paper is organized as follows. After a review of the related work in section 2, our solution to chart image classification is proposed in section 3. The experimental setup is described in section 4, followed by the discussion of the experimental results in section 5. Section 6 concludes the paper.
2 Related Work
Recognition of special types of graphics is an area of intensive research. The survey of diagrams recognition and treatment can be found in [1]. A chart is a type of diagram that graphically represents tabular numeric data, functions or a set of qualitative data. The majority of existing approaches to chart image classification and understanding were developed within the scenario of image features extraction followed by a feature-based comparison. The latter varies from comparison of a test image with training images (Learning-based approach, such as [2]) to comparison of a test image with abstract models, representing particular classes (Model-based approach, such as [3, 4]). Another classification of approaches is by the type of extracted features. According to it, all methods can be divided into the following types: low-level, middle-level and high-level. An example of low-level chart classification is the usage of Hough transform [5–7]. This approach was proved to work well with bar and column charts [7], but it has a number of drawbacks when applied to other chart types [3]. In particular, Hough transform can be ineffective with a large amount of line segments and does not provide connection between image features and chart elements. This makes the subsequent data interpretation step difficult. Examples of middle-level approaches are Multiple-Instance Learning (MIL) [8] and approaches based on shape and spatial relationships of chart primitives, such as [9]. In [8], authors used edge-based features, invariant to translation, rotation and scaling (a number of edges for each edge type; an ordering relationship between the edges; a number of parallel edge pairs and a number of symmetric axes). In [9], the SVMs-based categorization was accomplished by using other middle-level feature sets (region segmentation parameters, curve saliency, histograms of oriented gradients and scale-invariant feature transform). Middle-level approaches can be effective in classification of chart elements, such as rectangles and sectors, but they are unable to reflect the global structure of a chart, such as angular completeness of a pie, radial homogeneity of a pie or correct X-Y correspondence for line/area charts. For example, the point of intersection of curve-plots is locally-similar to a pie. In many cases such locally-similar images can be misclassified.
Examples of high-level approaches are the model-based chart classification described in [3, 4] and the learning-based classification described in [10]. The authors detect basic shapes (rectangles, arcs, sectors, line segments, etc.) in the image and then compare them to models of chart elements [3, 4] or use learning-based classifiers [10]. Our approach is similar to [3] and [4], with the difference that we use models with statistical goodness-of-fit measures for model matching, and we use a preprocessing step to detect a chart (or to decide that the image does not contain a chart). As a result, our method is designed to process a wider class of images and to provide higher accuracy. Chart classification is preceded by the task of chart detection, which employs general image classification methods to distinguish charts from non-chart images. The range of these methods includes SVM [11], block statistics and HMM [12], subwindow extraction and supervised learning [13], MIL [14], clustering of image features [15], etc. Special attention to the task of distinguishing some particular charts (line plots) from natural images (photographs) was given in [2].
3 The Proposed Method
Our solution uses a model-based approach to chart image classification. It consists of modeling a number of predefined object classes with abstract models (offline modeling stage) and matching an input image with these models (online classification stage). We exploit the widely used "edge structure models" ("edge models" for short), representing the geometrical and topological structure of image edges [16, 17].
3.1 Chart Models
We have designed edge models for pie, column, bar, line and area charts. The edge model is a spatial structure consisting of line segments and circular/elliptical arcs. It reflects the typical shape and spatial layout of chart elements for a given chart type. Every edge model is provided with a set of goodness-of-fit criteria to measure the discrepancy between the observed image edge set and the edges expected under the model.
Pie: The model for 2D/3D pies is a set of line segments (radii) and circular/elliptical arcs with the following constraints: all radii converge at the same point (the center of the pie); the opposite end of every radius is connected to the endpoints of the two neighboring radii by arc segments; all radii are equal (2D pie) or the lengths of the radii satisfy the elliptical equation (3D pie); arcs are parts of the same circle (2D pie) or ellipse (3D pie); the center of this circle/ellipse coincides with the center of the radii. The goodness-of-fit criteria are: variation of the lengths of the radii; variation in the curvature of the arcs; measure of fit of the arcs to a single circle/ellipse; distance between the center of the circle/ellipse and the center of the radii.
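As an illustration of the goodness-of-fit criteria for the 2D pie model, the sketch below computes three of the listed discrepancies (radius-length variation, arc fit to a single circle, and the distance between the fitted circle center and the radii center) from detected radii endpoints and arc samples. It is a simplified example under our own normalization choices; the curvature-variation criterion and the elliptical (3D) case are omitted, and how the individual criteria are weighted is left to the caller.

import numpy as np

def pie_goodness_of_fit(center, radii_endpoints, arc_points):
    # center: (x, y) where the detected radii converge; radii_endpoints: (N, 2) outer
    # endpoints of the radii; arc_points: (M, 2) points sampled on the detected arcs.
    center = np.asarray(center, dtype=float)
    radii = np.linalg.norm(radii_endpoints - center, axis=1)
    radius_variation = radii.std() / (radii.mean() + 1e-9)
    # Algebraic least-squares fit of a single circle to the arc points:
    # x^2 + y^2 = 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2).
    x, y = arc_points[:, 0], arc_points[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    (cx, cy, c), *_ = np.linalg.lstsq(A, x ** 2 + y ** 2, rcond=None)
    r_fit = np.sqrt(c + cx ** 2 + cy ** 2)
    arc_circle_fit = np.abs(np.hypot(x - cx, y - cy) - r_fit).mean() / (r_fit + 1e-9)
    center_distance = np.hypot(cx - center[0], cy - center[1]) / (radii.mean() + 1e-9)
    return {'radius_variation': radius_variation,
            'arc_circle_fit': arc_circle_fit,
            'center_distance': center_distance}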
Column/Bar: The model for columns/bars is a disconnected graph consisting of a set of aligned rectangles. Their alignment determines the coordinate axes (visible or invisible). Size and inter-location constraints are: each rectangle has a base side, lying on the same coordinate axis (the x-axis for column charts and the y-axis for bar charts); the lengths of these sides are equal. The goodness-of-fit criteria are: variation in width of rectangles; variation in alignment: sides orientation, base side location; quality of axes detection: lengths, perpendicularity, alignment with rectangles.
Area: The model for areas is a closed polyline with two segments parallel to the y-axis. “Bottom” ends of these segments are connected by a single segment parallel to the x-axis; “top” ends are connected by a chain of line segments representing a polyline function with respect to the horizontal axis. The goodness-of-fit criteria are: completeness of the polyline and uniqueness of the polyline function values; quality of axes detection: lengths, perpendicularity, alignment with area segments.
Line: The model for lines is the above-mentioned polyline alone.
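The paper does not give closed-form expressions for these goodness-of-fit criteria. Purely as an illustration, a minimal Python/NumPy sketch of how the pie criteria might be scored from a set of detected radius endpoints and arc points could look as follows; the crude circle-center estimate and the unweighted sum of terms are our assumptions, not part of the proposed method.

import numpy as np

def pie_goodness_of_fit(center, radius_endpoints, arc_points):
    # Variation of the lengths of the radii (small for a well-formed 2D pie).
    lengths = np.linalg.norm(radius_endpoints - center, axis=1)
    radius_var = np.std(lengths) / (np.mean(lengths) + 1e-9)
    # Fit of the arcs to a single circle, approximated here by the spread of
    # arc-point distances around a crude center estimate.
    arc_center = arc_points.mean(axis=0)
    arc_r = np.linalg.norm(arc_points - arc_center, axis=1)
    circle_residual = np.std(arc_r) / (np.mean(arc_r) + 1e-9)
    # Distance between the circle center and the center of the radii.
    center_offset = np.linalg.norm(arc_center - center) / (np.mean(lengths) + 1e-9)
    # Smaller is better; the three normalized discrepancies are simply summed.
    return radius_var + circle_residual + center_offset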
3.2 Chart Classification
During the classification we first perform analysis of general image features in order to estimate the presence of a chart on an image and spatial parameters of chart elements (preprocessing step). Then we extract image edges and match them with chart edge models in order to find the best match (matching step).
Fig. 1. Preprocessing step in the proposed method
Preprocessing. The preprocessing step is summarized in Figure 1. We use a 3D color histogram with spatial constraints to perform a general analysis of every image. We define a membership function $W : \{(x, y)\} \to [0, 1]$ as a function of the distance from the estimated center location of a chart to a given pixel. $W(x, y) = 1$ for pixels at the center of a chart, whereas $W(x, y) = 0$ for
pixels outside the estimated chart area. The value for a bin $k$ of a histogram with spatial constraints for an image $I(x, y)$ is calculated as follows:
$$H_k = \sum_{x,y} \begin{cases} W(x, y), & \text{if } I(x, y) = \mathrm{color}_k \\ 0, & \text{otherwise.} \end{cases}$$
Thus image pixels outside the chart area do not influence the estimation of the size and location of the chart data components. The process of chart location and size estimation is iterative. At the first iteration, when no estimate of the chart size and location is available, $W(x, y) = 1$ for all pixels of the image. The histogram structure in terms of peaks and valleys is analyzed in order to estimate the size, color and location of chart data components. In a chart image, major peaks in the histogram are usually clearly resolved. The peaks typically correspond to the following elements: background, chart data components, and labeling (legends, axes, tick marks, etc.). The peak values provide a size estimate for the corresponding chart elements. The highest peak value is considered to represent the background, while other major peaks are considered to represent chart data components. Determination of peaks may be less trivial when the chart coloring scheme includes shadows or gradients. In this case, reducing the number of histogram bins (color space quantization) makes it possible to determine the histogram peaks and thus perform size estimation of chart elements for the majority of chart images. The results of size estimation make it possible to make a decision about further processing of an image. If the histogram does not contain clearly separable peaks, or the size of major peaks does not meet the set of predefined heuristic-based constraints, the image is considered to be a non-chart image and the processing of this image is stopped. Otherwise, the spatial distribution of colors corresponding to different peaks is analyzed in order to estimate the location of data components. Then, the overall chart location and size (coordinates of the center, width and height of a chart) are estimated based on the locations and sizes of the chart data components. The estimate of the chart size and location is used to calculate the values of the membership function. In case there are other major regions of uniform color (such as thick borders, filled text areas, etc.), these regions may lead to a wrong estimation of the size of the chart data components. However, if these regions are smaller than the chart data components, their influence is corrected during the next iteration using the chart location information (the spatial constraints refinement step shown in Figure 1).
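For illustration, a minimal Python/NumPy sketch of the spatially constrained color histogram $H_k$ described above is given below; the particular membership function (a simple radial ramp) and the per-channel quantization level are our assumptions, since the paper does not specify them.

import numpy as np

def spatial_color_histogram(image, center, half_size, levels=8):
    # image: H x W x 3 uint8 array; center = (cx, cy) and half_size are the
    # current estimates of the chart location and extent (pass a very large
    # half_size on the first iteration so that W(x, y) = 1 everywhere).
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Membership weight W(x, y): 1 at the estimated chart center, decreasing
    # linearly to 0 at the estimated chart boundary.
    dist = np.hypot(xs - center[0], ys - center[1])
    weight = np.clip(1.0 - dist / float(half_size), 0.0, 1.0)
    # 3D color histogram: quantize each channel into `levels` bins and map the
    # (r, g, b) triple to a single bin index k.
    q = (image.astype(int) * levels) // 256
    k = (q[..., 0] * levels + q[..., 1]) * levels + q[..., 2]
    # H_k accumulates W(x, y) over all pixels whose quantized color falls in bin k.
    hist = np.bincount(k.ravel(), weights=weight.ravel(), minlength=levels ** 3)
    return hist / (hist.sum() + 1e-9)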
Fig. 2. Matching step in the proposed method
Matching the Edge Set to Models. The matching step is summarized in Figure 2. First, an edge set is extracted from an input image by performing edge detection, thinning, linking and vectorization. Edges are detected separately for every color component by the Canny edge detector [18] and combined together. Edge thinning is performed by applying the algorithm described in [19]. Lines and arcs are extracted by edge linking and vectorization, applying algorithms similar to those described in [20] and [21]. The edges which lie within the estimated chart area and have a size approximately equal to the estimated size of the chart data components are included into the edge set. Second, the obtained edge set is matched to the models of the chart types described in Section 3.1. Matching the image edge set to the chart models is a process of competitive classification of edges, based on their geometrical features, into subsets corresponding to the given chart models. For example, to match an observed edge set to the pie model, we look for a subset of straight edges converging at the same point (the set of radii) and for a subset of elliptical arcs, maximizing the goodness-of-fit criteria for the pie. We use exhaustive search for this maximization. It is feasible because, for the majority of chart images, only a few vectorized edges remain after thresholding by the estimated chart size. Similarly, we match the observed image edge set to all available chart models, which results in a number of image edge subsets, each matching a particular chart model. The goodness-of-fit criteria are used to measure the discrepancy between the image edge subset and the edges expected under the corresponding chart model. The result of this measurement is an estimate of how close the observed image edge set is to each of the designed chart models. Then a voting procedure is performed based on the values of the goodness-of-fit measures, leading to the classification decision.
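The paper does not spell out the voting rule itself. Purely as an illustration of how the per-model fit scores could be combined into a decision, a Python sketch is given below; the rejection threshold and the choice of the single best score as the vote are our assumptions.

def classify_chart(edge_set, models, reject_threshold):
    # models: dict mapping a chart type name to a matcher that returns
    # (best_edge_subset, fit_score), with larger scores meaning a better fit.
    scores = {}
    for name, matcher in models.items():
        # Exhaustive search over edge subsets is feasible because only a few
        # vectorized edges survive the size/location thresholding.
        _subset, score = matcher(edge_set)
        scores[name] = score
    best = max(scores, key=scores.get)
    # Vote: accept the best-fitting model only if its fit is good enough.
    return best if scores[best] >= reject_threshold else "not-a-chart"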
4 Experimental Setup
We conducted a set of experiments to evaluate the proposed model-based solution in comparison with the common supervised machine learning methods of classification. To the best of our knowledge, no comparative study of modelbased and learning-based approaches in the context of chart image classification task has been done before.
4.1 Methods for Comparison
We used the publicly available WEKA Data Mining Software [22] in our experiments. This software provides implementations of many state-of-the-art machine learning methods, including common baselines for comparative evaluation of classifiers. We conducted experiments with the following methods using their WEKA implementations (the categorization and names of the methods are given according to the WEKA package):
bayes: BayesNet, NaiveBayes, NaiveBayesUpdatable.
functions: Logistic, RBFNetwork, SimpleLogistic, SMO.
meta: AttributeSelectedClassifier (J48 as classifier, CfsSubsetEval for attribute selection), Bagging (REPTree as classifier), ClassificationViaRegression (M5P as classifier), Decorate (J48 as classifier), FilteredClassifier (J48 as classifier), LogitBoost (DecisionStump as classifier), MultiClassClassifier (Logistic as classifier).
misc: HyperPipes, VFI.
rules: ConjunctiveRule, DecisionTable, JRip, NNge, OneR, PART, Ridor.
trees: DecisionStump, J48, LMT, NBTree, REPTree, RandomForest.
Default WEKA parameters were used for all methods in the experiments.
4.2 Dataset
The experiments were conducted with a dataset of 980 chart images generated using the XML/SWF Charts tool (http://www.maani.us/xml_charts). Data for the charts was generated randomly. Images were collected as screenshots, which led to blurred edges and small noise components due to anti-aliasing. In the experiments with learning-based classifiers, 33% of the dataset was used for training and the rest was used for testing.
4.3 Experimental Procedure
Every image from the dataset was preprocessed, and its edge set was extracted and vectorized as described in Section 3.2. This edge set was used by the model-based classifier to predict the type of an image. The same edge set was used to extract features for the learning-based methods. The feature set included statistics on edges similar to those used in [8]. Line and arc segments from the edge set were grouped by their size (resulting in groups of edges of similar size) and by their connectivity (resulting in groups of connected edges). The feature set consisted of statistical measures of sizes, shapes, inter-locations and connections within each group. The parameters of the grouping were optimized to provide the highest accuracy for the given dataset. We used category-specific accuracy and average accuracy metrics for evaluating classification decisions. They are calculated as follows:
$$A_c = \frac{TP_c}{TP_c + FN_c}, \qquad A_{avg} = \frac{1}{m}\sum_{c \in C} A_c,$$
where $TP_c$ is the number of true positives and $FN_c$ is the number of false negatives with respect to a specific category $c \in C \equiv \{c_1, \ldots, c_m\}$.
5 Results
The experimental results of the model-based classification method are summarized in Table 1.

Table 1. Confusion matrix for the model-based classification

                          Predicted value                              Num. of
Actual value   line         area         column       pie              images
line           191 (100%)   0            0            0                191
area           9            191 (96%)    0            0                200
column         2            5            191 (96%)    0                198
2D pie         34           17           0            142 (74%)        193
3D pie         3            27           0            168 (85%)        198
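As a check, the reported per-class and average accuracies can be recomputed directly from Table 1; the short Python snippet below does so (row order and counts are taken from the table).

# Rows of Table 1: predicted counts for (line, area, column, pie) per actual class.
confusion = {
    "line":   [191, 0, 0, 0],
    "area":   [9, 191, 0, 0],
    "column": [2, 5, 191, 0],
    "2D pie": [34, 17, 0, 142],
    "3D pie": [3, 27, 0, 168],
}
correct_col = {"line": 0, "area": 1, "column": 2, "2D pie": 3, "3D pie": 3}

accs = {}
for cls, row in confusion.items():
    tp = row[correct_col[cls]]
    accs[cls] = tp / float(sum(row))      # A_c = TP_c / (TP_c + FN_c)

avg = sum(accs.values()) / len(accs)      # approximately 0.90, matching the reported 90%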
The average accuracy for model-based chart classification is 90% according to the experimental results. The maximum category-specific accuracy is obtained for line chart images, at the level of 100%; the lowest category-specific accuracy is obtained for pie charts (74% for 2D pies and 84% for 3D pies). Other papers on chart classification report close results (although using different chart image databases): 83% average accuracy is reported in [9] (with a maximum of 90% for column images), and 76% average accuracy is reported in [8]. The learning-based methods demonstrated performance comparable with the results of the proposed model-based solution. The best results among the learning-based methods were obtained by LogitBoost, Filtered, BayesNet and NNge, with more than 90% recall for pies, columns/bars and lines, and a recall of 69%-77% for areas (see Fig. 3).
6 Conclusions
In this paper, we propose a novel approach that performs model-based chart image classification. The proposed model-based classifier does not need supervised learning and relies on edge models of chart types, which contain structural information about typical edges in images of a given chart type. The proposed approach was extensively tested against a number of supervised learning approaches and showed comparable accuracy. Implementation of the proposed method can be parallelized in feature extraction as well as in model-matching steps.
Fig. 3. Results of the best learning-based methods
The main advantage of the proposed approach is that it is based on high-level features, which are useful for further high-level interpretation of charts and extraction of numerical data. It performs chart image classification and data component detection in one pass. Another advantage is that model-based classification does not need supervised learning. Due to the variability of design among charts of the same type, training samples may contradict each other (as do some examples from [9]) and a training set might never be complete. The main drawback of the proposed method, common to all model-based approaches, is the necessity of involving humans to design a model. However, in the case of charts, adding a new model to the classifier may require just a small amount of an operator's work. This simplicity of chart models makes the model-based approach one of the preferable directions in chart image classification. In the current implementation of the proposed solution, all chart models (column, bar, pie, line and area) are hard-coded. In the future a simple description language can be proposed for chart models, so that an operator will be able to add new chart types without changing the code.
References 1. Blostein, D., Lank, E., Zanibbi, R.: Treatment of diagrams in document image analysis. In: Proceedings of the International Conference on Theory and Application of Diagrams, pp. 330–334 (2000) 2. Lu, X., Kataria, S., Brouwer, W.J., Wang, J.Z., Mitra, P.: Automated analysis of images in documents for intelligent document search. International Journal on Document Analysis and Recognition 12 (2009) 3. Huang, W., Tan, C.-L., Leow, W.-K.: Model-based chart image recognition. In: Llad´ os, J., Kwon, Y.-B. (eds.) GREC 2003. LNCS, vol. 3088, pp. 87–99. Springer, Heidelberg (2004)
4. Huang, W., Tan, C.L., Leow, W.K.: Elliptic arc vectorization for 3d pie chart recognition. In: ICIP, pp. 2889–2892 (2004) 5. Zhou, Y.P., Tan, C.L.: Hough technique for bar charts detection and recognition in document images. In: International Conference on Image Processing, vol. 2, pp. 605–608 (2000) 6. Zhou, Y.P., Tan, C.L.: Learning-based scientific chart recognition. In: 4th IAPR International Workshop on Graphics Recognition, pp. 482–492 (2001) 7. Zhou, Y.P., Tan, C.-L.: Bar charts recognition using hough based syntactic segmentation. In: Anderson, M., Cheng, P., Haarslev, V. (eds.) Diagrams 2000. LNCS (LNAI), vol. 1889, pp. 494–497. Springer, Heidelberg (2000) 8. Huang, W., Zong, S., Tan, C.L.: Chart image classification using multiple-instance learning. In: WACV, p. 27 (2007) 9. Prasad, V.S.N., Siddiquie, B., Golbeck, J., Davis, L.S.: Classifying computer generated charts. In: Workshop on Content Based Multimedia Indexing, pp. 85–92 (2007) 10. Shao, M., Futrelle, R.P.: Recognition and classification of figures in PDF documents. In: Liu, W., Llad´ os, J. (eds.) GREC 2005. LNCS, vol. 3926, pp. 231–242. Springer, Heidelberg (2006) 11. Chapelle, O., Haffner, P., Vapnik, V.: Support vector machines for histogram-based image classification. IEEE Trans. Neural Netw. 10(5), 1055–1064 (1999) 12. Li, J., Najmi, A., Gray, R.M.: Image classification by a two-dimensional hidden markov model. IEEE Trans. Signal Process. 48(2), 517–533 (2000) 13. Maree, R., Geurts, P., Piater, J., Wehenkel, L.: Random subwindows for robust image classification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 34–40 (2005) 14. Yang, C., Dong, M., Fotouhi, F.: Region based image annotation through multipleinstance learning. In: Proceedings of the ACM International Conference on Multimedia, pp. 435–438 (2005) 15. Li, J., Wang, J.: Real-time computerized annotation of pictures. In: Proceedings of the ACM International Conference on Multimedia, pp. 911–920 (2006) 16. Biederman, I.: Human image understanding: Recent experiments and a theory. In: Computer Vision, Graphics and Image Processing, vol. 32, pp. 29–73 (1985) 17. Mundy, J.L., Heller, A.: The evolution and testing of a model-based object recognition system. In: ICCV, pp. 268–282 (1990) 18. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–698 (1986) 19. Kumar, P., Bhatnagar, D., Rao, P.S.U.: Pseudo one pass thinning algorithm. Pattern Recognition Letters 12, 543–555 (1991) 20. Liu, S.M., Lin, N.C., Liang, C.C.: An iterative edge linking algorithm with noise removal capability. In: Proceedings of ICPR, pp. 1120–1122 (1988) 21. Song, J., Su, F., Chen, J., Tai, C.L., Cai, S.: Line net global vectorization: an algorithm and its performance analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 383–388 (2000) 22. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18 (2009)
Kernel-Based Motion-Blurred Target Tracking
Yi Wu 1,2,3,*, Jing Hu 5, Feng Li 4, Erkang Cheng 3, Jingyi Yu 4, and Haibin Ling 3
1 Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing, 210044
2 School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing, 210044
3 Center for Information Science and Technology, Computer and Information Science Department, Temple University, Philadelphia, PA, USA
{wuyi,hbling,tuc33610}@temple.edu
4 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
{feli,yu}@cis.udel.edu
5 Network Center, Nanjing University of Information Science & Technology, Nanjing, 210044
[email protected]
* This work was done when Yi Wu was with Temple University.
Abstract. Motion blurs are pervasive in real captured video data, especially for hand-held cameras and smartphone cameras because of their low frame rate and material quality. This paper presents a novel Kernelbased motion-Blurred target Tracking (KBT) approach to accurately locate objects in motion blurred video sequence, without explicitly performing deblurring. To model the underlying motion blurs, we first augment the target model by synthesizing a set of blurred templates from the target with different blur directions and strengths. These templates are then represented by color histograms regularized by an isotropic kernel. To locate the optimal position for each template, we choose to use the mean shift method for iterative optimization. Finally, the optimal region with maximum similarity to its corresponding template is considered as the target. To demonstrate the effectiveness and efficiency of our method, we collect several video sequences with severe motion blurs and compare KBT with other traditional trackers. Experimental results show that our KBT method can robustly and reliably track strong motion blurred targets.
1 Introduction
Object tracking is one of the most important tasks within the field of computer vision. It plays an important role in many applications, such as surveillance, robotics, human-computer interaction, and medical image analysis [17]. Most previous work on object tracking has focused on robustly handling noise [13], illumination [1,15], and occlusions [11,14]. A common assumption in these algorithms is that the video frames are blur-free. With the prevalence of cheap
consumer cameras and smartphone cameras, this assumption is not valid for most of the video data captured using these devices, due to the low frame rate, fast motion of the target and/or hand-shake. Because the visual features of the target and the observation models of trackers are destroyed, this degradation in appearance makes target inference very challenging in motion-blurred sequences. An extensive literature exists on deblurring, visual tracking and motion-blurred target tracking.
Deblurring. Intuitively, we could handle severe motion blurs in visual tracking by explicitly deblurring each frame. Previous approaches are usually based on regularization [16], image statistics [5,8], or edge priors [9]. Recently, sparse representation has been applied to deblurring [10,3]. Since image deconvolution is a highly ill-posed problem, the reconstructed latent image would have many visual artifacts, such as ringing effects, which destroy the visual features of the target and complicate the object tracking process. Moreover, the deblurring process is computationally expensive and therefore not suitable for real-time visual tracking tasks.
Visual tracking. Many tracking algorithms have been proposed to overcome tracking difficulties such as occlusion, background clutter, and illumination changes. In [2], the mean-shift algorithm was adopted to find the optimal location for the target. Isard and Blake [6] treat tracking as a state sequence estimation problem and use sequential Bayesian inference coupled with Monte Carlo sampling for the solution. Pérez et al. [12] proposed to integrate the HSV color histogram into the sequential Bayesian inference tracking framework.
Motion-blurred target tracking. The motion-blurred target tracking problem was first addressed in [7] and then further investigated in [4]. In [7], the blurred target regions are estimated by computing the matching score in terms of the region deformation parameters and motion vectors, and then a local gradient descent technique is employed to find the optimal solution. Jin et al. [7] assume that the blurred target appears highly coherent in the video sequence and that the motion between frames is relatively small. In [4], a mean-shift tracker with motion-blurred templates is adopted for motion-blurred target tracking. Although our KBT and [4] share some similarity in using mean shift tracking with blurred templates, our method has several advantages: 1) [4] has to perform local blur classification before handling local motion blurs, while our KBT method not only does not need blur classification but also can effectively deal with both local blurs and global blurs. 2) Our KBT method does not need an off-line training process for blur estimation, while in [4] they have to collect and align a large number of blurred and non-blurred patches for complicated SVM training. They also suffer from an ambiguity problem with homogeneous regions in the training set. 3) Our KBT does not need a blur direction estimation process, while [4] needs the steerable filter to estimate the blur direction.
In this paper we present a novel Kernel-based motion-Blurred target Tracking (KBT) approach without explicitly performing deblurring. Our method incorporates the blur templates into the appearance space to model the blur
degradations. Specifically, to model the underlying blurs, we augment the target model by synthesizing various blurred templates of the target with different blur directions and strengths. We represent the templates using color histograms which are regularized by spatial masking with an isotropic kernel. Then we adopt the mean shift procedure to iteratively optimize the location of each template. Finally, the optimized region with maximum similarity to its corresponding template is considered as the target. To evaluate our method, we have collected several video sequences with significant motion blurs. We tested the proposed approach on these sequences and observed promising tracking performance in comparison with several other trackers. The rest of the paper is organized as follows. In the next section the kernel-based tracking approach is reviewed. After that, the blur modeling approach is proposed in Section 3. Experimental results are reported in Section 4. We conclude this paper in Section 5.
2 Kernel-Based Tracking
Kernel-based tracking [2] has proven to be very efficient. Inspired by this work, we use the mean shift procedure to optimize the target location. To handle the blur effects in the target's appearance, we introduce blur templates to augment the template set. This expanded set is useful for handling the underlying blurs.
2.1 Target Representation
To characterize the target, the target model is represented by its pdf (an m-bin color histogram) q in the feature space. In the subsequent frame, a target candidate at location y is characterized by the pdf p(y). Thus, the target model and candidate are represented by $\hat{q} = \{\hat{q}_u\}_{u=1}^{m}$ with $\sum_{u=1}^{m} \hat{q}_u = 1$, and $\hat{p}(y) = \{\hat{p}_u(y)\}_{u=1}^{m}$ with $\sum_{u=1}^{m} \hat{p}_u = 1$, respectively. A similarity function between $\hat{p}$ and $\hat{q}$ is denoted by $\hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}]$, whose local maxima in the image indicate the presence of objects having representations similar to the target model. To find the maxima of such functions, gradient-based optimization procedures are difficult to apply and only an expensive exhaustive search can be used. In kernel-based tracking [2], the similarity function is regularized by masking the objects with an isotropic kernel in the spatial domain. The kernel weights carry continuous spatial information. When the kernel weights are used in defining the feature space representations, $\hat{\rho}(y)$ becomes a smooth function in $y$. Thus, gradient-based optimization procedures can be applied to search for the target location efficiently.
2.2 Kernel Regularization
A differentiable kernel profile, k(x), yields a differentiable similarity function, and efficient gradient-based optimization procedures can be used for finding its
maxima. An isotropic kernel can assign smaller weights to pixels farther from the center. Because the peripheral pixels are less reliable and often affected by clutter, using these weights increases the robustness of the density estimation. Let $\{x_i\}_{i=1}^{n}$ be the pixel locations of the target model, centered at 0. Let $b(x_i)$ be the bin index of the pixel at location $x_i$ in the quantized feature space. The probability of the feature $u$ in the target model is then computed as
$$\hat{q}_u = C \sum_{i=1}^{n} k\left(\|x_i\|^2\right) \delta\left[b(x_i) - u\right], \quad (1)$$
where $\delta$ is the Kronecker delta function and the normalization constant $C$ is
$$C = \frac{1}{\sum_{i=1}^{n} k\left(\|x_i\|^2\right)}.$$
Let the center of the target candidate be at location $y$ in the current frame. Using the same kernel profile $k(x)$, but with bandwidth $h$, the probability of the feature $u$ in the target candidate is given by
$$\hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta\left[b(x_i) - u\right],$$
where
$$C_h = \frac{1}{\sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^2\right)}$$
is the normalization constant.
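As an illustrative sketch of the kernel-weighted histograms of (1) and of the candidate distribution, the Python/NumPy function below can be used; the 16x16x16 RGB quantization matches the experimental setting mentioned later, while the patch layout and all names are our assumptions.

import numpy as np

def epanechnikov(x2):
    # Profile k(.) applied to squared normalized distances; the constant factor
    # of the Epanechnikov profile cancels after normalization and is omitted.
    return np.where(x2 <= 1.0, 1.0 - x2, 0.0)

def kernel_histogram(patch, center, bandwidth, m=16 ** 3):
    # patch: N x 5 array with rows (x, y, r, g, b) for pixels inside the region.
    xy = patch[:, :2].astype(float)
    rgb = patch[:, 2:].astype(int)
    # b(x_i): 16 x 16 x 16 RGB quantization mapped to a single bin index.
    bins = (rgb[:, 0] // 16) * 256 + (rgb[:, 1] // 16) * 16 + rgb[:, 2] // 16
    x2 = np.sum(((xy - center) / bandwidth) ** 2, axis=1)
    w = epanechnikov(x2)
    hist = np.bincount(bins, weights=w, minlength=m)
    return hist / (hist.sum() + 1e-12)   # equals (1) up to the constant C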
2.3 Bhattacharyya Metric
In kernel-based tracking, the Bhattacharyya metric is adopted to accommodate comparisons among various targets. The distance between two discrete distributions is defined as
$$d(y) = \sqrt{1 - \rho\left[\hat{p}(y), \hat{q}\right]}, \quad (2)$$
where
$$\hat{\rho}(y) \equiv \rho\left[\hat{p}(y), \hat{q}\right] = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\,\hat{q}_u}$$
is the sample estimate of the Bhattacharyya coefficient between p and q.
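A direct transcription of (2), assuming the two histograms are already normalized:

import numpy as np

def bhattacharyya(p, q):
    return float(np.sum(np.sqrt(p * q)))                          # rho[p, q]

def bhattacharyya_distance(p, q):
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya(p, q))))    # d(y) in (2)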
2.4 Target Localization
To find the location corresponding to the target in the current frame, the distance (2) should be minimized as a function of $y$. The localization procedure starts from the position of the target in the previous frame and searches in the neighborhood.
Algorithm 1. Kernel-based tracking
1: Given: the target model $\{\hat{q}_u\}_{u=1}^{m}$ and its location $\hat{y}_0$ in the previous frame.
2: Initialize the location of the target in the current frame with $\hat{y}_0$ and compute $\{\hat{p}_u(\hat{y}_0)\}_{u=1}^{m}$.
3: Derive the weights $\{w_i\}_{i=1}^{n_h}$ according to (4).
4: Use (3) to get the new location $\hat{y}_1$ of the target candidate.
5: Compute $\{\hat{p}_u(\hat{y}_1)\}_{u=1}^{m}$.
6: If $\|\hat{y}_1 - \hat{y}_0\| < \varepsilon$, stop. Otherwise set $\hat{y}_0 \leftarrow \hat{y}_1$ and go to Step 2.
Since the distance function is smooth, the procedure uses gradient information provided by the mean shift vector. The mode of this density in the local neighborhood can be found by employing the mean shift procedure, where the kernel is recursively moved from the current location $\hat{y}_0$ to the new location $\hat{y}_1$ according to
$$\hat{y}_1 = \frac{\sum_{i=1}^{n_h} x_i w_i\, g\!\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\!\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)}, \quad (3)$$
where
$$g(x) = -k'(x), \qquad w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(\hat{y}_0)}}\; \delta\left[b(x_i) - u\right]. \quad (4)$$
The complete target localization algorithm is presented in Algorithm 1. In our implementation, a kernel with the Epanechnikov profile
$$k(x) = \begin{cases} \frac{1}{2} c_d^{-1} (d+2)(1 - x) & \text{if } x \le 1 \\ 0 & \text{otherwise} \end{cases} \quad (5)$$
is used. In this case, the derivative of the profile, $g(x)$, is constant and (3) reduces to a simple weighted average:
$$\hat{y}_1 = \frac{\sum_{i=1}^{n_h} x_i w_i}{\sum_{i=1}^{n_h} w_i}. \quad (6)$$
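With the Epanechnikov profile, one localization step therefore reduces to the weighted average (6). A minimal Python/NumPy sketch of the inner step of Algorithm 1 is shown below, reusing the hypothetical kernel_histogram layout introduced earlier; names are ours.

import numpy as np

def mean_shift_step(patch_xy, patch_bins, q_model, p_candidate):
    # Weights of (4): w_i = sum_u sqrt(q_u / p_u(y0)) * delta[b(x_i) - u].
    ratio = np.sqrt(q_model / (p_candidate + 1e-12))
    w = ratio[patch_bins]
    # New location (6): weighted average of the pixel coordinates.
    return (patch_xy * w[:, None]).sum(axis=0) / (w.sum() + 1e-12)

The outer loop recomputes the candidate histogram at the new location and repeats until the displacement falls below epsilon, as in Algorithm 1.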
3 Blur Modeling
To model the underlying blurs for visual tracking, the target model is augmented by synthesizing various blurred templates of the target with different blur directions and strengths. Let $I$ and $I_b$ be the blur-free and blurred images of a tracking target, respectively. $I_b$ can be modeled as convolving $I$ with a Gaussian blur kernel $k_v$: $I_b(p) = k_v \otimes I(p)$, where the vector $v$ encodes both the direction and the magnitude of the motion.
Algorithm 2. KBT tracking
1: Given: the target model set $\{\hat{q}^n\}_{n=1}^{N}$, where $\hat{q}^n = \{\hat{q}_u^n\}_{u=1}^{m}$, and its location $\hat{y}$ in the previous frame.
2: for each target model $n$ do
3:   Use Algorithm 1 to search for its optimized target location $\hat{y}_n$.
4:   Set the likelihood $\rho_n = \sum_{u=1}^{m} \sqrt{\hat{p}_u(\hat{y}_n)\,\hat{q}_u^n}$.
5: end for
6: Find the maximum value and the corresponding index $n^*$ of $\{\rho_n\}_{n=1}^{N}$.
7: Set the current target location to be $\hat{y} = \hat{y}_{n^*}$.
Since the kernel $k_v$ is symmetric, the motion blur kernel $k_v$ is therefore equivalent to $k_{-v}$. To capture different blur effects, the manually selected blur-free target template $t$ in the first frame is convolved with various blur kernels to generate blurred templates. Let the potential motion blurs be governed by the parameter pair $\theta$ and $l$, where $\theta$ is used for the motion direction and $l$ for the speed. In our implementation, $n_\theta = 8$ different directions $\Theta = \{\theta_1, \cdots, \theta_{n_\theta}\}$ and $n_l = 8$ different speeds $L = \{l_1, \cdots, l_{n_l}\}$ are used. Thus, we have $n_b = n_\theta \times n_l$ blur kernels $\{k_{\theta,l} : \theta \in \Theta, l \in L\}$, and the $(i, j)$th blur template is defined as $t_{i,j} = t \otimes k_{\theta_i, l_j}$. Consequently, the target template set is augmented from one single template to $N = n_b + 1$ templates. For each template, a kernel-regularized color histogram $\hat{q}^n = \{\hat{q}_u^n\}_{u=1}^{m}$ is extracted according to (1). Then the mean shift procedure is adopted to perform the location optimization for each template. Finally, the optimized region with maximum similarity to its corresponding template is considered as the target. The complete blurred target localization algorithm is presented in Algorithm 2.
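As an illustration of the template augmentation, the Python sketch below rasterizes a directional line-shaped kernel for each (theta, l) pair and convolves it with a single-channel template (apply per channel for color); the specific speed values and the rasterization scheme are our assumptions, since the paper lists only the counts of directions and speeds.

import numpy as np
from scipy.signal import convolve2d

def motion_blur_kernel(theta, length):
    # Rasterize a normalized line-shaped kernel with the given direction/length.
    size = int(length) | 1                     # odd kernel size
    k = np.zeros((size, size))
    c = size // 2
    for t in np.linspace(-length / 2.0, length / 2.0, 4 * size):
        x = int(np.clip(round(c + t * np.cos(theta)), 0, size - 1))
        y = int(np.clip(round(c + t * np.sin(theta)), 0, size - 1))
        k[y, x] = 1.0
    return k / k.sum()

def blurred_templates(template, n_theta=8, speeds=(3, 5, 7, 9, 11, 13, 15, 17)):
    # Return the original template plus n_theta * len(speeds) blurred copies.
    out = [template]
    for theta in np.linspace(0.0, np.pi, n_theta, endpoint=False):
        for l in speeds:
            k = motion_blur_kernel(theta, l)
            out.append(convolve2d(template, k, mode="same", boundary="symm"))
    return out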
4 Experiments
Our KBT tracker was applied to many sequences. Here, we present some representative results. In all the sequences, motion blurs are severe and result in the blending of adjacent colors. We use the Epanechnikov profile for histogram computations, and the mean shift iterations were based on the weighted average (6). We compared the proposed KBT algorithm with other traditional trackers: the Mean Shift tracker (MS) [2] and the Color-based Particle Filtering tracker (CPF) [12]. All three trackers adopt the RGB color space as the feature space, which is quantized into 16 × 16 × 16 bins. In our experiments, for each tracker we used the same parameters for all of the test sequences. We first test our algorithm on the sequence owl. The target in sequence owl is a planar object, which is frequently and severely blurred. Fig. 1 shows a sample of tracking results using different schemes on the owl sequence. We can see that
Fig. 1. Tracking comparison results of different algorithms on sequence owl (#22, #54, #68, #117, #151). Three examples of CPF, MS and KBT are shown in the rows from top to bottom respectively.
Fig. 2. Tracking comparison results of different algorithms on sequence face (#64, #77, #89, #152, #170). Three examples of CPF, MS and KBT are shown in the rows from top to bottom respectively.
when the target moves fast and is severely blurred, the MS and CPF trackers could not follow it, while our proposed KBT can track the target throughout the sequence. The image results for face are illustrated in Fig. 2. Our proposed KBT achieves better results than the other two trackers. Fig. 3 illustrates the tracking results on sequence body. The target is moving and is severely blurred. Again, our tracker successfully tracks the target throughout the sequence.
Fig. 3. Tracking comparison results of different algorithms on sequence body (#161, #163, #216, #240, #241). Three examples of CPF, MS and KBT are shown in the rows from top to bottom respectively.

Fig. 4. The tracking error plot for each sequence we tested on. The error is measured using the Euclidean distance of two center points, which has been normalized by the size of the target from the ground truth. Blue: CPF, Green: MS, Red: KBT.
For all the sequences, we manually labeled the ground truth bounding box of the target in each frame for quantitative evaluation. The error is measured using the Euclidean distance of two center points, which has been normalized by the size of the target from the ground truth. Fig. 4 illustrates the tracking error plot for each algorithm. From this figure we can see that although the other compared tracking approaches cannot track the blurred target well, our proposed KBT can track the blurred target robustly. The reason that KBT performs well is that KBT uses blur templates to model the underlying blurs. This improves the appearance representation in the presence of motion blurs.
5 Conclusion
We have presented a novel kernel-based tracker for tracking motion-blurred targets. KBT achieves this challenging tracking task without performing deblurring. Specifically, the target model is augmented by synthesizing various blurred templates of the target with different blur directions and speeds to model the underlying blurs. Each template is represented by a kernel-regularized color histogram. Then the mean shift procedure is adopted to perform the location optimization for each template. Finally, the optimized region with maximum similarity to its corresponding template is considered as the target. Experimental results on several challenging video sequences have shown that KBT can robustly track motion-blurred targets and outperforms other traditional trackers. Acknowledgment. This work is supported in part by NSF Grants IIS-0916624 and IIS-1049032. Wu is supported in part by the National Natural Science Foundation of China (Grant No. 61005027) and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
References 1. Badrinarayanan, V., P´erez, P., Clerc, F.L., Oisel, L.: Probabilistic Color and Adaptive Multi-Feature Tracking with Dynamically Switched Priority Between Cues. In: IEEE International Conference on Computer Vision, ICCV (2007) 2. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 25, 564–577 (2003) 3. Cai, J., Ji, H., Liu, C., Shen, Z.: Blind motion deblurring from a single image using sparse approximation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2009) 4. Dai, S., Yang, M., Wu, Y., Katsaggelos, A.: Tracking Motion-Blurred Targets in Video. In: IEEE International Conference on Image Processing, ICIP (2006) 5. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera shake from a single photograph. ACM T. on Graphics, SIGGRAPH (2006) 6. Isard, M., Blake, A.: Condensation-Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision (IJCV) 29, 5–28 (1998)
7. Jin, H., Favaro, P., Cipolla, R.: Visual Tracking in the Presence of Motion Blur. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2005) 8. Levin, A.: Blind motion deblurring using image statistics Advances. In: Advances in Neural Information Processing Systems, NIPS (2007) 9. Levin, A., Fergus, R., Durand, F., Freeman, W.: Image and depth from a conventional camera with a coded aperture. ACM T. on Graphics, SIGGRAPH (2007) 10. Lou, Y., Bertozzi, A., Soatto, S.: Direct Sparse Deblurring. Int’l. J. Math. Imaging and Vision (2010) 11. Mei, X., Ling, H., Wu, Y., Blasch, E., Bai, L.: Minimum Error Bounded Efficient 1 Tracker with Occlusion Detection. In: CVPR (2011) 12. P´erez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 661–675. Springer, Heidelberg (2002) 13. Wu, Y., Wu, B., Liu, J., Lu, H.Q.: Probabilistic Tracking on Riemannian Manifolds. In: IEEE International Conference on Pattern Recognition, ICPR (2008) 14. Wu, Y., Wang, J.Q., Lu, H.Q.: Robust Bayesian tracking on Riemannian manifolds via fragments-based representation. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP (2009) 15. Wu, Y., Cheng, J., Wang, J.Q., Lu, H.Q.: Real-time visual tracking via incremental covariance tensor learning. In: IEEE International Conference on Computer Vision, ICCV (2009) 16. Richardson, W.: Bayesian-Based Iterative Method of Image Restoration. Journal of the Optical Society of America (JOSA) 62, 55–59 (1972) 17. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38(4) (2006)
Robust Foreground Detection in Videos Using Adaptive Color Histogram Thresholding and Shadow Removal
Akintola Kolawole and Alireza Tavakkoli
University of Houston-Victoria
Abstract. Fundamental to advanced video processing such as object tracking, gait recognition and video indexing is the issue of robust background and foreground segmentation. Several methods have been explored for this purpose, but they are either time or memory consuming or not very efficient in segmentation. This paper proposes an accurate and fast foreground detection technique for object tracking in videos with quasi-stationary backgrounds. The background is modeled using a novel real-time kernel density estimation approach based on online histogram learning. It is noted that shadows are classified as part of the foreground pixels if further processing on the illumination conditions of the foreground regions is not performed. A morphological approach developed to remove shadows from the segmented foreground image is used. The main contribution of the proposed foreground detection approach is its low memory requirements, low processing time, suitability for parallel processing, and accurate segmentation. The technique has been tested on a variety of both indoor and outdoor sequences for segmentation of foreground and background. The data is structured in such a way that it can be processed using multi-core parallel processing architectures. Tests on dual- and quad-core processors demonstrated the two- and four-times speed-up factors achieved by distributing the system on parallel hardware architectures. A potential direction for the proposed approach is to investigate its performance on a CUDA-enabled Graphics Processing Unit (GPU), as parallel processing capabilities are built into our architecture.
1 Introduction and Literature Review
Advanced video processing requires an accurate segmentation of foreground objects. Therefore, detecting the regions of moving objects in scenes, such as cars and people, is the first basic requirement of every vision system. Due to the different nature of backgrounds (some static, some dynamic and some quasi-stationary), motion detection poses a difficult problem. A variety of algorithms have been proposed in the literature to suit scenarios in different environments. Several methods have been adopted for foreground detection of moving objects. A prominent category of such techniques is the class of parametric background subtraction algorithms; in these methods each background pixel is modeled using a single uni-modal probability density function. In this class we have the running
Gaussian average. The background model consists of the probabilistic modeling of each pixel value using a Gaussian probability density function (p.d.f.), characterized by its mean μ and variance σ. It is a recursive algorithm in which a Gaussian density function is fitted for each pixel [1]. Considering the dynamic nature of background pixels, the Mixture of Gaussians (MoG) is another parametric algorithm that has been used to model the background [2]. A single Gaussian density function for each pixel is not adequate to model backgrounds with non-stationary background objects, such as rain, waving objects, fluctuating lights, etc. The main idea of the algorithm is to be able to model the several background objects found in the background for each pixel. A mixture of n Gaussian density functions is used [3] in conjunction with K-Means clustering. However, the clustering technique and the number of clusters differ depending on the application. In this approach the pixel intensity distribution is analyzed by Gaussian Mixture Models. Based on the intensity of the background cluster and the foreground cluster, the Gaussian distribution is divided into two clusters by the K-Means clustering technique. The intensities in the cluster with the maximum membership are averaged and used as the background model. The foreground is extracted by using a background subtraction technique.
In [4], adaptive background subtraction in dynamic environments is achieved using fuzzy logic. This approach is a methodology for detaching moving objects from the background. To increase the certainty of moving object detachment and detection, [4] proposes an algorithm for object segmentation based on a fuzzy logic inference system. Since using the fuzzy inference system to detach moving objects erodes the object due to misclassification, morphological operations and neighborhood information are used to repair the missing parts.
Kernel density estimation (KDE) [5] is an example of non-parametric methods, proposed to solve the parameter selection problem of MoG and the other parametric approaches. However, in the presence of a dynamic scene, the background cannot be accurately modeled with a set of Gaussians. KDE overcomes the problem by estimating background probabilities at each pixel from many recent samples using kernel density estimation. The problem of the KDE approach, however, is the modeling of background and foreground separately, which requires additional overhead in processing time and memory requirements. KDE is used in a motion-based foreground detection mechanism in [6]. However, the calculation of optical flow may result in inaccurate object detection in the case of glare, shadows, and highlights. The temporal median filter [7] is another algorithm in which the background is estimated for each pixel as the median of all the recent values. The method has been reported to perform better than the running Gaussian, and a faster version of the algorithm is reported in [8].
In the rest of the paper, Section 2 presents the proposed approach for robust foreground detection in videos with quasi-stationary backgrounds. In Section 3 qualitative and quantitative experimental results are discussed. Section 4 concludes the paper and discusses potential future directions for this research.
2 Methodology
The main contribution of this paper is the introduction of a novel kernel-based density estimation with adaptive thresholds in a parallel processing architecture. The goal in background subtraction is to separate background areas of the image from foreground regions of motion that are of interest for advanced vision processing such as tracking. In this paper, we make the fundamental assumption that the background will remain stationary, with the possibility of inherent background changes, such as a waving flag or fluctuating lights. This assumption necessitates that the camera be fixed and that global lighting does not change suddenly. Segmenting moving objects in still-camera video frames is done in three stages in the proposed method. The first step is the histogram computation, followed by the threshold calculation phase, and finally the foreground segmentation.
2.1 Kernel Density Estimation
Kernel density estimation (KDE) is the most used and studied nonparametric density estimation method. The model is the reference dataset itself, containing the reference points indexed by natural numbers. In addition, assume a local kernel function centered upon each reference point, with a scale parameter (the bandwidth). Common choices for the kernel include the Gaussian and the Epanechnikov kernel [5]. The Gaussian kernel has the form
$$K_N(x) = (2\pi)^{-\frac{d}{2}} \exp\left(-\frac{1}{2}\|x\|^2\right), \quad (1)$$
while the Epanechnikov kernel is given by
$$K_E(x) = \begin{cases} \frac{1}{2} c_d^{-1} (d+2)\left(1 - \|x\|^2\right) & \text{if } \|x\| < 1 \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$
Let $x_1, x_2, \cdots, x_N$ be a random sample taken from a continuous, univariate density $f$. The kernel density estimator with the Epanechnikov kernel is given by
$$\hat{f}(x, h) = \frac{1}{N \times h} \sum_{i=1}^{N} K_E\!\left(\frac{x - x_i}{h}\right), \quad (3)$$
where $K_E$ is the function satisfying $\int K_E(x)\,dx = 1$. The function $K$ is referred to as the kernel and $h$ is a positive number, usually called the bandwidth or window width.
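A direct transcription of the estimator (3) with the Epanechnikov kernel for univariate data, as in the definition above, is shown below (Python/NumPy); for d = 1 the normalizing constant of the kernel is 3/4.

import numpy as np

def epanechnikov_1d(u):
    # 1-D Epanechnikov kernel, integrating to 1.
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kde(x, samples, h):
    # f_hat(x, h) = 1 / (N * h) * sum_i K_E((x - x_i) / h)
    u = (x - np.asarray(samples, dtype=float)) / h
    return epanechnikov_1d(u).sum() / (len(samples) * h)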
2.2 Histogram Computation
In this stage of the process a number of initial frames – N – in the video sequence (called learning frames) are used to build stable distributions of the pixel
RGB means. The RGB intensities of each pixel position are accumulated over the N frames, and the cumulative sum of the average intensities, i.e., (R + G + B)/3, is computed over the learning frames. Notice that the learning frames contain only the background of the video. A histogram of 256 bins is constructed using these pixel average intensities over the training frames. The histogram is then normalized to sum to 1.
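A minimal Python/NumPy sketch of this learning stage for all pixel positions at once is given below; the array shapes and names are our assumptions.

import numpy as np

def learn_pixel_histograms(learning_frames, bins=256):
    # learning_frames: list of H x W x 3 background-only frames.
    # Returns an H x W x bins array of normalized per-pixel histograms.
    frames = np.stack(learning_frames).astype(float)       # N x H x W x 3
    avg = frames.mean(axis=3)                               # (R + G + B) / 3
    idx = np.clip(avg.astype(int), 0, bins - 1)             # bin per observation
    n, h, w = idx.shape
    hist = np.zeros((h, w, bins))
    for f in range(n):                                      # accumulate over frames
        np.add.at(hist, (np.arange(h)[:, None], np.arange(w)[None, :], idx[f]), 1.0)
    return hist / n                                         # each histogram sums to 1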
Fig. 1. The histogram of a typical pixel location
Figure 1 shows a typical unimodal histogram calculated for a pixel in a dark area of the video frames. The x-axis in the graph is the bin intensity value and the y-axis indicates the probability that each histogram bin belongs to the background model.
2.3 Threshold Calculation
The threshold is a measure of the minimum portion of the data that should be accounted for by the background. For more accuracy in our segmentation, we use a different threshold for each histogram. The pseudo-code for the adaptive threshold calculation is given below:
1  For each H[i]
2    Get sum of H[i]
3    Peak[i] = max(H[i])
4    Pth[i] = Peak[i]/2
5    Calculate sum2(H[i] > Pth[i])
6    If sum2(H[i] > Pth[i]) is less than 0.95 of sum of H[i]
7      Pth[i] = Pth[i]/2
8      go to 5
9    else
10     threshold = Pth[i]
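Read literally, the loop keeps lowering the candidate threshold until the bins above it account for 95% of the histogram mass. A runnable Python version under that reading (treating H as the histogram of a single pixel, which is our interpretation of the pseudo-code) is:

import numpy as np

def adaptive_threshold(hist, coverage=0.95):
    total = hist.sum()
    pth = hist.max() / 2.0
    # Keep halving the candidate until the bins above it cover 95% of the mass.
    while hist[hist > pth].sum() < coverage * total and pth > 1e-12:
        pth /= 2.0
    return pth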
2.4 Foreground/Background Detection
For every pixel observation, classification involves determining if it belongs to the background or the foreground. The first few initial frames in the video sequence
(called learning frames) are used to build the histograms of distributions of the pixel means. No classification is done for these learning frames. Classification is done for subsequent frames using the process given below. Typically, in a video sequence involving moving objects, at a particular spatial pixel position a majority of the pixel observations would correspond to the background. Therefore, background clusters would typically account for many more observations than the foreground clusters [9]. This means that the probability of any background pixel would be higher than that of a foreground pixel. The pixels are ordered based on their corresponding histogram bin values. Based on the adaptive threshold calculated in Section 2.3, the pixel intensity values for the subsequent frames are observed. The corresponding histogram bin is located within the histogram and the bin value corresponding to this intensity is determined. The classification occurs according to the following condition. The pseudo-code for the thresholding:
program FGBG_DET (Vij, TH)
1 If(Vij |
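The listing above is cut off in the source; based on the description in the preceding paragraph, one plausible reading of the rule is sketched below in Python. This is our interpretation, not the authors' exact listing.

FOREGROUND, BACKGROUND = 1, 0

def fgbg_det(v_ij, hist, threshold):
    # v_ij: average intensity of the pixel in the current frame;
    # hist: that pixel's learned histogram; threshold: its adaptive threshold.
    k = min(int(v_ij), len(hist) - 1)        # bin corresponding to the observation
    # A bin value below the threshold means this color was rarely seen in the
    # background model, so the observation is classified as foreground.
    return FOREGROUND if hist[k] < threshold else BACKGROUND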
2.5 Shadow Removal Using Color-Based Detection
One of the major challenges in this area has been that shadows are segmented together with the foreground. Tracking is one of the areas where the presence of shadows can lead to significant drift.
Fig. 2. Distortion measurements in the RGB color space. (a) The background pixels Back and (b) the foreground pixels Fore in equation (4).
We adopted the shadow removal technique as postulated by [10]. Assuming that the irradiation consists only of white light, the chromaticity in a shadowed region should be the same as when it is directly illuminated [11]. Landabaso et al. in [10] indicate that "Important physical properties of the surface in color vision are surface spectral reflectance properties, which are invariant to changes of illumination, scene composition or geometry. On Lambertian, or perfectly matte surfaces, the perceived color is the product of illumination and surface spectral reflectance." Therefore, the background and foreground pixels can be decomposed into brightness and chromaticity components. Based on the fact that both brightness and chromaticity are very important, a good distortion measure between foreground and background pixels has to be decomposed into its brightness and chromaticity components as in [10]. Both measures are shown in Figure 2, as formulated in equation (4) [10]:
$$BD = \arg\min_{\alpha} \left(\overrightarrow{Fore} - \alpha\overrightarrow{Back}\right) \cdot \left(\overrightarrow{Fore} - \alpha\overrightarrow{Back}\right), \qquad CD = \left\|\overrightarrow{Fore} - \alpha\overrightarrow{Back}\right\|. \quad (4)$$
$\overrightarrow{Fore}$ denotes the RGB value of a pixel in the incoming frame which has been classified as foreground (i.e., Figure 2(a)), while $\overrightarrow{Back}$ is that of its background counterpart (i.e., Figure 2(b)). $\alpha$ represents the pixel's strength of brightness with respect to the expected value. The brightness distortion can be obtained by:
$$BD = \frac{\overrightarrow{Fore} \cdot \overrightarrow{Back}}{\left\|\overrightarrow{Back}\right\|^2}. \quad (5)$$
The following relationship holds between the $\alpha$ values and the pixel brightness in the current image with respect to the reference image:
$$\alpha \begin{cases} = 1 & \text{pixel brightness in the current image is the same as in the reference image} \\ < 1 & \text{pixel brightness in the current image is darker than in the reference image} \\ > 1 & \text{pixel brightness in the current image is brighter than in the reference image.} \end{cases} \quad (6)$$
Finally, a set of thresholds can be defined to assist the classification into foreground, highlighted or shadowed pixels. The pseudo-code for the shadow removal:
program Shadow_Removal ()
1 IF CD<10.0
2 IF 0.5
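This listing is also truncated in the source. A sketch of the brightness/chromaticity test it appears to describe, using equations (4)-(5) and the visible bounds CD < 10 and 0.5 < BD, is given below in Python/NumPy; the upper BD bound of 1.0 and the two-way labeling are our assumptions.

import numpy as np

def classify_shadow(fore_rgb, back_rgb, cd_max=10.0, bd_low=0.5, bd_high=1.0):
    fore = np.asarray(fore_rgb, dtype=float)
    back = np.asarray(back_rgb, dtype=float)
    bd = fore.dot(back) / (back.dot(back) + 1e-9)     # brightness distortion, eq. (5)
    cd = np.linalg.norm(fore - bd * back)             # chromaticity distortion, eq. (4)
    # Low chromaticity distortion with reduced brightness indicates a cast shadow.
    if cd < cd_max and bd_low < bd < bd_high:
        return "shadow"
    return "foreground"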
2.6 A Parallel Processing Architecture for Background Modeling
The proposed approach follows an architecture suitable for parallel processing. The process of creating pixel models and histograms designed in the proposed framework does not affect the models for other pixels. As a result the background model is not affected by processing pixel models out of order. Therefore, the proposed method has the potential to be employed by massive parallel processing units such as Graphics Processing Units (GPU)s. In order to investigate the performance of the proposed pixel model generation technique, it was tested in dual and quad processor architectures using the OpenMp API in C++. OpenMP allows the code suitable for parallel processing to be split into multiple threads running on multiple processors. In the case of dual and quad core processors, using OpenMP, we created two and four threads, respectively. Each thread employs the proposed background modeling technique for one pixel at a time until the pixel model is calculated. Once the pixel model is finalized the thread starts modeling another pixel.
Fig. 3. Pixel models being generated with two threads on a dual-core processor
Figure 3 shows a sketch of the architecture using two processors to generate pixel models. Notice that employing two physical processors doubles the speed of creating background pixel models. Similarly, with a quad core processor the speed will improve by a factor of four. The main reason that the proposed approach is easily suited with such multi-processor parallel processing architectures
is the fact that the model training mechanism proposed in this paper works independently on pixels regardless of their spatial locations in the frame. Such shared memory management is fully realizable with parallel processing schemes. One future direction of the proposed parallel processing mechanism is to employ it in a CUDA enabled GPU. In CUDA architecture threads are used in blocks of two or three dimensions. With this architecture and the specialized memory management designed by NVidia up to 30,000 physical processors could be achieved. This could easily result in processing 30,000 pixel models at once, achieving up to four orders of magnitude speed-up.
Fig. 4. The experiment shows the processing of input images in indoor activities: a) A sample background b) A sample frame with a moving object c) Foreground detected w/ noise and shadow d) Foreground after shadow removal
3 Experimental Results
The proposed technique has been tested on a variety of indoor video sequences. It has been used to track both fast and slow moving objects under different conditions and found to be very efficient. Figures 4 and 5 show examples of the application of the algorithm in indoor and outdoor scenes, respectively. One of the advantages of the proposed technique is its constant memory requirements, as well as training and testing times. Since the model for each pixel corresponds to the quantized histogram bins of the probability density function governing the color of the pixel, the model memory requirement is constant regardless of the number of training frames used to build the model. The training and detection times of our method were tested on videos with a frame size of 320x480 pixels. To calculate the training speed of the algorithm, 200 frames with the background and no foreground objects were used. The total training time of the original algorithm on one processor with a 1.5 GHz clock speed was about 40 ms per frame. When the algorithm was run on dual- and quad-core processors, the training time dropped to 20 and 10 ms, respectively. The foreground detection time was even faster than the training stage.
Fig. 5. The experiment shows the processing of input images in outdoor activities: a) A sample background b) A sample frame with a moving object c) Foreground detected w/ noise and shadow d) Foreground mask after shadow removal

Table 1. Training, shadow removal, and foreground detection speed of the proposed algorithm per frame

Algorithm stage          Speed per frame (millisecond)
Training phase           40
Shadow removal           0.2
Foreground detection     2
On average, the proposed technique reached a foreground detection time of about 2 ms per frame. The shadow detection and removal step runs in about 200 μs.
4 Conclusions and Future Work
This paper has presented an accurate and fast foreground segmentation approach for still-camera videos. Unlike existing real-time techniques that compromise on the quality of segmentation, the proposed method achieves high processing speed with no compromise in accuracy. The high sensitivity is achieved using adaptive thresholding. One of the contributions of this approach is that the data is structured in such a way that parallel processing can be done using recent multi-core systems. This makes the system faster, since the histogram computations can be done in parallel, making it very fast and useful in real-time environments. In the future we aim to implement the proposed technique on CUDA, NVIDIA's parallel computing architecture, which enables dramatic increases in computing performance by harnessing the power of GPUs. So far, the system has been tested on a single-core system and observed to be very fast.
References 1. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: real-time tracking of the human body. In: Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition. IEEE Computer Society, Los Alamitos (1996) 2. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. Pattern Analysis and Machine Intelligence 22, 747–757 (2000) 3. Charoenpong, T., Supasuteekul, A., Nuthong, C.: Background and foreground segmentation from sequence images by using mixture of gaussian method and k-means clustering. In: The 8th PSU Engineering Conference, pp. 400–403 (2010) 4. Sivabalakrishnan, M., Manjula, D.: Adaptive background subtraction in dynamic environments using fuzzy logic. International Journal on Computer Science and Engineering 2, 270–273 (2010) 5. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90, 1151–1163 (2002) 6. Mittal, A., Paragios, N.: Motion-based background subtraction using adaptive kernel density estimation. In: Computer Vision and Pattern Recognition, pp. 302–309 (2004) 7. Miyoshi, M., Tan, J., Ishikawa, S.: Extracting moving objects from a video by sequential background detection employing a local correlation map. In: IEEE International Conference on Systems, Man and Cybernetics, pp. 3365–3369 (2008) 8. Hung, M.H., Pan, J.S., Hsieh, C.H.: Speed up temporal median filter for background subtraction. In: 2010 First International Conference on Pervasive Computing Signal Processing and Applications PCSPA, pp. 297–300 (2010) 9. Jaikumar, M., Singh, A., Mitra, S.: Background subtraction in videos using bayesian learning with motion information. In: Theoretical Aspects of Computer Software (2008) 10. Landabaso, J.L., Pardas, M., Xu, L.Q.: Shadow removal with morphological reconstruction. In: Proceedings of the Jornades de Recerca en Automatica, Visio i Robotica (AVR), Barcelona, Spain (2004) 11. Xu, L.Q., Landabaso, J.L., Pard` as, M.: Shadow removal with blob-based morphological reconstruction for error correction. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE Computer Society, Philadelphia (2005)
Deformable Object Shape Refinement and Tracking Using Graph Cuts and Support Vector Machines Mehmet Kemal Kocamaz, Yan Lu, and Christopher Rasmussen Department of Computer and Information Sciences University of Delaware, Newark, DE, U.S.A. {kocamaz,yanlu,cer}@udel.edu
Abstract. This paper describes several approaches to the problem of obtaining a refined segmentation of an object given a coarse initial segmentation of it. One line of investigation modifies the standard graph cut method by incorporating color and shape distance terms, adaptively weighted at run time to try to favor the most informative cue given visual conditions. We also discuss a machine learning approach based on support vector machines which uses color and spatial features as well. Furthermore, we extend these single-frame refinement methods to serve as the basis of trackers which work for a variety of object types with complex, deformable shapes. Comparative results are presented for several diverse datasets including objects such as trail regions used for robot navigation, hands, and faces.
1 Introduction
In this paper, we describe a form of shape estimation in which an initial, rough shape is refined automatically, without user interaction. The contribution of this work is a comparison of several algorithms for this purpose based on graph cuts [1] and graph cuts with distance maps [2], as well as support vector machines (SVMs) [3], and a demonstration of their efficacy on several diverse datasets. These techniques work on individual images for which rough object segmentations are available, but we have also extended them so that they may run in a self-sustaining fashion over a video sequence given an initial indication of the object location. An early motivation for this work came from a robotic application, trail following. Trails are man-made or natural visual features such as roads, hiking tracks, above-ground pipelines, rivers, and powerlines which can be navigationally exploited by unmanned vehicles. While the trail following problem may be considered as a form of road following [4–6], several factors often make this task harder, including indistinct borders, illumination changes, abrupt elevation changes, texture variety, and dead ends and forks. In [7, 8], a trail tracking algorithm was presented which takes into account primarily the color contrast difference between the trail and neighboring image
Fig. 1. (a) Sample trail image; (b) Coarse ground truth overlaid; (c) Detailed ground truth
regions. While the method works well in many instances, one shortcoming is that it maintains a low-dimensional representation of the trail shape. In [7] the trail was represented strictly in the image domain as a triangle (to account for perspective), and in [8] the trail was represented as a circular arc in vehicle coordinates, projected to an omnidirectional image. These representations are necessarily approximate, and thus can miss important border details and possible in-trail obstacles (see Fig. 1 for a sample of the difference between a coarse and a detailed trail segmentation). Although the focus in this paper is on trail image data most relevant to mobile robot applications, we believe that the problem of automatically refining segmentations is a general one, and we will present results for images and videos of other kinds of objects with coarse segmentations. Graph cuts [1] are a popular and powerful computer vision technique which offers a potential solution to the problem of taking an approximate or coarse object segmentation and refining it to obtain more detailed and accurate object boundaries. User interactions are often required for best results [9], but automatically obtained coarse regions have been successfully used to seed the foreground/background regions needed for graph cut [10]. Work in this area has included using motion detection [11] and superpixel-based saliency measures [12] to obtain seed regions. Graph cut approaches for tracking objects have been described in [2, 13, 14]. In this paper, we describe a novel graph cut-based algorithm to refine an object's estimated borders when an initial coarse estimate is given. The standard graph cut method [1] uses only intensity information in its formulation. Intensity alone is often not enough to segment or track objects with diverse appearance and shape characteristics in a large range of images. Therefore, we made several changes to the standard graph cut method to increase the accuracy of the refinement task. First, an adaptive color space selection mechanism is employed in the algorithm. Second, color information in the image is clustered by k-means, and the knowledge gathered from the clusters is included in the formulation of the region and boundary terms of the graph cut. Third, spatial distance information is incorporated into the algorithm to constrain the final segmentation to stay around the region of interest; the influence of the distance penalty is set adaptively by the algorithm. For comparison, we also looked at applying SVMs to the same problem to learn a model of the object using both appearance and spatial features, and
then to classify an image into object and non-object pixels based on the learned model. SVM is an effective supervised learning method that has been widely used for data analysis, pattern recognition, and classification. Finally, we extend these single-image refinement methods to work as trackers over video sequences of the object moving and deforming. The graph cut tracking algorithm which uses only color information is called Color GC-Tracker, the one which also includes distance penalties is called ColorD GC-Tracker, and the SVM-based tracker is called SVM-Tracker. The rest of the paper is organized as follows. Section 2 describes the modifications we made to the standard graph cut method and the training and classification stages of an SVM to segment object borders in a single image when a coarse estimate is given. Section 3 describes how these approaches can easily be extended to build tracking algorithms. Section 4 explains the experiments, shows some results of the algorithms, and compares their accuracies with ground-truth data. Finally, in Section 5, we summarize the methods and their results.
2 Single Frame Image Segmentation
This section describes how an object region is segmented in a single image by our graph cut-based and support vector machine-based algorithms. Both techniques require background and foreground region information in the image, MB and MF, respectively. MF is obtained by scaling down the initial estimated shape of the object, Mt. MB is obtained by scaling up Mt and taking all the pixels outside of that region. Mt can come from a manually segmented ground truth of the object, from previous frames in the tracking procedures, or from any other algorithm. Figure 2(b) shows MB and MF.

2.1 Graph Cut Segmentation
In this subsection, we describe the changes made to the standard graph cut segmentation algorithm proposed in [1, 15] to make it more suitable for object segmentation and tracking. An image segmentation task is converted to a foreground/background labeling problem in the graph cut method. The following energy function is minimized by graph cut to assign a foreground or background label to each pixel in the image:

E(L) = \sum_{i=0}^{n} R_i(L_i) + \lambda \sum_{\{i,j\} \in N} B(L_i, L_j)   (1)

where L is the segmentation labeling set, R_i(L_i) is called the regional term, B(L_i, L_j) the boundary term, and N is the set of neighboring pixel pairs. \lambda \geq 0 is a weight that sets the relative influence of the boundary term versus the regional term. In this paper, 'object' refers to the foreground.
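For two labels, an energy of the form of Eq. (1) can be minimized exactly with a single max-flow computation. The following is a minimal sketch of that step, assuming the third-party PyMaxflow library; the regional penalties, boundary weights, and function names are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch of minimizing Eq. (1) with a grid graph cut (assumes PyMaxflow).
import numpy as np
import maxflow

def graph_cut_segment(R_obj, R_bkg, B_weight, lam=1.0):
    """R_obj / R_bkg: per-pixel regional penalties for labeling a pixel
    'object' / 'background'; B_weight: per-pixel boundary weight (a stand-in
    for B(Li, Lj)); lam: the lambda of Eq. (1)."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(R_obj.shape)
    # n-links between neighbouring pixels, weighted by lambda
    g.add_grid_edges(nodes, weights=lam * B_weight, symmetric=True)
    # t-links carrying the regional terms
    g.add_grid_tedges(nodes, R_obj, R_bkg)
    g.maxflow()
    # Boolean mask: which side of the cut each pixel ends up on
    # (the object/background orientation depends on the chosen convention).
    return g.get_grid_segments(nodes)
```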
Fig. 2. (a) The image overlaid with its ground-truth polygon; (b) its obtained background model (red pixels) and foreground model (green pixels) for the graph cut and SVM algorithms; (c) its distance penalty map for the graph cut algorithm
Automatic Color Space Selection. The standard graph cut algorithm requires feature information from the background and foreground regions in the image. Choosing these models to be as informative and distinctive as possible helps graph cut produce better segmentation results. Our image sets contain strong illumination changes, and sticking to only one color space such as RGB does not always produce good results; we found that in some cases using the CIE-LAB color space gives better results. Therefore, we developed a mechanism to switch among color spaces at run time to improve accuracy. Four candidate feature spaces are considered: RGB; LAB, which uses the three channels of CIE-LAB; AB, which uses only the chromaticity information; and L, which uses only the brightness of CIE-LAB. To achieve self-adaptiveness among the feature spaces, our method collects feature statistics from the inside and outside of the object region at run time. To do this, we cluster all the pixels in the image into k different labels with the k-means algorithm, with the number of clusters chosen as k = 12. K-means is performed separately in the RGB, LAB, AB and L feature spaces, yielding four different cluster labelings of the image. The object region color distribution, MF, is modeled by a histogram h = (f_1, ..., f_k) of the label frequencies inside it; the background region model, MB, is formed in the same way. This captures the multi-modal color distributions inside and outside the object. We measure the appearance dissimilarity d between MB and MF with the chi-squared histogram distance \chi^2(h_i, h_j). This measurement is done for each feature space, giving separate dissimilarity values d_RGB, d_LAB, d_AB and d_L. The feature space that provides the highest dissimilarity is taken as the working color space of the graph cut. In this way, more informative and distinctive foreground and background models are provided to the graph cut. The regional and boundary terms of the graph cut are changed as follows:

R_i(\text{"Object"}) = -\ln\left( \frac{P_r(l_i \mid M_F)}{P_r(l_i \mid M_F) + P_r(l_i \mid M_B)} \right)   (2)

R_i(\text{"Background"}) = -\ln\left( \frac{P_r(l_i \mid M_B)}{P_r(l_i \mid M_F) + P_r(l_i \mid M_B)} \right)   (3)

B_{(i,j)} = \left| R_i(\text{"Object"}) - R_i(\text{"Background"}) \right|   (4)

where i and j are two neighboring pixels, l_i is the k-means label of pixel i, and M_F and M_B are the label-frequency histograms of the object and background regions in the image, respectively.
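A sketch of this selection step, under our reading of the procedure above; scikit-learn and scikit-image are our library choices, and the helper names are illustrative, not the paper's.

```python
# Adaptive color-space selection: cluster pixels with k-means in each candidate
# feature space, build label histograms over the MF/MB masks, and keep the
# space with the largest chi-squared dissimilarity.
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab

def chi2(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def label_hist(labels, mask, k):
    h = np.bincount(labels[mask], minlength=k).astype(float)
    return h / max(h.sum(), 1.0)

def select_color_space(rgb, fg_mask, bg_mask, k=12):
    """rgb: float RGB image in [0, 1]; fg_mask/bg_mask: boolean MF/MB masks."""
    lab = rgb2lab(rgb)
    spaces = {"RGB": rgb.reshape(-1, 3),
              "LAB": lab.reshape(-1, 3),
              "AB":  lab[..., 1:].reshape(-1, 2),
              "L":   lab[..., :1].reshape(-1, 1)}
    best_name, best_d, best_labels = None, -1.0, None
    for name, feats in spaces.items():
        labels = KMeans(n_clusters=k, n_init=4).fit_predict(feats)
        labels = labels.reshape(rgb.shape[:2])
        d = chi2(label_hist(labels, fg_mask, k), label_hist(labels, bg_mask, k))
        if d > best_d:
            best_name, best_d, best_labels = name, d, labels
    return best_name, best_d, best_labels
```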
Distance Penalty Map Construction and Weighting. The graph-cut method produces a global segmentation and tends to catch unintended regions that are similar to the desired object. However, when tracking an object, the pixels labeled as object in the current frame will most likely be labeled as object again in the next frame. To incorporate this bias, we penalize pixels that are far away from the object region of the last frame. This penalty is added to the standard graph cut by constructing a distance penalty map as in [2]. However, we weight the distance penalty differently: instead of guessing the location of the object in the next frame as described in [2], the dissimilarity d between the object and background regions is chosen as the criterion to weight the distance penalty adaptively. In our weighting technique, if the dissimilarity d between the regions is high, graph cut is forced more strongly to stay around the object location segmented in the last frame. Specifically, if M_F and M_B are similar to each other, we do not want graph cut to move far from the last position of the object in search of more pixels to add to the object region. The distance map of the image is constructed as

Map_{dist}(i) = \| i - closestTo_i \|,

where closestTo_i is the object pixel closest to pixel i in the image space and \| i - closestTo_i \| is the distance between them. Figure 2(c) shows the distance penalty map of a given image. The distance penalty is added to the regional term of the standard graph cut algorithm in the following way:

R_i(\text{"Object"}) = P_r(P_i \mid O) + \beta\, \alpha(d_{cs})\, Map_{dist}(i)   (5)

where P_r(P_i \mid O) is the penalty of adding pixel i to the object region, \beta is a constant term that sets the relative influence, \alpha(\cdot) is the negative log-likelihood function used to weight the distance penalty adaptively, and d_{cs} is the smallest dissimilarity value returned by the chi-squared metric among all feature color spaces considered in the algorithm.
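A small sketch of the map construction, assuming SciPy's Euclidean distance transform; the exact form of α(·) and the constant β are placeholders here, not values from the paper.

```python
# Distance penalty map: distance from each pixel to the closest object pixel
# of the previous frame, weighted adaptively by the dissimilarity d_cs.
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_penalty(prev_object_mask, d_cs, beta=1.0, eps=1e-10):
    # Passing the inverted mask gives, for every non-object pixel, the
    # Euclidean distance to the nearest object pixel (object pixels get 0).
    map_dist = distance_transform_edt(~prev_object_mask)
    alpha = -np.log(d_cs + eps)      # illustrative adaptive weight alpha(d_cs)
    return beta * alpha * map_dist   # term added to the 'object' regional cost
```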
2.2 SVM Segmentation
In this paper, we also apply an SVM to object segmentation as an alternative to the graph cut method. Our goal is to classify whether a pixel in the image belongs to the object region or not, using an object model built by the SVM. In this two-class classification problem, we define object pixels as positive examples and background pixels as negative examples. An object model is learned from selected
color and geometric features using a radial basis SVM kernel. The features of a pixel are a subset of (L, a, b, x), where (L, a, b) are the Lab color space components and x is the horizontal position of the pixel in image coordinates. As a supervised learning algorithm, the SVM needs training images with labeled positive and negative examples in order to build an object model. For single image segmentation, such training images can be obtained from the ground truth; for the tracking case, training images for the current frame can be the tracking results of the previous frames, as explained in Section 3.2. We do not include pixels in the training images that are close to the boundary between the negative and positive classes, because the classification of those pixels is marginal. At the classification stage, we use n-fold cross validation to enhance the accuracy. The largest connected component of all positive examples is then treated as the object region.
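A minimal sketch of this pixel classifier and the connected-component post-processing, assuming scikit-learn, scikit-image and SciPy; the helper names and the balanced class weighting are our choices, not the paper's.

```python
# RBF-SVM pixel classifier on (L, a, b, x) features, keeping the largest
# positively classified connected component as the object region.
import numpy as np
from sklearn.svm import SVC
from skimage.color import rgb2lab
from scipy.ndimage import label

def pixel_features(rgb):
    """rgb: float RGB image in [0, 1]; returns one (L, a, b, x) row per pixel."""
    lab = rgb2lab(rgb)
    h, w = rgb.shape[:2]
    x = np.tile(np.arange(w, dtype=float), (h, 1))     # horizontal position
    return np.dstack([lab, x[..., None]]).reshape(-1, 4)

def train_pixel_svm(rgb, fg_mask, bg_mask):
    feats = pixel_features(rgb)
    idx = (fg_mask | bg_mask).ravel()
    clf = SVC(kernel="rbf", class_weight="balanced")
    clf.fit(feats[idx], fg_mask.ravel()[idx].astype(int))
    return clf

def classify_object_region(clf, rgb):
    pred = clf.predict(pixel_features(rgb)).reshape(rgb.shape[:2]).astype(bool)
    comps, n = label(pred)
    if n == 0:
        return pred
    sizes = np.bincount(comps.ravel())[1:]
    return comps == (1 + np.argmax(sizes))             # largest positive component
```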
3 The Trackers
This section explains the details of our two proposed algorithms to track the objects. Both trackers are initialized with the ground truth positions of the object in the first frame.

3.1 Graph Cut Tracker
ColorD GC-Tracker is a single-frame tracker which passes the color and position information of the object from the last frame to the next frame. It applies the graph cut segmentation algorithm, with the changes explained in Section 2, in each frame of the tracking process. The desired object region is retrieved as a mask from the graph cut. However, the raw foreground mask generated by the graph cut technique might contain noisy, small, weakly-connected foreground regions, because our images contain a non-homogeneous color distribution inside the foreground regions. In order to clean up those regions, we first perform morphological opening and closing and then find the connected components. The largest region is taken as the final object region.
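A compact sketch of this clean-up step with SciPy's binary morphology; the 5x5 structuring element is an assumption, since the paper does not specify one.

```python
# Mask clean-up: morphological opening/closing, then keep the largest component.
import numpy as np
from scipy.ndimage import binary_opening, binary_closing, label

def clean_foreground_mask(mask, structure=np.ones((5, 5), bool)):
    mask = binary_closing(binary_opening(mask, structure), structure)
    comps, n = label(mask)
    if n == 0:
        return mask
    sizes = np.bincount(comps.ravel())[1:]
    return comps == (1 + np.argmax(sizes))
```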
3.2 Support Vector Machine Tracker
Similar to how we extend the graph cut segmentation algorithm for tracking, the SVM object segmentation can also be extended into SVM-Tracker. After initialization, the object model is updated based on the classification results from the previous n frames by adding the newly predicted positive and negative examples and discarding the oldest examples, in a sliding-window scheme. However, the training data for the SVM are no longer ground truth but the classified results from the previous n frames. As mentioned in Section 2.2, we exclude boundary pixels of the two classes when training.
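A sketch of this sliding-window update; the window length n, the balanced class weighting, and the feature extraction follow the assumptions of the earlier SVM sketch rather than the paper's exact settings.

```python
# Sliding-window SVM model update: keep the classified pixels of the last n
# frames as training data and refit the classifier.
import numpy as np
from collections import deque
from sklearn.svm import SVC

class SVMTrackerModel:
    def __init__(self, n=1):
        self.window = deque(maxlen=n)   # (features, labels) of the last n frames
        self.clf = None

    def update(self, feats, labels):
        # Skip degenerate frames where every pixel collapsed into one class.
        if labels.min() == labels.max():
            return self.clf
        self.window.append((feats, labels))
        X = np.vstack([f for f, _ in self.window])
        y = np.concatenate([l for _, l in self.window])
        self.clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
        return self.clf
```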
4 Results
We experimented with three different datasets. The Trail dataset from [8] consists of 17,358 frames of video taken along a hiking/mountain-biking trail in a mixture of field and forested terrain. The Head dataset is a short 383-frame clip from a standard video compression benchmark in which a person's head bobs around in the back of a car. Finally, Hand is a 5-minute (5,616-frame) video recorded outside our lab of a hand waving and gesturing in front of a complex background. We manually generated ground-truth object segmentations for about 5-10% of each dataset's frames at regular intervals. We use the following area overlap measure suggested by [16] to quantify the agreement between the ground-truth segmentations and our results: Overlap(R_1, R_2) = A(R_1 \cap R_2)^2 / (A(R_1) A(R_2)), where R_1 and R_2 are the two regions being compared and A(\cdot) denotes area. The SVM classifier for this work is implemented using LIBSVM [17] with 10-fold cross validation.
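Written out directly, the overlap score above is the small helper below (added for illustration only).

```python
# Area-overlap score between two binary region masks.
import numpy as np

def overlap(r1, r2):
    """r1, r2: boolean masks. Overlap = A(R1 n R2)^2 / (A(R1) * A(R2))."""
    inter = np.logical_and(r1, r2).sum(dtype=float)
    a1, a2 = r1.sum(dtype=float), r2.sum(dtype=float)
    return inter ** 2 / (a1 * a2) if a1 and a2 else 0.0
```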
4.1 Single Frame Image Segmentation Experiments
In this experiment, we evaluated the accuracy of the algorithms when a single image is given and the object borders must be segmented in it. We performed this experiment on the images for which we have ground truth. In order to see the benefit of adding the distance penalty to the graph cut, we also removed the distance information from the graph cut method and applied the graph cut with only the changes described in Section 2.1. All methods are initialized with the ground-truth segmentations, so Mt is set to the ground truth. Table 1 summarizes the median overlap scores between each method and the ground-truth segmentations for each dataset, and some results of this experiment are shown in Figure 3. As expected, adding the color information to the standard graph cut helps to increase the accuracy of the segmentation. Incorporating the distance penalty and weighting it according to the dissimilarity between MF and MB improves the performance further. For SVM segmentation, we subsampled positive and negative examples from the training images for efficiency, taking 1 out of every 8 pixels on a regular grid.

Table 1. Median overlap scores of the single frame image segmentation results for the data sets

Method Name        Trail   Hand   Head
Standard GC [1]    0.55    0.79   0.52
Color GC           0.69    0.95   0.84
ColorD GC          0.79    0.97   0.89
SVM                0.76    0.95   0.86
Fig. 3. Sample results for single frame refinement on the trail dataset (columns: ground-truth, standard graph cut, Color GC, ColorD GC, SVM)
4.2 Measuring the Accuracies of the Trackers

In this experiment, the accuracies of the trackers described in Section 3 are measured and analyzed. Each tracker is started from the ground truth in the first frame and run until the next frame for which ground truth is available; the object mask generated in that last frame is saved. A total of 502 results, the number of images in the ground-truth sets, were produced in this way for the Trail, Hand, and Head datasets. Color GC-Tracker does not use the distance penalty; it is the method described in Section 2.1 without the distance penalty. Table 2 shows the median overlap scores of this experiment, and some of the final segmented trail, hand, and head regions at the end of the sequences are shown in Figures 3 and 4. As can be seen in Table 2, the overall accuracy of ColorD GC-Tracker is comparable with SVM-Tracker, and ColorD GC-Tracker performs better than SVM-Tracker in this test. The large overlap score difference between SVM-Tracker and ColorD GC-Tracker on the Head set is caused by similarly colored background and foreground regions in the images; the distance penalty information incorporated in ColorD GC-Tracker helped it track better. For the SVM tracker, we use the same feature vector as for single image segmentation. We assign different weights to positive and negative examples during the training stage based on the ratio between the numbers of positive and negative examples. Such an adaptive weight assignment scheme helps to stabilize the SVM tracker by balancing the numbers of examples of both classes during training. The SVM tracker might lose track when all pixels in the image are classified as positive or negative by the model trained from the detection results of the previous frames. This happens due to abrupt illumination changes, such as entering a shadow region or a sudden camera exposure change. Therefore, we do not update the object model when all pixels are classified into one class, and the SVM tracker can pick the object region back up after several frames once the illumination is stable. The SVM results in Table 2 are obtained with n = 1, as mentioned in Section 3.2.
Fig. 4. Sample tracking results on all three data sets (the trail first column has ground truth overlaid); columns: Color GC, ColorD GC, SVM

Table 2. Median overlap scores of the trackers for the data sets. Since CC-Tracker is specific to tracking trails, it does not have overlap scores for the Hand and Head data sets.

Method Name          Trail   Hand   Head
CC-Tracker [8]       0.72    —      —
Color GC-Tracker     0.46    0.92   0.27
ColorD GC-Tracker    0.77    0.96   0.82
SVM-Tracker          0.74    0.95   0.68

5 Conclusion
We described two algorithms to segment object borders in a single image when an estimated position of the object is provided, and their extensions to tracking algorithms. They are built separately on two well-known methods, the graph cut segmentation technique and support vector machines. In order to improve the accuracy of the standard graph cut method, an adaptive color space selection mechanism, a distance penalty, and an adaptive weighting of that penalty are employed. In addition, the image is clustered by k-means and its cluster labels are used in the formulation of the graph cut terms. Our SVM-based tracker is trained and tested using features selected from the color and spatial information of the object. We compared and analyzed these methods against a previously published color contrast trail tracker and the standard graph cut method. As a future direction of this research, SVM and graph cut can be combined in a single tracker to exploit the advantages of both methods. The SVM provides the opportunity to maintain the color and structure information over the
image sequences. Graph cut has the advantage of encompassing the connectivity information of neighboring pixels in the image. Maintaining the background and foreground models with the SVM and using these models in the graph cut segmentation might improve the accuracy and robustness of the tracker. An object location prediction mechanism could also be added to the trackers; in this way, better object and background models could be obtained to feed the algorithms.
References

1. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. Int. Journal of Computer Vision 70, 109–131 (2006)
2. Malcolm, J., Rathi, Y., Tannenbaum, A.: Tracking through clutter using graph cuts. In: British Machine Vision Conf., BMVC (2007)
3. Burges, C.: A tutorial on support vector machines for pattern recognition. In: Data Mining and Knowledge Discovery, pp. 121–167 (1998)
4. Taylor, C., Malik, J., Weber, J.: A real-time approach to stereopsis and lane-finding. In: Proc. IEEE Intelligent Vehicles Symposium (1996)
5. Southall, B., Taylor, C.: Stochastic road shape estimation. In: Proc. Int. Conf. Computer Vision, pp. 205–212 (2001)
6. Huang, A., Moore, D., Antone, M., Olson, E., Teller, S.: Multi-sensor lane finding in urban road networks. In: Robotics: Science and Systems (2008)
7. Rasmussen, C., Lu, Y., Kocamaz, M.: Appearance contrast for fast, robust trail-following. In: Proc. Int. Conf. Intelligent Robots and Systems (2009)
8. Rasmussen, C., Lu, Y., Kocamaz, M.: Trail following with omnidirectional vision. In: Proc. Int. Conf. Intelligent Robots and Systems (2010)
9. Rother, C., Kolmogorov, V., Blake, A.: GrabCut - interactive foreground extraction using iterated graph cuts. In: SIGGRAPH (2004)
10. Kocamaz, M., Rasmussen, C.: Automatic refinement of foreground regions for robot trail following. In: Proc. Int. Conf. Pattern Recognition (2010)
11. Dinh, T., Medioni, G.: Two-frames accurate motion segmentation using tensor voting and graph-cuts. In: IEEE Workshop on Motion and Video Computing (2008)
12. Mehrani, P., Veksler, O.: Saliency segmentation based on learning and graph cut refinement. In: Proc. British Machine Vision Conference (2010)
13. Nelson, A., Neubert, J.: Object tracking via graph cuts. In: SPIE Applications of Digital Image Processing (2009)
14. Papadakis, N., Bugeau, A.: Tracking with occlusions via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence 33, 144–157 (2011)
15. Boykov, Y., Jolly, M.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. Int. Conf. Computer Vision (2001)
16. Sclaroff, S., Liu, L.: Deformable shape detection and description via model-based region grouping. IEEE Trans. Pattern Analysis and Machine Intelligence 23 (2001)
17. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
A Non-intrusive Method for Copy-Move Forgery Detection Najah Muhammad1, Muhammad Hussain2, Ghulam Muhamad2, and George Bebis2,3 1
CCIS, Prince Norah Bint Abdul Rahman University 2 CCIS, King Saud University, Saudi Arabia 3 CSE, University of Nevada, Reno, USA [email protected], [email protected], [email protected]
Abstract. The issue of verifying the authenticity and integrity of digital images is becoming increasingly important. Copy-move forgery is one type of image tampering that is commonly used for manipulating digital content; in this case, some part of an image is copied and pasted onto another region of the image. Using a non-intrusive approach to solve this problem is attractive because it does not need any embedded information, but such approaches are still far from satisfactory. In this paper, an efficient non-intrusive method for copy-move forgery detection is presented. The method is based on image segmentation and a new denoising algorithm. First, the image is segmented using a multi-scale segmentation algorithm. Then, using the noise pattern of each segment, a separate noise image is created. The noise images are used to estimate the overall noise of the image, which is further used to re-estimate the noise pattern of the different segments. Image segments with similar noise histograms are detected as tampered. A comparison with a state-of-the-art non-intrusive algorithm shows that the proposed method performs better.
1 Introduction

Due to recent advances in imaging technologies, it has become very easy to preserve any event in the form of a digital image, and this digital pictorial information is being used widely for multiple purposes. On the other hand, due to the development of sophisticated editing software, even a novice can tamper with digital content with ease. The authenticity of images cannot be taken for granted, and the issue of verifying the authenticity and integrity of digital content is becoming increasingly important. This motivates the need for techniques that can be used to validate the authenticity of digital content. The existing techniques for forgery detection can be classified into two main categories: intrusive and non-intrusive. Intrusive techniques require that some sort of digital signature be embedded in the image at the time of its creation. Their scope is therefore limited, because not all digital devices have the capability of embedding a digital signature at the time of capturing an image. On the other hand, non-intrusive approaches do not require embedding any information. Though the non-intrusive approach is attractive, and some work has been done in this direction, research on this approach is still in its infancy, and more effort is required to propose stable solutions.
Copy-move forgery is one type of tampering that is commonly used for manipulating digital content; in this case, a part of an image is copied and pasted onto another region of the same image. In this paper, the focus is on detecting copy-move forgery. The task of tamper detection becomes more difficult with copy-move forgery because the copied region has the same characteristics as the rest of the image, such as its noise component, color palette and dynamic range. This means that detection methods which search for tampered image regions using inconsistencies in statistical measures will fail. There are a number of methods that provide solutions for copy-move forgery detection. Each of these methods provides a solution under a set of conditions or assumptions, and fails if its assumptions are not met [3, 9, 13, 14]. In this paper, we present preliminary results on a new algorithm for copy-move forgery detection. Our solution is based on the idea that copied and pasted regions must have the same noise pattern. The proposed solution depends on image segmentation and noise estimation for each segment. The noise patterns of the image segments are then compared to identify forgery; image segments with similar noise patterns are detected as tampered. The proposed method outperforms state-of-the-art methods. The rest of the paper is organized as follows. The next section discusses published work related to copy-move forgery detection. In Section 3, the proposed method is explained. Section 4 contains the experimental results for the proposed method. In Section 5, we discuss our results, and Section 6 concludes the paper.
2 Related Work

This section gives an overview of non-intrusive methods dealing with copy-move forgery. The most commonly used non-intrusive approach for copy-move forgery detection is based on block matching. In block-based methods, an image is partitioned into equal-sized blocks, and tampering is detected using feature similarities between image blocks. The features of each block are extracted to form a feature vector. The feature vectors are then sorted so that similar vectors are grouped together, and neighboring information is analyzed; a similarity threshold is set based on experiments. Similar feature vectors indicate that their corresponding image blocks are copies of each other. In [3], a detection method based on matching the quantized, lexicographically sorted discrete cosine transform (DCT) coefficients of overlapping image blocks has been proposed. Experimental results show reliable decisions when retouching operations are applied; however, the authors do not report robustness tests. Another method, which is invariant to blur degradation, contrast changes and additive Gaussian noise, is presented in [11]. Features of the image blocks are represented using blur moment invariants. The experimental results show that the algorithm performs well with blurring filters and with lossy JPEG compression quality down to 70. However, like other similar methods, this algorithm may falsely label unmatched areas as matched; this problem arises in the case of uniform regions such as sky. Another disadvantage of this algorithm is its computational time: the average running time with a block size of 20, a similarity threshold of 0.97 and a 640×480 RGB image, using a 2.1 GHz processor and 512 MB RAM, is 40 minutes.
In [6], Singular Value Decomposition is used to obtain singular-value feature vectors for block representation. The feature vectors are sorted lexicographically. The experimental results show that the algorithm is comprehensive: it performs well even in images with uniform areas such as sky and ocean. The running time on one color channel of a 256×256 image, with a block size of 20 and running on a 1.8 GHz processor with 256 MB RAM, is approximately 120 seconds. In [4], SIFT features have been used. The experimental results show that about 38 matches can be reached if the threshold on the Euclidean distance between matched descriptor vectors is set to 0.45. The algorithm proposed in [16] is based on pixel matching to detect tampering. The approach uses the Discrete Wavelet Transform (DWT) to get a reduced data representation, and phase correlation is used to compute the spatial offset between the copied and pasted regions in the image. The work in [1] uses a feature representation that is invariant not only to noise addition or blurring, but also to several geometric transformations such as scaling and rotation that may be applied to the copied region before pasting. These properties are achieved using a Fourier-Mellin Transform (FMT) feature representation. A counting bloom filter [1] is used instead of lexicographic sorting to improve time complexity. Their experimental results showed that the proposed representation is more robust to JPEG compression; furthermore, it can deal with rotations up to 10°. The method proposed in [8] divides each image into overlapping blocks of equal size and represents each block with nine features, which are normalized to integers in the range [0, 255]. A counting sort [5] is then used to sort the feature vectors. Experimental results show that detection rates of about 98% can be achieved on different sets of 50 images with and without modifications such as compression and Gaussian noise; this method can also detect copy-move forgery with rotation. The most important issues in the block matching approach are the block feature representation and the sorting algorithm. A robust feature extraction method must be employed that is insensitive to different types of post-processing and has low complexity. In addition, the sorting algorithm must have low run-time complexity.
3 Proposed Method

Copy-move tampering is done by copying a region of the image and pasting it at another place in the same image. Blurring may be applied to hide borders and to blend the pasted region with the image background. When a region is copied and pasted to another place, it keeps some of its underlying features, which can be used to identify tampering. The feature used here is the noise pattern. Specifically, we studied the noise present in an image and found that an original image has different noise patterns associated with regions related to different objects, whereas a tampered image, in which one of its regions is replicated, has almost the same noise pattern for both the copied and pasted parts. The general framework of our algorithm is as follows:
Step-1: Segment the input image.
Step-2: Using the segmented image, estimate the image noise.
Step-3: Analyze the noise pattern of each segment.
Step-4: The image is tampered if the noise patterns of at least two segments are similar.

In the following subsections, we elaborate on each step.

3.1 Image Segmentation

The input image I of size m×n is segmented into j segments S1, S2, ..., Sj in such a way that each object is fully contained in a single segment and each segment is almost homogeneous. For this purpose, we used the algorithm presented in [18]. This algorithm works on multiple scales of the image in parallel, without iteration, to capture both coarse and fine level details. It works simultaneously across the graph scales, with an inter-scale constraint to ensure communication and consistency between the segmentations at each scale.

3.2 Image Noise Estimation

Though many different denoising algorithms exist in the literature which could be used for noise estimation, each algorithm focuses on a particular type of noise, so these methods do not serve our purpose. We present a new noise estimation method, which can be useful for other applications as well. Our idea of denoising and noise estimation is based on the following well-known fact [19]: if an image g(m, n) is formed by averaging noisy versions fi(m, n) of an image f(m, n), then g(m, n) approaches f(m, n) when the number of noisy versions is sufficiently large. We use the segments of an image to generate its noisy versions. The details of the noise estimation algorithm are given below.

Step-1: For each segment Si, compute the average gray level gi as follows:

g_i = \frac{1}{l_i} \sum_{x=1}^{l_i} S_i(x)   (1)

where li is the size of Si, i.e., the total number of pixels in Si, and Si(x) is the intensity of pixel x; each segment is represented as a vector.
Step-2: Subtract the average gray level gi from each pixel value of Si; because each segment is almost homogeneous, the result is the noise segment Si'.
Step-3: For each noise segment Si', create a noise image Ii of size m×n (see Sub-section 3.2.1).
Step-4: For each noise image Ii, subtract Ii from I to get a denoised image:

I_d^i = I - I_i   (2)

Step-5: Find the average of the denoised images:

I_d = \frac{1}{j} \sum_{i=1}^{j} I_d^i   (3)
N. Muhammad et al.
Step-6: Estimate the noise by subtracting the average denoised image from the given image: NL = I - Id (4) NL is noise estimation of the image and is used to analyze the noise pattern corresponding to each segment. 3.2.1 Creating a Full Noise Image from Noise Segment In this section, we give the details of our method for creating a full noise image from each segment. The main steps of the algorithm are as follows: Step-1: For each segment , create a list of the pixel values in S . Step-2: Create an image of size m×n containing random numbers where each random number r is an integer between 0 and -1, where li is the number of pixels in S . Step-3: Replace each random number r in with the value from list at index r.
Fig. 1. (a) The input image, (b) its segmentation into five segments; (c, d, e, f, g) are the pairs of histograms corresponding to the noise segments Si', i=1, 2, 3, 4, 5 and the respective noise images Ii, i=1, 2 ,3, 4, 5
This method creates a full noise image from a given segment having same statistical properties as of the segment. As noise is a random signal and is created using a random process from the given segment, then the noise in and the segment will have the same statistical properties. This assertion is further validated by the histograms of the segments and those of the corresponding noise images as shown in Fig. 1. The figure shows that the histogram of a segment and that of its corresponding noise image are similar. 3.3 Analyzing Noising Pattern for Each Image Segment For each segment Si obtained in the image segmentation step, the corresponding segment NSi is extracted from the estimated noise image NL. To analyze the noise pattern of NSi, we compute its histogram hi with l bins and calculate the probabilities (5) from the histogram hi.
A Non-intrusive Method for Copy-Move Forgery Detection
521
We found that the histograms of the copied and pasted segments are almost the same. 3.4 Detect Tampered Regions The estimated noise patterns corresponding to the copied and the pasted regions have the similar histograms. In view of this, detecting the tampered regions is equivalent to checking the similarity of the corresponding histograms. Many different methods can be used to test the similarity of histograms. We employed the following simple statistical measures: Moments of first and second order L
mi = ∑ x j p i ( x),
j = 1,2
(6)
x=0
Central moments of order up to 4 L
μ i = ∑ ( x − m1 ) pi ( x ), j
j = 1,2,3,4
(7)
j = 1,2,3,4
(8)
x =0
Absolute central moments of order up to 4 L
μˆ i = ∑ | x − m1 | pi ( x), j
x=0
After performing several tests, we found that the similarity of the histograms can be detected using central moment of the second order. As such, we selected it to be the measure of the histogram similarity.
4 Results and Comparison We tested the performance of our proposed method on a number of forged images, and found encouraging results. In this section, we present test results on 4 images, which are shown in Figures 2~5. For each image, we choose the minimum number of segments that can segment the image objects correctly.
Fig. 2. (a) Original image, (b) Tampered image, (c) Segments of the image in (b) white region corresponds to a segment, (d) detected tampered region
522
N. Muhammad et al.
Fig. 3. (a) Original image, (b) Tampered image, (c) Segments of the image in (b) white region corresponds to a segment, (d) detected tampered regions
Fig. 4. (a) Original image, (b) tampered image, (c) segments of the image in (b) white region corresponds to a segment, (d) detected tampered regions
Fig. 5. (a) Original image, (b) Tampered image, (c) Segments of the image in (b) white region corresponds to a segment, (d) Detected tampered regions
Fig. 2(b) is the tampered image where the upper object has been copied and pasted in the lower part. Fig. 2(c) shows the segmentation of the tampered image, here each white region corresponds to a segment. Fig. 2(d) shows the result of the detection process. The central moments for each segment are shown in the second row of Table 1; it is obvious that copied and pasted segments 3 and 5 have similar central moments and are detected correctly. Fig. 3(b) is the tampered image where the blue cap in the top row has been copied and pasted in the lower row. Fig. 3(c) shows the segmentation of the tampered image and Fig. 3(d) shows the result of the detection process. The central moments for all segments are shown in the fourth row of Table 1; one can see that copied and pasted segments 3 and 5 have been detected correctly. Fig. 4(b) is the tampered image where the picture in the lower part has been copied and pasted in the upper part. Fig. 4(c) shows the segmentation of the tampered image and Fig. 4(d) display the result of the detection process. The central moments for all segments are shown in the sixth of Table 1and it is clear that tampered segments 2 and 4 are detected correctly.
A Non-intrusive Method for Copy-Move Forgery Detection
523
(b) B=20, T=0.5
(a) B=40, T=1
Fig. 6. The detection result for the tempered image shown in Fig.4 using blocks size B and a similarity threshold T. In (a) we used the same parameters values which have been used in [12]. In (b) we changed the parameter values. Table 1. Central Moments Seg.# 1 Image 1 (Figure 2) 251.4 Image 2 (Figure 3) 324.16 Image 3 (Figure 4) 8.960 Image 4 (Figure 5) 29.44
2
3
4
5
6
7
8
9
132.66
55.44
2.8
55.44
103.0
13.2
130.6
183.0
87.04
5.04
5.44
5.04
2.40
8.24
2.40
136.00
24.96
1.360
95.44
15.04
69.04
100.96
7.360
10
11
100.96
62.64
Fig. 5(b) is the tampered image where the blue object has been copied and pasted on the right side of green object. Fig. 5(c) shows the segmentation of the tampered image and Fig. 5(d) shows the result of the detection process. The central moments for each segment are shown in the eighth row of Table 1 and it is obvious that tampered segments 9 and 10 are detected correctly. We also compared our method with a recent non-intrusive forgery detection method presented by Babak et al. [12]; this method partitions an image into equal size rectangular blocks and uses Discrete Wavelet Transform (DWT) to estimate the image noise for detecting image tampering. The noise feature used by them is MAD, median absolute deviation, which is employed to measure the noise inconsistency between blocks. If there is no noise inconsistency across all the blocks, then it is original, otherwise it is tempered. We applied their algorithm using our test images; the results show that our algorithm can produce more precise and clear results. For example, Fig. 6 shows the detection results of the tampered image depicted in Fig. 4; in this figure, the regions with homogenous noise level are shown in black while other regions are assigned random colors. Fig. 6(a) implies that the noise level is consistent over the entire image and there is no tampering but this is a false result. For Fig. 6(b), the green (and similarly pink) regions represent the places where the noise level is not consistent; the green region partially detects the tampered region whereas pink region is a false detection.
5 Discussion The proposed algorithm represents a promising non-intrusive algorithm for copymove forgery detection, which is based on the analysis of noise pattern. Other
524
N. Muhammad et al.
methods that use image noise for forgery detection are proposed in [2, 7, 10, and 17]; some of them require training a classifier with hundreds of images from several cameras. These algorithms can detect tampering in images captured by the same camera used to capture the training images. Because of that, these algorithms require previous knowledge about the camera used, which is not always available. However, our proposed algorithm finds the replicated regions in an image without any previous knowledge about the camera used to capture the image. The proposed algorithm is affected by the segmentation of the image. It can provide better results with segmentation algorithm that can segment an image into complete objects more accurately. As it can be observed from Fig. 5, the segmentation algorithm divides a single object into several parts. The unequal parts in both the copied and pasted objects in the image will have different statistical features and hence, will result in false detection.
6 Conclusion We have studied a challenging problem in digital image forgery detection. In this paper, we presented the initial finding of our study. We proposed a new algorithm that can effectively detect tampering in an image without requiring any knowledge about the camera used to capture the image. So far, we have tested our algorithm on images where the background is simple. We will explore it further for images with complicated background and texture. This will require a more robust and reliable segmentation algorithm. Second, besides using histograms, we will investigate using more robust features for representing noise patterns and being able to differentiate between the tampered and un-tampered segments. Acknowledgment. This work is supported by the National Plan for Science and Technology (NPST), King Saud University, Riyadh, Saudi Arabia under the project 10-INF1140-02. We are thankful to B. Mahdian and co-workers, the authors of the work in [12], for providing their code to compare their results with ours.
References 1. Bayram, S., Sencar, H.T., Memon, N.: An Efficient and Robust Method for Detecting Copy-Move Forgery. In: Proc. IEEE ICASSP, pp. 1053–1056 (2009) 2. Chen, M., et al.: Determining Image Origin and Integrity Using Sensor Noise. IEEE Transactions on Information Forensics and Security, 74–90 (2008) 3. Fridrich, J., Soukal, D., Lukas, J.: Detection of Copy Move Forgery in Digital Images. In: Digital Forensic Research Workshop, Cleveland, OH (2003) 4. Huang, H., et al.: Detection of Copy-Move Forgery in Digital Images Using SIFT Algorithm. In: Pacific-Asia Workshop on Computational Intell. Industrial App., pp. 272– 276 (2008) 5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., Section 8.2: Counting sort, pp. 168–170. MIT Press and McGraw-Hill (2001) 6. Kang, X., Wei, S.: Identifying Tampered regions using singular value decomposition in Digital image forensics, pp. 926–930. IEEE Computer Society, USA (2008)
A Non-intrusive Method for Copy-Move Forgery Detection
525
7. Li, Y., Li, C.-T.: Decomposed Photo Response Non-Uniformity for Digital Forensic Analysis. In: Sorell, M. (ed.) e-Forensics 2009. LNICST, vol. 8, pp. 166–172. Springer, Heidelberg (2009) 8. Lin, H.-J., Wang, C.-W., Kao, Y.-T.: Fast Copy-Move Forgery Detection. WSEAS Trans. Signal Process, 188–197 (2009) 9. Lin, H.J., Wang, C.W.,, Y.: Fast Copy-Move Forgery Detection. In: World Scientific and Engineering Academy and Society (WSEAS), pp. 188–197 (2009) 10. Lukáš, J., et al.: Detecting Digital Image Forgeries Using Sensor Pattern Noise. In: Proc. of SPIE (2006) 11. Mahdian, B., Saic, S.: Detection of Copy Move Forgery Using a Method Based on Blur Moment Invariants. Forensic Science International 171, 180–189 (2007) 12. Mahdian, B., Saic, S.: Using Noise Inconsistencies for Blind Image Forensics. Image and Vision Computing, 1497–1503 (2009) 13. Popescu, A.C., Farid, H.: Exposing Digital Forgeries by Detecting Duplicated Image Regions. Dept. Comput. Sci., Dartmouth College,Tech.Rep. TR2004-515 (2004) 14. Sutcu, Y., et al.: Tamper Detection Based on Regularity of Wavelet Transform Coefficients. In: Proc. IEEE ICIP, pp. 397–400 (2007) 15. Wang, J., et al.: Detection of Image Region Duplication Forgery Using Model with Circle Block. In: International Conf. Multimedia Inform. Network. and Security, pp. 25–29 (2009) 16. Zhang, J., Feng, Z., Su, Y.: A New Approach for Detecting Copy-Move Forgery in Digital Images. In: IEEE Singapore Int. Conf. Comm. Sys., China, pp. 362–366 (2008) 17. Zhang, P., Kong, X.: Detecting Image Tampering Using Feature Fusion. In: International Conference on Availability, Reliability and Security, pp. 335–340 (2009) 18. Cour, T., et al.: Spectral Segmentation with Multiscale Graph Decomposition. In: IEEE International Conference on Computer Vision and Pattern Recognition, CVPR (2005) 19. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. (2002)
An Investigation into the Use of Partial Face in the Mobile Environment G. Mallikarjuna Rao1, Praveen Kumar1, G. Vijaya Kumari2, Amit Pande3, and G.R. Babu4 1 Department of Computer Science, Gokaraju Rangaraju institute of engineering and technology, Hyderabad [email protected], [email protected] 2 Computer Science Department, Jawaharlal Nehru Technological University, Hyderabad [email protected] 3 Computer Science Department, University of California-Davis, USA [email protected] 4 Department of ECE, KMIT, Hyderabad [email protected]
Abstract. Face recognition has been extensively explored in diversified applications on ubiquitous devices. Most of the research has been primarily focused on full frontal/profile of facial images while proposing novice techniques to pursue this problem. The resource constraint in Mobile devices adds more complexity to the face recognition process. To reduce computational requirements some investigations are made to use the partial faces for recognition process. However, the inadequate information in partial faces makes the problem much more challenging and therefore limited attempts have been made in this direction. Our Active pixel based approach is capable of recognizing the persons using either full or partial face information. The technique reduces the computational resources compared to the LBP which was claimed as one of the most suitable approach on mobile devices. We have carried out the experiments on the YALE facial databases. Other works [1, 2] have used 50% vertical portion and showed the accuracy of correct recognition 94% within best five matches. In our dynamic partial matching we have used 10% to 34% image and obtained correct recognition rate from 96% to 100% within best three matches. Keywords: Face Recognition, partial faces, LBP, Brody Transform, Active Pixel, Eigen template.
1 Introduction The increasing trend toward image and video based applications has stimulated interest in image processing community, both in academia and industry. Face Recognition is one of the key applications which has great potential use for mobile based systems. For example, Oki Electric's Face Sensing Engine (FSE), enables instant face recognition using the camera on mobile phones to restrict unauthorized G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 526–535, 2011. © Springer-Verlag Berlin Heidelberg 2011
An Investigation into the Use of Partial Face in the Mobile Environment
527
access to the information on the mobile phones. Google uses face recognition in Picasa to let users tag some of the people in their photos and then searches through other albums to suggest other pictures in which the same faces appear. The presence of digital camera in majority of mobile phones enables the people to acquire photos of the people they see on the move. Using face recognition techniques on these images makes it possible to perform so called face tagging to tag images with the names of the photographed persons. Also, it can help security personnel on the move to identify a potential suspect by recognizing suspect’s face from the database and accessing his history. Thus, there is great need to incorporate face recognition technologies onto mobile devices to facilitate such standalone mobile applications. However there are four major problems that need to be solved, namely the limited storage and processing power of the mobile device, connection instability, security and privacy concerns, and limited network bandwidth [3]. Problems of connection instability and limited network bandwidth can be avoided by performing the face recognition on the (client) mobile device after transferring the trained data form a computer (sever). This kind of architecture has been conceived for other applications on mobile as well in [4, 5]. In [5], the author’s uses a DCT-based compression method to store the image database on mobile device and the recognition algorithm runs directly on the compressed database without decompression. This allows on the spot-field usage, reduces the overhead of network transfer, and can address even the security and privacy issues. The solution to other two problems concerning with the limited memory and processing power on mobile devices is also being explored. In a recent master’s thesis [6], some work is done to compare variety of well-known face recognition algorithms on a holistic metric of accuracy, speed, memory usage, and storage size for execution on mobile devices. Based on the results, the author declares LBP as an overall optimal algorithm. However, it was observed by us that the efficiency achieved by even LBP method is not satisfactory enough especially when the size of the database increases. The face recognition using partial images is also aimed to reduce the computational resources on mobile devices. The partial images are used in robust elastic partial matching [1, 2] where the authors attempted the partial face recognition which was not explicitly targeted for mobile environment. They have experimented with YALE dataset and chosen the vertical strips which cover 50% of the original image. The static cropping was done for all test images using third party software. The best five matches are used in the recognition process for randomly selected 10 images. The Neural classifier uses training data set that covers complete profile of each image. Their results indicated 70% correct recognition within three matches and above 94% with fourth and fifth match. In our active pixel [7] approach explained in section 2, we have used dynamic cropping for the test image and it was correlated against associated class templates that cover the same region. The highly correlated class is then used to provide the first and second best matches. Figure 1 shows that even a portion of the face is sufficient for the detection process. It shows the cropped region of the selected subject from YALE dataset [8] and the best two matches. 
The complete experimentation and results are illustrated in Section 3. The paper concludes in Section 4.
528
G .Mallikarjuna Rao et al.
Fig. 1. Recognition using Partial Images: A: Subject 6, B subject 7 from Yale Database[7]
2 Our Approach 2.1 Brody Transformation The Brody transformation or R-Transform[9], proposed by Reitboeck and Brody is having shift invariant feature and the transformed data are independent of cyclic shifts of the input signal. The transformation works on the similar grounds of FFT which can’t deal with cyclic shift invariance. The Brody transform is effectively used in many pattern recognition problems. This basic R-transform does not provide its own inverse. Many researchers proposed inverse using various symmetric functions so that it contains one to one relation. We have proposed the inverse for Brody transformation [7] with two basic symmetrical operations (multiplication and division). Figure 2 gives the signal flow of Brody Transform. 2.2 Active Pixel The local variation in the image gives essence of small region which will be used as a tool for subsequent recognition process. We proposed the active pixel to refer that
An Investigation into the Use of Partial Face in the Mobile Environment
529
Fig. 2. Signal flow diagram of Brody Transform
image portion which contains vital information of local intensity variations. The active pixel count is used to provide the signature of the local region. We call the first element of the spectral distribution of the brody transformed pattern as the cumulative point index (CPI). The CPI represents the total spectral power of the image portion. The middle element of the Brody transform reflects the subtractive point index (SBI) that gives spectral difference of symmetric halves. The difference between CPI and SBI reveals the pixel relations in that region. The normalized difference between CPI and SBI is used as threshold to find the active pixel. The central pixel is said to be active pixel if 4 or more elements of the transformed pattern are greater than the threshold value. The back ground and uniform noise (uniform grey intensities) are successfully eliminated during the course of computation. The active pixel count is then used as a feature element characterizing each block. This forms the basis for our face recognition approach. The feature vector is then constructed for the entire image from these feature elements. 2.3 Comparing Our Approach with LBP The Local Binary Pattern (LBP) was proposed originally for texture recognition [10, 11], which gradually became most widely used by researchers for other domains. Recently it has been applied to face recognition especially to reduce memory and computational requirements. Here the image is divided into blocks and a 3x3 mask and is used to construct the binary pattern based on the relationship of neighbors with respect to the central pixel. The 8-neighbors are assigned the binary value ‘1’; if it is greater than central pixel value otherwise it is assigned binary value ‘0’. The weighted sum of these binary bits gives the integer which denotes the local feature. Although LBP is suitable for extracting local features with relatively lower time and space requirements, still it is not optimal for memory and power constrained environments
like mobile devices. Hence researchers have proposed variants of LBP to address this problem. In [11], the authors proposed rotation-invariant LBP using uniform patterns, as shown in Table 1. The 256 binary combinations generated by the 8-bit input pattern yield 36 transformed patterns. The Brody transform has the remarkable ability to represent the 58 uniform patterns generated by LBP [12] with only 8 transformed pattern classes, as shown in Table 1.

Table 1. Pattern classes generated by the Brody transform for LBP uniform patterns
Class        Binary patterns of the 8-neighborhood (3x3 mask)                                     Transformed Brody pattern
Line end 1   10000000; 01000000; 00100000; 00010000; 00001000; 00000100; 00000010; 00000001       11111111
Corner 1     11000000; 01100000; 00110000; 00011000; 00001100; 00000110; 00000011; 10000001       20202020
Corner 2     11100000; 01110000; 00111000; 00011100; 00001110; 00000111; 10000011; 11000001       31113111
Corner 3     11110000; 01111000; 00111100; 00011110; 00001111; 10000111; 11000011; 11100001       40004000
Edge         11111000; 01111100; 00111110; 00011111; 10001111; 11000111; 11100011; 11110001       51113111
Corner 4     11111100; 01111110; 00111111; 10011111; 11001111; 11100111; 11110011; 11111001       60202020
Line end 2   11111110; 01111111; 10111111; 11011111; 11101111; 11110111; 11111011; 11111101       71111111
Flat         00000000                                                                             00000000
Spot         11111111                                                                             80000000
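For concreteness, the basic 3x3 LBP feature described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation; the function names, the neighbour ordering, and the use of NumPy are our own assumptions.

```python
import numpy as np

def lbp_code(img, r, c):
    """Basic 3x3 LBP code of the pixel at (r, c) of a grayscale image (NumPy array)."""
    center = img[r, c]
    # Fixed ordering of the 8 neighbours; each one contributes a single bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr, c + dc] > center:   # '1' if the neighbour is greater than the centre
            code |= 1 << bit               # weighted sum of the binary bits
    return code                            # integer in [0, 255], the local feature

def lbp_histogram(block):
    """256-bin LBP histogram of one image block (interior pixels only)."""
    hist = np.zeros(256, dtype=np.int32)
    for r in range(1, block.shape[0] - 1):
        for c in range(1, block.shape[1] - 1):
            hist[lbp_code(block, r, c)] += 1
    return hist
```

Uniform or rotation-invariant variants then map these 256 raw codes onto the smaller pattern sets discussed above.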
The feature vector of the image in the active pixel approach contains fewer elements than in the LBP approach. Table 2 compares the memory requirements of LBP and the active-pixel-based approach in terms of feature elements.

2.4 Feature Matching Using Correlation

The similarity between two objects can be measured with various distance measures, such as the Minkowski distance, the Euclidean distance, and correlation. In our approach we use correlation as the similarity measure: two highly correlated images are considered to match closely.
Table 2. Memory requirements of the LBP and active pixel approaches

Image size    Nature of method (3x3 mask)    Number of feature elements
                                             LBP        Active Pixel
32 x 32       Non-overlapping                100        16
32 x 32       Overlapping                    1024       16
64 x 64       Non-overlapping                448        32
64 x 64       Overlapping                    4096       32
128 x 128     Non-overlapping                1820       64
128 x 128     Overlapping                    16384      64
256 x 256     Non-overlapping                7280       1024
256 x 256     Overlapping                    65536      1024
Spearman rank correlation measures the correlation between two sequences of values. The two sequences are ranked separately, and the rank difference d_i is computed at each position i. The distance between sequences X = (X_1, X_2, ...) and Y = (Y_1, Y_2, ...) is computed using the following formula, where n is the length of the sequences:

ρ = 1 − ( 6 Σ_i d_i² ) / ( n (n² − 1) )

The range of the Spearman correlation is from −1 to 1, and it can detect certain linear and non-linear correlations. The Correlation block computes the cross-correlation of the first dimension of a sample-based N-D input array u and the first dimension of a sample-based N-D input array v.
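As a hedged illustration of the formula above, the following sketch computes the Spearman coefficient directly from the rank differences; it assumes distinct values (no tie handling) and the function name is ours.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation of two equal-length feature vectors (no ties)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    rx = np.argsort(np.argsort(x))   # ranks of x
    ry = np.argsort(np.argsort(y))   # ranks of y
    d = rx - ry                      # rank difference at each position i
    return 1.0 - 6.0 * float(np.sum(d * d)) / (n * (n * n - 1))
```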
3 Face Recognition Procedure and Experimentation

Face recognition is implemented with a two-level matching approach based on active pixels. In the training stage, an Eigen-template is computed for each class from the normalized active pixels of that class. In the testing phase, first-level matching correlates the active-pixel feature set of the test image with each Eigen-template; the best-ranked class is then further correlated with each image of that class, and the top two ranked images form the matched set. Exhaustive experiments were performed on the YALE data set [8] using the Brody transform and active pixels. The YALE database contains 165 images of 15 subjects covering different facial expressions; there are 11 faces of size 243 x 320 for each subject. In our experiments, partial faces are selected using dynamic cropping, as shown in Figure 3. The probe set and training set are different; the probe shown was made on subject 04 with glasses (top row of Figure 3), and the best two recognized images come from the training set without glasses. A total of 120 images were selected randomly for testing. Tables 3 and 4 give the correct and false recognition rates for the first best match (Rank 1) and within the best three matches (Rank 3). The recognition process has two levels: in the first level the partial portion is tested against the 15 Eigen active-pixel templates; the matched template gives the subject class, and secondary matching is then performed to obtain the two most similar images.
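The two-level matching flow described above can be sketched as follows. This is only an illustration under assumed data layouts (feature vectors already extracted, templates stored per class); the function names are hypothetical, and a small rank-correlation helper is included so the sketch is self-contained.

```python
import numpy as np

def rank_corr(a, b):
    """Spearman correlation as Pearson correlation of the ranks (no tie handling)."""
    ra = np.argsort(np.argsort(np.asarray(a))).astype(float)
    rb = np.argsort(np.argsort(np.asarray(b))).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def recognize(test_vec, eigen_templates, class_vectors, top_k=2):
    """Two-level matching.

    test_vec        : active-pixel feature vector of the dynamically cropped probe
    eigen_templates : dict {class_id: Eigen-template covering the same region}
    class_vectors   : dict {class_id: list of per-image feature vectors}
    """
    # Level 1: pick the class whose Eigen-template correlates best with the probe.
    best_class = max(eigen_templates,
                     key=lambda c: rank_corr(test_vec, eigen_templates[c]))
    # Level 2: rank the images of that class and keep the top-k matches.
    scored = sorted(((rank_corr(test_vec, v), idx)
                     for idx, v in enumerate(class_vectors[best_class])),
                    reverse=True)
    return best_class, [idx for _, idx in scored[:top_k]]
```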
Fig. 3. Partial probe and two best matches

Table 3. Recognition with Rank 1

Dynamic cropped size    Rank 1 (120 samples)       % Recognition
(% of full image)       Correct      False         Correct      False
35x79 (3%)              62           58            52           48
33x104 (4%)             73           47            61           39
100x42 (5%)             79           41            66           34
56x145 (10%)            104          16            87           17
81x186 (19%)            109          11            91           9
212x91 (25%)            114          06            95           5
121x219 (34%)           116          04            97           3
239x318 (98%)           120          00            100          0
We compared our results with the robust elastic partial matching approach [1, 2], which uses static cropping on the YALE database. Those authors took 50% of the image (a vertical half), and their recognition process selects five spatial neighbors and ranks them in order of matching; they report 94% correct recognition if the best five matches are taken into consideration, with a 50% reduction in computational resources (due to using only 50% of the image). In our approach, we dynamically selected cropped portions of different sizes and verified the best three matches. We observed that the accuracy of recognition is sensitive to the location of the cropped region and to the size of the partial image: a vertical strip covers more local facial regions and hence gives a better recognition rate than a horizontal strip, owing to its finer coverage of local regions.
Table 4. Recognition within Rank 3

Dynamic cropped size    Within Rank 3 (120 samples)    % Recognition
(% of full image)       Correct      False             Correct      False
35x79 (3%)              78           42                70           30
33x104 (4%)             89           31                74           26
100x42 (5%)             102          08                85           5
56x145 (10%)            115          05                96           4
81x186 (19%)            116          04                97           3
212x91 (25%)            119          01                99           1
121x219 (34%)           120          00                100          0
239x318 (98%)           120          00                100          0
The smallest cropping size is 3% (33 x 79) and the largest is 34% (121 x 219). Tables 3 and 4 give the recognition rates for the first three best matches.
4 Conclusions

Partial face recognition was performed on the YALE database. From the tables above, it can be observed that the first match gives a correct recognition rate of up to 60% with a cropped area of less than 5%, up to 89% with a 15% cropped region, and above 90% when the cropped region reaches 35%; beyond this, the recognition rate is 100%. The recognition rate rises to 90% even for the smallest cropped image (3%) if the first four best matches are taken. The false recognition rate falls steeply when the second- and third-rank images are also taken into account in the recognition process. The results are illustrated in Figures 4 and 5.
Fig. 4. Correct Recognition (%) using Rank 1 to Rank 4 matches
Fig. 5. False Acceptance (%) using Rank 1 to Rank 4 matches
Most LBP approaches use computationally expensive detectors in the recognition process. We also performed a timing analysis of the above techniques, together with the active pixel approach, to compare execution times; the results are shown in Table 5. The programs were executed using MATLAB 7 on a dual-core 1.88 GHz processor with 2 GB of RAM. The comparison clearly demonstrates that the computational requirement of the active-pixel-based approach is considerably smaller than that of the LBP-based approaches, and partial image processing reduces the computational load further.

Table 5. Computation time for LBP variants and the active pixel approach
Approach                         Time
LBP with PCA                     10 hrs
LBP with LDA                     12 hrs
LBP with correlation             8 hrs
Active Pixel with correlation    1 hr 45 min
Acknowledgement. This work is partially supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CIFellows Project.
References 1. Face recognition by elastic bunch graph matching. IEEE Trans. PAMI 19(7), 775-779 (1997) 2. Elastic and Partial Matching Metric for Face Recognition. IEEE 12th International Conference on Computer Vision, ICCV 2009, pp. 2082-2089 (2009) 3. Mukherjee, S., et al.: A secure face recognition system for mobile-devices without the need of decryption (2008)
4. Hull, J.J., et al.: Mobile image recognition: architectures and tradeoffs, pp. 84–88. ACM, New York (2010) 5. Sen, S., et al.: Exploiting approximate communication for mobile media applications, pp. 1–6. ACM, New York (2009) 6. Junered, M.: Face recognition in mobile devices. Luleå tekniska universitet (2010) 7. Mallikarjuna Rao, G., Babu, G.R., Vijaya Kumari, G., Krishna Chaitanya, N.: Methodological Approach for Machine based Expression and Gender Classification. In: IEEE Int. Advance Computing Conference, pp. 1369–1374 (2009) 8. YALE Face Database, http://www.vision.ucsd.edu 9. Reitboeck, H., Brody, T.P.: A Transformation With Invariance Under Cyclic Permutation for Applications in Pattern Recognition. Inf. & Control 15(2), 130–154 (1969) 10. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29(1), 51–59 (1996) 11. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 971–987 (2002) 12. Ahonen, T., Hadid, A., Pietikäinen, M.: Face Recognition with Local Binary Patterns. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
Optimal Multiclass Classifier Threshold Estimation with Particle Swarm Optimization for Visual Object Recognition Shinko Y. Cheng, Yang Chen, Deepak Khosla, and Kyungnam Kim HRL Laboratories, LLC 3011 Malibu Canyon Road Malibu CA 90265 {sycheng,ychen,dkhosla,kkim}@hrl.com
Abstract. We present a novel method to maximize multiclass classifier performance by tuning the thresholds of the constituent pairwise binary classifiers using Particle Swarm Optimization. This post-processing step improves the classification performance in multiclass visual object detection by maximizing the area under the ROC curve or various operating points on the ROC curve. We argue that the precision-recall or confusion-matrix measures commonly used for evaluating multiclass visual object detection algorithms are inadequate compared with the Multiclass ROC when the recognition algorithm is applied to surveillance, where objects remain in view for multiple consecutive frames and background instances exist in far greater numbers than target instances. We demonstrate the method's efficacy on a visual object detection problem with a 4-class classifier; moreover, the PSO threshold tuning method can be applied to any pairwise multiclass classifier using any computable performance metric.
1 Introduction
This paper introduces a novel method to maximize multiclass classifier performance by tuning the thresholds of the constituent pairwise binary classifiers using Particle Swarm Optimization (PSO) [1]. This post-processing step improves the classification performance in multiclass visual object detection by maximizing the area under the ROC curve or various operating points on the ROC curve. We argue that the precision-recall or confusion-matrix measures commonly used for evaluating multiclass visual object detection algorithms are insufficient compared with the Multiclass ROC when the recognition algorithm is applied to surveillance, where objects remain in view for multiple consecutive frames and background instances exist in far greater
This work was partially supported by the Defense Advanced Research Projects Agency (government contract no. HR0011-10-C-0033) NeoVision2 program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency of the U.S. Government.
numbers than target instances. Finally, although we demonstrate its efficacy on the visual object detection problem, this method can be applied to any pairwise multiclass classifier using any computable performance metric. A well-established way to construct a multiclass classifier is to train several binary classifiers, one for each pair of classes among the M classes [2,3]. A total of M(M − 1)/2 pairwise classifiers are trained and applied to a test sample to obtain a prediction: each pairwise classifier generates a vote for a class, and the prediction is made by collecting all the votes and selecting the class with the majority. The problem with this method is the suboptimal choice of thresholds that results from considering only two classes of samples at a time when training the constituent binary classifiers. Each pairwise classifier optimally divides the samples between only the two classes in its pair. It is possible to obtain a better mean correct classification performance over all classes by increasing the rate of certain classes with only a small decrease in the rate of the others; the challenge is to estimate a more optimal threshold for each pairwise classifier. We propose to solve this problem by tuning the thresholds to maximize the recognition performance as a function of the Multiclass ROC, using PSO. The advantage of PSO for function optimization is that it does not require computing complex, sometimes infeasible, derivatives of the objective function, which enlarges the space of usable objective functions. We demonstrate a threshold tuning algorithm utilizing the geometric mean of the areas under the Multiclass ROC of each class, and separately the geometric mean of the class true-positive rates at particular false-positive rates. Furthermore, since PSO consists of a single update equation, its implementation is very straightforward. The remainder of this paper is organized as follows. In Sec. 1.1, we review related metrics in the literature for evaluating multiclass visual object detection and argue that existing metrics paint an incomplete picture, largely due to the multiclass aspect of the underlying model. In Sec. 2, we describe threshold tuning using PSO with the objective of maximizing Multiclass ROC performance. In Sec. 3, we show the efficacy of the optimization, and we conclude with a discussion in Sec. 4.
1.1 Related Work
A popular metric for multi-class visual object recognition is precision-recall [4]. This metric stems from the image-retrieval problem: recall represents the percentage of positive examples successfully retrieved (detected), and precision measures the percentage of retrieved instances that are correct. In video surveillance, these measurements are also important and relevant. However, the false-positive rate – the percentage of non-target instances erroneously detected as targets – has a distinct advantage in that surveillance analysts can place an expectation on the number of false positives per unit time simply by scaling the false-positive rate accordingly, whereas precision only indirectly
represents false positives, relative to the proportion of recalled instances. On the other hand, precision does allow the analyst to place an expectation on the fraction of detected targets that are erroneous. Variations of the PASCAL VOC metric can be observed in numerous efforts addressing various aspects of the visual object detection problem for surveillance, including Video Analysis and Content Extraction (VACE/CLEAR [5,6]), which pre-dates the PASCAL VOC metric [7], the CalTech Pedestrian Detection Benchmark [8], and more recently, for event recognition, the Video Image Retrieval and Analysis Tool (VIRAT) [9]. Despite this level of effort, little attention has been given to performance metrics that capture the detection accuracy of a multiclass detector. Much of the effort has focused on evaluating tracking performance (VACE/CLEAR); several efforts consider only the two-class problem (VACE/CLEAR, CalTech Pedestrian) or only partially address performance evaluation of the multiclass detector (PASCAL VOC). Specifically, for the VACE/CLEAR and PASCAL VOC metrics, the performance in detecting multiple categories of objects is measured by treating each detector independently of the others, generating one Spatial Frame Detection Accuracy (SFDA) [5] or precision-recall curve [7] per object. This implies that the system's computational complexity and false-positive rate scale linearly with the number of categories. However, any practical multiclass object recognition system would employ some hierarchical prediction scheme and a category confusion resolution mechanism for each detection area; the strengths of such a practical system would not be captured by these metrics.
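To make the distinction concrete, the quantities discussed here can be computed from raw counts as in the hedged sketch below (our own helper, not part of any of the cited benchmarks):

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall (true-positive rate) and false-positive rate from counts.

    fp and tn are counted over background (non-target) instances, so the
    false-positive rate can be scaled by the frame rate to give an expected
    number of false alarms per unit time, which precision does not offer directly.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0            # true-positive rate
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, false_positive_rate
```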
2 Optimal Multi-class Classifiers from Pairwise Binary Classifiers
A multi-class classifier based on pairwise binary classifiers is composed of a pairwise classifier for every pair of classes among the M classes. The set of all pairs of classes can be defined as

C^M = {12, 13, 14, ..., 1M, 23, 24, 25, ..., (M − 1)M}     (1)

where |C^M| = (M choose 2) = M(M − 1)/2. If a pairwise classifier casts a vote between the ith and jth classes, the classification response for a given sample is denoted r_ij(x), ∀(ij) ∈ C^M, and the vector of all pairwise classifier responses is defined as

r(x) = (r_12(x), ..., r_(M−1)M(x))     (2)

In the process of voting, each response is compared to a threshold t_ij and a vote is cast for one of the two classes: if the response is greater than or equal to t_ij, the vote is cast for class i; otherwise, for class j. The class given the most votes among all pairwise classifiers determines the class to which the input sample x is predicted to belong.
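A minimal sketch of this voting scheme is given below. It assumes the pairwise responses and thresholds are stored in dictionaries keyed by class pairs; the names are ours, and ties are broken by the lowest class index.

```python
import numpy as np

def predict_pairwise(responses, thresholds, n_classes):
    """One-vs-one voting with per-pair thresholds t_ij.

    responses  : dict {(i, j): r_ij(x)} with i < j, one response per class pair
    thresholds : dict {(i, j): t_ij} decision threshold of each pairwise classifier
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), r in responses.items():
        # r >= t_ij -> vote for class i, otherwise vote for class j
        votes[i if r >= thresholds[(i, j)] else j] += 1
    return int(np.argmax(votes))
```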
2.1 Multiclass ROC
This paper proposes a Multiclass ROC metric for evaluating multi-class classifiers in visual object detection applications. Such classifiers consist of M − 1 distinct target classes and a single background class. The proposed metric is derived from the 2-class ROC, which explicitly plots the true-positive rate for a given false-positive rate, allowing the algorithm designer to tune the classifier to the desired operating point. Each point on the 2-class ROC is an operating point defined by the value of a threshold applied to the classifier responses. In a similar fashion, the Multiclass ROC has the false-positive and true-positive rates as its two axes; however, there are M − 1 curves, one for each of the M − 1 target classes, an operating point consists of the set of points on each curve with the same false-positive rate, and the operating point is defined by a set of thresholds applied to the constituent classifiers of the multi-class classifier. Fig. 1 illustrates a conceptual Multiclass ROC plot for a multi-class classifier with two target classes and a background class. In the Multiclass ROC, the curves define the class true-positive rate as a function of the false-positive rate. If we let P(i|j) be the probability that a sample belonging to class j is predicted as class i, then the class true-positive rate of target class c is defined as

P_d(c) = P(c|c)     (3)

The false-positive rate P_FA is the probability of predicting a background sample as a target, and is defined as

P_FA = P(¬B|B) = 1 − P(B|B) = Σ_{c≠B} P(c|B)     (4)
Another interpretation of P_d and P_FA can be derived from the confusion matrix. Each confusion matrix consists of these statistics measured from the predictions made on a (test) dataset with the multiclass classifier and a set of thresholds. The class true-positive rates P_d(c) are the diagonal elements of the confusion matrix for the target classes, and the false-positive rate is 1 minus the diagonal element for the background class. Each operating point consists of a pair of measurements (P_d, P_FA).

Table 1. Operating points in the Multiclass ROC are derived from values in the Confusion Matrix
                         Predicted
               B                        C1                         C2
Truth  B       P(B|B) = 1 − P_FA        P(C1|B)                    P(C2|B)
       C1      P(B|C1)                  P(C1|C1) = P_d(C1)         P(C2|C1)
       C2      P(B|C2)                  P(C1|C2)                   P(C2|C2) = P_d(C2)
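The mapping from a confusion matrix to one Multiclass ROC operating point can be sketched as follows; the layout (rows = true class, columns = predicted class, background class at a given index, every class present in the test set) and the function name are our assumptions.

```python
import numpy as np

def roc_operating_point(confusion, background=0):
    """Per-class true-positive rates P_d(c) and the false-positive rate P_FA."""
    conf = np.asarray(confusion, dtype=float)
    rates = conf / conf.sum(axis=1, keepdims=True)   # row j, column i ~ P(i|j)
    p_fa = 1.0 - rates[background, background]       # P(not B | B) = 1 - P(B|B)
    p_d = {c: rates[c, c]                            # diagonal entries of the target classes
           for c in range(conf.shape[0]) if c != background}
    return p_d, p_fa
```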
Fig. 1. Concept of the Multiclass ROC curve. Each curve represents the true-positive and false-positive trade-off for a particular class.
An alternative metric to P_FA is false-positives-per-image (FPPI), which is of particular relevance to surveillance: it measures in absolute terms the average number of false positives produced by the classifier per image, and can be rescaled by the imaging frame rate to obtain how frequently a false positive appears in real time. FPPI is related to P_FA by a normalization factor:

P_FA = FPPI / (total # background samples / total # frames)     (5)
To obtain a Multiclass ROC curve, a series of operating points is calculated using several sets of thresholds for the classifier. In the 2-class case, only a single threshold needs to be varied. In the multi-class case, we vary the thresholds of the pairwise classifiers that have the potential to vote for the background class, with equal steps between successive threshold values. More precisely, the pairwise-classifier thresholds are given by t = t_0 + Δt, where t_0 is the starting threshold value learned with the PSO Threshold Tuning described in Sec. 2.2, and the sweep is defined by

Δt_ij = t   if i = B or j = B
Δt_ij = 0   if i ≠ B and j ≠ B     (6)

where t ∈ R is varied over the range of classifier responses.

2.2 PSO Threshold Tuning
Because the pairwise classifiers are trained considering only the samples of two classes and disregarding all the others, subsets of thresholds may be raised or lowered to bias the classifier towards certain classes, e.g., the background class, in order to minimize false positives. Furthermore, a bias may raise the true-positive rate for one class at only a slight cost in the true-positive rate of the other classes. As a result, the average true-positive rate over all classes may be raised by optimizing the set of thresholds t = (t_ij), ∀(ij) ∈ C^M. Particle Swarm Optimization is used to find the optimal set of thresholds t* that maximizes a function of multiclass classification performance. Namely, we solve for

t* = arg max_t f(t, R)     (7)
input : M - number of classes
        (c1, c2, w) - PSO damping factors
        N - number of swarming iterations
        P - number of swarm particles
        R - set of pairwise classification responses
        f(t, R) - objective function
        A - max absolute value of the classifier responses
output: t* - optimal thresholds

// Swarm initialization
dim ← M(M − 1)/2 ;  g_conf ← −∞ ;
forall i ∈ P do
    v_i ← 0 ;  l_conf,i ← −∞ ;
    x_i ∼ U^dim[−A, A] ;
end

// Simulate swarm
for n ← 1 to N do
    for i ← 1 to P do              // update local best l_i, ∀i
        if l_conf,i < x_conf,i then
            l_conf,i ← x_conf,i ;  l_i ← x_i ;
        end
    end
    for i ← 1 to P do              // update global best g
        if g_conf < l_conf,i then
            g_conf ← l_conf,i ;  g ← l_i ;
        end
    end
    for i ← 1 to P do              // update particles x_i, ∀i
        r1 ∼ U^dim ;  r2 ∼ U^dim ;
        v_i ← w · v_i + c1 r1 (l_i − x_i) + c2 r2 (g − x_i) ;
        x_i ← x_i + v_i ;
        x_conf,i ← f(x_i, R) ;
    end
end
return g ;

Algorithm 1. PSO Threshold Tuning
where R = [r(x_1), r(x_2), ..., r(x_N)] consists of the pairwise classifier responses of all N samples. PSO was chosen for its simplicity of implementation, its flexibility in the kinds of objective functions that can be used, and its superior convergence properties compared to similar techniques such as the simplex method or simulated annealing [1]. We define two functions of classification performance, based on the area under the ROC curve Az (Eq. 8) and on the true-positive rate P_d (Eq. 9):

f_Az(t) = ∏_{c=1}^{M} Az(c, t, R)     (8)

f_Pd(t) = ∏_{c=1}^{M} P_d(c, t, R)     (9)
Both Az(c, t, R) and P_d(c, t, R) are based on the Multiclass ROC metric: Eq. 8 is a function of the areas under the receiver operating characteristic curves, while Eq. 9 is a function of the true-positive rates P_d at a given false-positive rate P_FA; the argument c denotes the class. Using products of factors rather than sums favours solutions in which all factors are equally high. The first objective, using Az, tunes the thresholds to obtain the best classifier over all operating points, while the second, using P_d, obtains the best classifier for a given operating point. The PSO Threshold Tuning routine is given in Alg. 1.
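For reference, the loop of Algorithm 1 can be written compactly in Python as below. This is a hedged sketch, not the authors' code: it evaluates the objective immediately after initialization (so the first local-best comparison is well defined), and the objective is treated as a black box, e.g. one of the product objectives of Eqs. 8-9.

```python
import numpy as np

def pso_tune(objective, dim, n_particles=30, n_iters=100,
             w=0.7, c1=1.5, c2=1.5, bound=1.0, seed=0):
    """Maximize objective(t) over threshold vectors t of length dim with PSO."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, size=(n_particles, dim))  # particle positions
    v = np.zeros_like(x)                                     # particle velocities
    score = np.array([objective(p) for p in x])
    best_x, best_score = x.copy(), score.copy()              # per-particle bests
    g = best_x[np.argmax(best_score)].copy()                 # global best
    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (best_x - x) + c2 * r2 * (g - x)
        x = x + v
        score = np.array([objective(p) for p in x])
        improved = score > best_score
        best_x[improved] = x[improved]
        best_score[improved] = score[improved]
        g = best_x[np.argmax(best_score)].copy()
    return g
```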
3 Experimental Evaluation
An example of an unoptimized Multiclass ROC for a multiclass classifier is shown in Fig. 2a.

Table 2. Threshold tuning results

(a) Mean Az improved following threshold tuning

             Baseline Az    Optimized Az    Δ
Vehicle      0.8924         0.5904          (0.3020)
Pedestrian   0.5257         0.8635          0.3378
Bike         0.5313         0.7350          0.2037
G.Mean       0.6498         0.7296          0.0798

(b) Mean true-positive rate improved following threshold tuning

             Baseline Pd    Optimized Pd    Δ
Vehicle      0.8806         0.6439          (0.2367)
Pedestrian   0.5455         0.7273          0.1818
Bike         0.5094         0.7170          0.2076
G.Mean       0.6452         0.6961          0.0509
Fig. 2. Multiclass ROC of a 4-class classifier before and after PSO Threshold-offset tuning. (a) Before tuning. (b) After tuning with objective (2). (c) After tuning with objective (4)
After applying the PSO Threshold Tuning routine with the objective function given by Eq. 8, the resulting Multiclass ROC and class Az values are shown in Fig. 2b and Table 2a. The geometric mean of the areas under the ROC curves, Az, was used as the objective function in this instance, and the tuning procedure raised the mean from 0.6294 to 0.7209. As shown in Table 2a, the tuning lowered the Az for the vehicle class by 0.3020, but it raised the Az for the bike class by 0.2037 and for the pedestrian class by 0.3378, which is more than the amount lost for the vehicle class. After applying the PSO Threshold Tuning routine with the objective given by Eq. 9, the resulting Multiclass ROC and P_d values are shown in Fig. 2c and Table 2b. The geometric mean of the true-positive rates P_d at a 10% false-alarm rate was used as the objective function in this instance, and the tuning procedure raised the mean value from 0.1604 to 0.2420. Here, the cost of losing 23% P_d for the vehicle class is offset by increases of more than 18% in P_d for both the pedestrian and bike classes.
4 Conclusion
We introduced in this paper a novel method to maximize the multiclass classifier performance by tuning the thresholds of the constituent pairwise binary classifiers using Particle Swarm Optimization. We showed that this post-processing step following classifier training improves the performance with respect to the area under the ROC curve or various true-positive rates for a given false-positive rate. In the process, we argued for the use of the Multiclass ROC and demonstrated the efficacy of this approach on a 4-class object classifier.
References 1. Eberhart, R.C., Shi, Y., Kennedy, J.: Swarm Intelligence. Academic Press, London (2001) 2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, Chichester (2001)
3. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Series in Statistics (2009) 4. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88, 303–338 (2010) 5. Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers, R., Boonstra, M., Korzhova, V., Zhang, J.: Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 319–336 (2009) 6. Ellis, A., Ferryman, J.M.: PETS2010 and PETS2009 evaluation of results using individual ground truthed single views. In: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 135–142 (2010) 7. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88, 303–338 (2010) 8. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: Computer Vision and Pattern Recognition (2009) 9. Oh, S., Perera, A., Cuntoor, N., Chen, C.C., Lee, J.T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L., Swears, E., Wang, X., Ji, Q., Reddy, K., Shah, M., Vondrick, C., Pirsiavash, H., Ramanan, D., Yuen, J., Torralba, A., Song, B., Fong, A., Roy-Chowdhury, A., Desai, M.: A large-scale benchmark dataset for event recognition in surveillance video. In: IEEE Computer Vision and Pattern Recognition (2011) 10. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
A Parameter-Free Locality Sensitive Discriminant Analysis and Its Application to Coarse 3D Head Pose Estimation A. Bosaghzadeh1 and F. Dornaika1,2 1 Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, San Sebastian, Spain 2 IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
Abstract. In this paper we propose a novel parameterless approach to discriminant analysis. Following the large-margin concept, the graph Laplacian is split into two components, a within-class graph and a between-class graph, to better characterize the discriminant property of the data. Our approach has two important characteristics: (i) while all spectral-graph-based manifold learning techniques (supervised and unsupervised) depend on several parameters that require manual tuning, ours is parameter-free, and (ii) it adaptively estimates the local neighborhood surrounding each sample based on data similarity. Our approach has been applied to the problem of model-less coarse 3D head pose estimation. It was tested on two databases, FacePix and Pointing'04, and compared with other linear techniques. The experimental results confirm that our method generally outperforms the existing ones. Although we concentrate in this paper on the coarse 3D head pose problem, the proposed approach could also be applied to other classification tasks for objects characterized by large variance in their appearance.
1 Introduction
In most computer vision and pattern recognition problems, the large number of sensory inputs, such as images and videos, is computationally challenging to analyze. In such cases it is desirable to reduce the dimensionality of the data while preserving the original information in its distribution, allowing for more efficient learning and inference. The fundamental issue in dimensionality reduction is how to model the geometric structure of the manifold and produce a faithful embedding for data projection. During the last few years, a large number of approaches have been proposed for computing the embedding of high-dimensional spaces. We categorize these methods by their linearity. The linear methods, such as Principal Component Analysis (PCA) [1] and Multidimensional Scaling (MDS) [2], are evidently effective in observing the Euclidean structure. PCA projects the samples along the directions of maximal variance and aims to preserve the Euclidean distances between the samples. Unlike PCA, which is unsupervised, Linear Discriminant Analysis (LDA) [3] is a supervised technique. One limitation of PCA and LDA is that they only see the linear
global Euclidean structure. However, recent research shows that the samples may reside on a nonlinear submanifold, which makes PCA and LDA inefficient. The nonlinear methods, such as Locally Linear Embedding (LLE) [4], Laplacian Eigenmaps [5], and Isomap [6], focus on preserving local structures. There is therefore considerable interest in geometrically motivated dimensionality reduction approaches. Many works have considered the case where data live on, or close to, a low-dimensional submanifold of the high-dimensional ambient space [4,6,7]; one then hopes to estimate geometrical and topological properties of the submanifold from random points lying on this unknown submanifold. Linear Dimensionality Reduction (LDR) techniques have become increasingly important in pattern recognition [8], since they permit a relatively simple mapping of data onto a lower-dimensional subspace, leading to simple and computationally efficient classification strategies. The goal of all dimensionality reduction techniques is to find a new set of features that represents the target concept in a more compact and robust way while also providing more discriminative information. Many dimensionality reduction techniques can be derived from a graph whose nodes represent the data samples and whose edges quantify the similarity among pairs of samples [9]. LPP is a typical graph-based LDR method that has been successfully applied to many practical problems; it is essentially a linearized version of Laplacian Eigenmaps [5]. In [10], the authors proposed a linear discriminant method called Average Neighborhood Margin Maximization (ANMM). It associates with every sample a margin, set to the difference between the average distance to heterogeneous neighbors and the average distance to homogeneous neighbors; the linear transform is then derived by maximizing the sum of the margins in the embedded space. A similar method based on similar and dissimilar samples was proposed in [11]. In many linear embedding techniques, the neighborhood relationship is measured by an artificially constructed adjacency graph. The most popular ways of constructing the adjacency graph are the K-nearest-neighbor and ε-neighborhood criteria. Once the adjacency graph is constructed, the edge weights are assigned by various strategies, such as 0-1 weights or the heat kernel function. Unfortunately, such an adjacency graph is constructed artificially in advance, and thus it does not necessarily uncover the intrinsic local geometric structure of the samples; the performance of the technique is seriously sensitive to the neighborhood size K. As can be seen, all graph-based linear techniques require several parameters that must be set in advance or tuned empirically using tedious cross-validation processes. Many existing works consider the neighborhood size a user-defined parameter, set in advance to the same value for all samples. Moreover, some discriminant linear techniques employ an additive objective function that includes a balance parameter that must also be determined. In [12], the authors proposed a method called Locality Sensitive Discriminant Analysis. It computes a linear mapping that simultaneously maximizes the local margin between heterogeneous samples and pushes the homogeneous samples closer to each other. In this paper, we introduce two main enhancements to the
algorithm proposed in [12]: (i) we adaptively estimate the local neighborhood surrounding each sample based on data density and similarity, and (ii) we use a quotient objective function to compute the linear embedding. These two enhancements make the algorithm parameter-free. In addition, we apply the proposed method to the problem of coarse 3D head pose estimation. The remainder of the paper is organized as follows. Section 2 describes the proposed parameter-free Locality Sensitive Discriminant Analysis. Section 3 presents the application, which deals with coarse 3D head pose estimation from images. Section 4 presents experimental results obtained with two databases, FacePix and Pointing'04. Throughout the paper, capital bold letters denote matrices and small bold letters denote vectors.
2 Proposed Locality Sensitive Discriminant Analysis

2.1 Two Graphs and Adaptive Set of Neighbors
We assume that we have a set of N labeled samples {x_i}_{i=1}^N ⊂ R^D. In order to discover both the geometrical and the discriminant structure of the data manifold, we build two graphs: the within-class graph G_w and the between-class graph G_b. Let l(x_i) be the class label of x_i. For each data point x_i, we compute two subsets, N_b(x_i) and N_w(x_i): N_w(x_i) contains the neighbors sharing the same label as x_i, while N_b(x_i) contains the neighbors having different labels. We stress that, unlike classical methods for neighborhood graph construction, our algorithm adapts the size of both sets according to the local sample point x_i and its similarities with the rest of the samples. Instead of using a fixed number of neighbors, each sample point x_i has its own adaptive set of neighbors, computed in two consecutive steps. First, the average similarity of the sample x_i over the whole data set is computed (Eq. (1)). Second, the sets N_w(x_i) and N_b(x_i) are computed using Eqs. (2) and (3), respectively.
AS(x_i) = (1/N) Σ_{k=1}^{N} sim(x_i, x_k)     (1)
In Eq. (1), sim(x_i, x_k) is a real value that encodes the similarity between x_i and x_k; it belongs to the interval [0, 1]. Simple choices for this function are the heat kernel and the cosine. A high value of AS(x_i) means that the sample has many similar (close) samples, whereas a very low value means that the sample has very few similar (close) samples.

N_w(x_i) = {x_j | l(x_j) = l(x_i), sim(x_i, x_j) > AS(x_i)}     (2)

N_b(x_i) = {x_j | l(x_j) ≠ l(x_i), sim(x_i, x_j) > AS(x_i)}     (3)
Equation (2) means that the set of within-class neighbors of the sample x_i, N_w(x_i), consists of all data samples that have the same label as x_i and a similarity higher than the average similarity associated with x_i. There is a similar
interpretation for the set of between-class neighbors N_b(x_i). From Equations (2) and (3) it is clear that the neighborhood size is not the same for every data sample. This mechanism adapts the set of neighbors according to the local density and the similarity between data samples in the original space. Since the concepts of similarity and closeness of samples are tightly related, one might conclude at first glance that our strategy is equivalent to using an ε-ball neighborhood. It is worth noting two main differences: (i) the use of an ε-ball neighborhood requires a user-defined value for the ball radius ε, and (ii) the ball radius is constant for all data samples, whereas in our strategy the threshold (1) depends on the local sample. Each of the graphs mentioned before, G_w and G_b, is characterized by its corresponding affinity (weight) matrix, W_w and W_b, respectively. These matrices are defined by the following formulas:

W_w,ij = sim(x_i, x_j)   if x_j ∈ N_w(x_i) or x_i ∈ N_w(x_j)
W_w,ij = 0               otherwise

W_b,ij = 1   if x_j ∈ N_b(x_i) or x_i ∈ N_b(x_j)
W_b,ij = 0   otherwise
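The adaptive neighborhoods and the two affinity matrices can be built as in the following sketch. It assumes samples are stored as columns of X (as in the text) and uses a cosine similarity rescaled to [0, 1] as sim(x_i, x_k); both the rescaling and the function name are our own choices.

```python
import numpy as np

def adaptive_affinities(X, labels):
    """Within-class and between-class affinity matrices W_w, W_b (Eqs. (1)-(3))."""
    labels = np.asarray(labels)
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    sim = 0.5 * (Xn.T @ Xn + 1.0)            # cosine similarity mapped into [0, 1]
    avg = sim.mean(axis=1)                   # AS(x_i), Eq. (1)
    same = labels[:, None] == labels[None, :]
    above = sim > avg[:, None]               # sim(x_i, x_j) > AS(x_i)
    nw = same & above                        # x_j in N_w(x_i), Eq. (2)
    nb = ~same & above                       # x_j in N_b(x_i), Eq. (3)
    Ww = np.where(nw | nw.T, sim, 0.0)       # symmetric: x_j in N_w(x_i) or x_i in N_w(x_j)
    Wb = np.where(nb | nb.T, 1.0, 0.0)
    np.fill_diagonal(Ww, 0.0)                # no self-edges (an assumption of this sketch)
    np.fill_diagonal(Wb, 0.0)
    return Ww, Wb
```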
2.2 Optimal Mapping
A linear embedding technique is described by a matrix transform that maps the original samples x_i into low-dimensional samples A^T x_i; the number of columns of A defines the new dimension. We aim to compute a linear transform A that simultaneously maximizes the local margins between heterogeneous samples and pushes the homogeneous samples closer to each other (after the transformation). Mathematically, this corresponds to:

min_A  (1/2) Σ_{i,j} ||A^T (x_i − x_j)||² W_w,ij     (4)

max_A  (1/2) Σ_{i,j} ||A^T (x_i − x_j)||² W_b,ij     (5)

Using simple matrix algebra, the above criteria become, respectively:

J_homo = (1/2) Σ_{i,j} ||A^T (x_i − x_j)||² W_w,ij     (6)
       = tr( A^T X (D_w − W_w) X^T A )                 (7)
       = tr( A^T X L_w X^T A )                         (8)

J_hete = (1/2) Σ_{i,j} ||A^T (x_i − x_j)||² W_b,ij     (9)
       = tr( A^T X (D_b − W_b) X^T A )                 (10)
       = tr( A^T X L_b X^T A )                         (11)
where X = (x_1, x_2, ..., x_N) is the data matrix, D_w denotes the diagonal weight matrix whose entries are the column (or row, since W_w is symmetric) sums of W_w, and L_w = D_w − W_w denotes the Laplacian matrix associated with the graph G_w. Given the two individual optimization objectives in Eq. (4) and Eq. (5), we may construct a difference criterion to maximize:

J = α J_hete − (1 − α) J_homo     (12)

where 0 < α < 1 is a tradeoff parameter. In practice, however, it is hard and tedious to choose an optimal value for this parameter. Therefore, instead of using the difference criterion (12), we formulate the objective as a quotient so that α is removed:

J = J_hete / J_homo = tr( A^T X L_b X^T A ) / tr( A^T X L_w X^T A ) = tr( A^T S̃_b A ) / tr( A^T S̃_w A )     (13)

where the symmetric matrix S̃_b = X L_b X^T denotes the locality-preserving between-class scatter matrix, and the symmetric matrix S̃_w = X L_w X^T denotes the locality-preserving within-class scatter matrix. The trace-ratio optimization problem (13) can be replaced by the simpler yet inexact trace form:

max_A  tr( (A^T S̃_w A)^{−1} (A^T S̃_b A) )     (14)

The above optimization problem has a closed-form solution due to its quadratic form: the columns of the sought matrix A are given by the generalized eigenvectors associated with the largest eigenvalues of the equation

S̃_b A = S̃_w A Λ

where Λ is the diagonal matrix of eigenvalues. In many real-world problems such as face recognition, both matrices X L_b X^T and X L_w X^T can be singular, since the number of images in the training set, N, is sometimes much smaller than the number of pixels in each image, D. To overcome the complication of singular matrices, the original data are first projected onto a PCA subspace or a random orthogonal space so that the resulting matrices X L_b X^T and X L_w X^T are non-singular.
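Given X, W_w and W_b, the projection matrix A of Eq. (14) can be obtained with a standard generalized eigensolver, as sketched below; it assumes the data have already been reduced (e.g. by PCA) so that the within-class scatter is non-singular, and the function name is ours.

```python
import numpy as np
from scipy.linalg import eigh

def lsda_projection(X, Ww, Wb, n_components):
    """Columns of A = generalized eigenvectors of S_b a = lambda S_w a (largest eigenvalues)."""
    Lw = np.diag(Ww.sum(axis=1)) - Ww        # within-class graph Laplacian
    Lb = np.diag(Wb.sum(axis=1)) - Wb        # between-class graph Laplacian
    Sw = X @ Lw @ X.T                        # locality-preserving within-class scatter
    Sb = X @ Lb @ X.T                        # locality-preserving between-class scatter
    vals, vecs = eigh(Sb, Sw)                # ascending generalized eigenvalues
    return vecs[:, ::-1][:, :n_components]   # keep eigenvectors of the largest eigenvalues
```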
3 Coarse 3D Head Pose

3.1 Background
The majority of work in 3D head pose estimation deals with tracking full rigid-body motion (6 degrees of freedom) for a limited range of motion (typically +/−45° out-of-plane) and relatively high-resolution images. Moreover, such systems
typically require a 3D model [13,14] as well as its initialization. There is a tradeoff between the complexity of the initialization process, the speed of the algorithm and the robustness and accuracy of pose estimation. Although the model-based systems can run in real-time, they rely on frame-to-frame estimation and hence are sensitive to drift and require relatively slow and non-jerky motion. These systems require initialization and failure recovery. For situations in which the subject and camera are separated by more than a few feet, full rigid body motion tracking of fine head pose is no longer practical. In this case, model-less coarse pose estimation can be used. It can be performed on a single image at any time without any model given that some pose-classified ground truth data are learned a priori. Coarse 3D pose estimation can play an important role in many applications. For instance, it can be used in the domain of face recognition either by using hierarchical models or by generating a frontal face image.
Fig. 1. Some samples from the FacePix (top) and Pointing'04 (bottom) data sets
3.2 Databases
We evaluate the proposed methods with experiments on two public face data sets for face recognition and pose estimation.
1. The FacePix database includes a set of face images with pose angle variations. It is composed of 181 face images (representing yaw angles from −90° to +90° in 1-degree increments) for each of 30 different subjects, giving a total of 5430 images. All the face images are 128 pixels wide and 128 pixels high, normalized such that the eyes are centered on the 57th row of pixels from the top and the mouth is centered on the 87th row. The upper part of Figure 1 provides examples extracted from the database, showing pose angles ranging from −90° to +90° in steps of 10°. In our work, we downsample the set and keep only 10 poses in steps of 20°.
2. The Pointing'04 head-pose image database consists of 15 sets of images for 15 subjects, wearing glasses or not and having various skin colors. Each set contains two series of 93 images of the same person at different poses (lower part of Figure 1). In our work, we combine the two series into one single data set so that we can carry out tests on random splits. The pose, or head orientation, is determined by the pan and tilt angles, which vary from −90° to 90° in steps of 15°. Each pose has 30 images. The ground-truth data for this database are not as accurate as those of the FacePix data set: the method used for generating this data set belongs to the directional suggestion category, which assumes that each subject's head is in the exact same physical location in 3D space and that persons can accurately direct their head towards an object. The effect of this limitation will be apparent in the experimental results obtained with the Pointing'04 data set.
3.3 Experimental Results and Method Comparison
As mentioned earlier, the problem of coarse 3D head pose estimation can be cast as a classification problem: the pose class of a test face image is estimated in the new low-dimensional space using the nearest neighbor classifier. For the FacePix database, we have 10 different classes, each with 30 subjects. For each pose, l images are randomly selected for training and the rest are used for testing, and for each given l we average the results over 14 random splits. For every split, the pre-stage of dimensionality reduction (classical PCA) retained the top eigenvectors corresponding to 95% of the total variability. In general, the recognition rate varies with the dimension retained by the embedding method; in all our experiments, we recorded the best recognition rate for each algorithm. Figure 2 depicts the recognition rates associated with two different criteria, the difference criterion (12) and the quotient criterion (14), when applied to the FacePix database. These are the average recognition rates over 14 random splits of the data, with the test sets formed by unseen subjects. The numbers of training images l in Figures (a), (b), (c), and (d) were, respectively, 5, 10, 15, and 20. As can be seen, the quotient criterion performed better than the difference criterion over all dimensions used. For the difference criterion, several trials were performed in order to choose the optimal value of the parameter α; the results in Figure 2 correspond to those giving the best recognition rate on the test sets. In the sequel, we report results obtained with the quotient criterion only.
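The classification step itself is a plain nearest-neighbor search in the embedded space, as in this short sketch (reusing the projection matrix A from the earlier sketch; the sample layout and names are our assumptions):

```python
import numpy as np

def classify_pose(A, X_train, y_train, x_test):
    """1-NN pose classification after projecting samples with A (columns of X = samples)."""
    Z = A.T @ X_train                        # embed the training samples
    z = A.T @ x_test                         # embed the probe
    dists = np.linalg.norm(Z - z[:, None], axis=0)
    return y_train[int(np.argmin(dists))]
```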
Table 1. Best average recognition accuracy (%) on the FacePix set over 14 random splits. Each column corresponds to a fixed number of training images l. The number in parentheses is the optimal dimensionality of the embedded subspace (at which the maximum average recognition rate was obtained).

Method                                 l = 25        l = 20        l = 15
PCA                                    87.0% (30)    86.2% (30)    83.9% (30)
LPP                                    83.2% (20)    79.9% (20)    77.8% (15)
ANMM                                   89.7% (15)    87.8% (10)    88.8% (10)
Proposed Method (fixed neighbors)      90.2% (25)    88.5% (20)    88.2% (20)
Proposed Method (adaptive neighbors)   91.7% (10)    89.6% (10)    88.1% (10)
Fig. 2. Average recognition rate for the quotient and difference criteria obtained with FacePix face data set. The number of training images in Figures (a), (b), (c), and (d) were respectively, 5, 10, 15, and 20.
Table 1 shows the recognition rates for the different algorithms and for different l. The algorithms are PCA, LPP, ANMM, the proposed algorithm with a fixed neighborhood size (fourth row), and the proposed algorithm with adaptive neighbor sets (fifth row).
Table 2. Best average recognition accuracy (%) on the Pointing'04 data set for pitch and yaw angles (over 10 random splits). The training sets contained 20 images.

Method                                 Pitch         Yaw
PCA                                    46.5% (70)    47.8% (70)
LPP                                    45.3% (40)    44.9% (20)
ANMM                                   48.8% (70)    50.3% (70)
Proposed Method (fixed neighbors)      45.1% (70)    44.7% (70)
Proposed Method (adaptive neighbors)   52.5% (50)    49.9% (30)
As can be seen from Table 1, our algorithm achieved a 91.7% recognition rate when 25 face images per pose/class were used for training, which is the best of the four algorithms (PCA, LPP, ANMM, proposed method). We stress that ANMM is one of the most powerful linear discriminant analysis methods, and the performance of the proposed method is very close to that of ANMM; however, unlike ANMM, which needs two user-defined parameters, our method does not require any parameter setting. As can also be seen, the variant with adaptive neighbors can be superior to the variant with a fixed neighborhood size (see the fourth and fifth rows). Table 2 shows the recognition rates for pitch and yaw angles obtained with PCA, LPP, ANMM, and the proposed method on the Pointing'04 data set. For these methods, the linear mapping was learned using the 93 classes (poses), and the recognition rates were computed separately for the pitch and yaw angles over all test images. The training set contained 20 images. As can be seen, our proposed method achieved the best performance. The recognition rates are relatively low because the ground-truth data associated with the Pointing'04 database are not accurate.
4 Conclusion
We have developed a discriminant linear local subspace learning method and applied it to coarse 3D head pose estimation. Unlike other graph-based linear embedding techniques, the proposed method does not need user-defined parameters. Experimental results demonstrate its advantage over some state-of-the-art solutions. Acknowledgment. This work was supported by the Spanish Government under the project TIN2010-18856.
References 1. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 2. Borg, I., Groenen, P.: Modern Multidimensional Scaling: Theory and Applications. Springer, Heidelberg (2005) 3. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, London (1990)
4. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000) 5. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15, 1373–1396 (2003) 6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000) 7. Saul, L.K., Roweis, S.T., Singer, Y.: Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155 (2003) 8. Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., Lin, S.: Graph embedding and extension: a general framework for dimensionality reduction. IEEE Trans. on Pattern Analysis and Machine Intelligence 29, 40–51 (2007) 9. Yan, S., Xu, D., Zhang, B., Zhang, H.J.: Graph embedding: A general framework for dimensionality reduction. In: Int. Conference on Computer Vision and Pattern Recognition (2005) 10. Wang, F., Wang, X., Zhang, D., Zhang, C., Li, T.: Marginface: A novel face recognition method by average neighborhood margin maximization. Pattern Recognition 42, 2863–2875 (2009) 11. Alipanahi, B., Biggs, M., Ghodsi, A.: Distance metric learning vs. Fisher discriminant analysis. In: AAAI Conference on Artificial Intelligence (2008) 12. Cai, D., He, X., Zhou, K., Han, J., Bao, H.: Locality sensitive discriminant analysis. In: International Joint Conference on Artificial Intelligence (2007) 13. Dornaika, F., Ahlberg, J.: Face and facial feature tracking using deformable models. International Journal of Image and Graphics 4, 499–532 (2004) 14. Dornaika, F., Davoine, F.: On appearance based face and facial action tracking. IEEE Transactions on Circuits and Systems for Video Technology 16, 1107–1124 (2006)
Image Set-Based Hand Shape Recognition Using Camera Selection Driven by Multi-class AdaBoosting Yasuhiro Ohkawa, Chendra Hadi Suryanto, and Kazuhiro Fukui Graduate School of Systems and Information Engineering, University of Tsukuba, Japan {[email protected],[email protected],kfukui@cs}.tsukuba.ac.jp
Abstract. We propose a method for image set-based hand shape recognition that uses the multi-class AdaBoost framework. The recognition of hand shape is a difficult problem, as a hand’s appearance depends greatly on view point and individual characteristics. Using multiple images from a video camera or a multiple-camera system is known to be an effective solution to this problem. In our proposed method, a simple linear mutual subspace method is considered as a weak classifier. Finally, strong classifiers are constructed by integrating the weak classifiers. The effectiveness of the proposed method is demonstrated through experiments using a dataset of 27 types of hand shapes. Our method achieves comparable performance to the kernel orthogonal mutual subspace method, but at a smaller computational cost.
1 Introduction

In this paper, we propose a hand shape recognition method that uses sets of image patterns captured by a multiple-camera system. By introducing camera selection based on the multi-class AdaBoost framework, the proposed method can classify nonlinear distributions of input images effectively, and its computational complexity is reduced because it is based only on linear classifiers. Hand gestures are often used in daily life to facilitate communication with other people; it is therefore expected that hand gestures can also enable more natural interaction between humans and computer systems. To recognize hand gestures automatically, recognition of the three-dimensional shape of the hand is the most elementary requirement. Many types of hand shape recognition methods have been proposed; they can be divided into two categories, model-based methods and appearance-based methods [1]. Model-based methods use a three-dimensional hand model for recognition [2,3]: they extract feature points such as edges and corners from hand images and match them to a three-dimensional hand model. For example, Imai has proposed a method for estimating hand posture in three dimensions by matching the edges extracted from a hand image to the silhouette generated from a typical hand model [3]. Although model-based methods are widely used in various trial systems, they often suffer from unstable matching and high computational complexity, since a hand is a complex three-dimensional object with 20 degrees of freedom [1].
Fig. 1. Conceptual diagram of MSM. The distributions of multiple-viewpoint image sets of hands are represented by linear subspaces, which are generated by PCA. The canonical angles between two subspaces are used as a measure of the similarity between the distributions.
On the other hand, appearance-based methods [4,5,6,7] classify a hand shape from its appearance, where an n×n pixel pattern is treated as a vector x in an n²-dimensional space. These methods can deal with variations of appearance due to changes of viewpoint and illumination and differences between individuals by preparing a static model representing these variations. The mutual subspace method (MSM) [8] is one of the most suitable and efficient appearance-based methods for recognizing hand shape; its novelty is its ability to handle multiple sets of images effectively. MSM represents a set of patterns {x} of each class by a low-dimensional linear subspace of the high-dimensional vector space using the Karhunen-Loève (KL) expansion, also known as principal component analysis (PCA). With this subspace-based representation, the similarity between two sets of patterns can be easily obtained from the canonical angles θ_i between subspaces, as shown in Fig. 1. MSM deals better with variations of appearance due to changes of viewpoint than conventional methods that use a single input image, such as the k-NN method. However, the classification ability of MSM declines considerably when the distribution of patterns has a nonlinear structure, such as that captured through a multiple-camera system. To overcome this problem, MSM has been extended to a nonlinear method called the kernel mutual subspace method (KMSM) [9,10]. Further, to boost classification ability, KMSM has been extended to the kernel orthogonal MSM (KOMSM) by adding an orthogonal transformation of the class subspaces [11]. The ability of KOMSM to classify multiple sets of image patterns is as good as or better than that of other extensions of MSM [12,13,14,15,16], and KOMSM has also been demonstrated to be effective for hand recognition [7]. However, KOMSM has the serious drawback that its computational cost and memory requirements increase in proportion to the number of learning patterns and classes. In particular, the generation of the orthogonal transformation matrix, which is an essential component of KOMSM, becomes almost impossible when these numbers are large. The problem is difficult even for the distribution of patterns from a single camera. Thus,
we can hardly apply KOMSM to a multiple-camera system, although the distribution of patterns obtained from the multiple-camera system contains much richer information about the hand shape. This problem of computational complexity cannot be completely solved even if the reduction method [7] based on k-means clustering or the incremental method [17] is applied. Accordingly, we propose an alternative approach based on the framework of ensemble learning [18] that does not use the kernel trick. In the proposed method, we regard a classifier based on the MSM as a weak classifier. When applying the framework of ensemble learning to our problem, the method of generating various types of weak MSM-based classifiers is an important issue to be considered. We are able to achieve better performance than that of the original MSM method by generating one classifier of the ensemble from each camera, but the performance is still far below that of nonlinear methods, such as KOMSM. This is because the classifiers generated from single cameras hold only information obtained from a local viewpoint; they do not exploit the combinations of cameras, which contain richer information about the distribution. In contrast, KOMSM is able to encode the complete appearance of the image patterns in the nonlinear subspace. Therefore, we consider generating classifiers from all possible combinations of the multiple cameras so that we can obtain classifiers that see richer pattern distributions through camera selection. Thus, the number of classifiers increases from n to 2^n − 1, where n is the number of cameras installed for ensemble learning. It is difficult to determine suitable dimensions for the input subspace and reference subspaces. Thus, we add dimension selection to the camera selection in the above framework. This additional process increases the number of combinations substantially. However, such combinations may include ineffective classifiers, and the computational cost of using all the combinations is very high. Therefore, we select the best combinations using multi-class AdaBoost [19]. The rest of this paper is organized as follows. In Section 2, we explain the method for camera selection based on multi-class AdaBoost. In Section 3, we explain the process flow of the proposed method. In Section 4, the effectiveness of our method is demonstrated through evaluation experiments with actual multiple-image sequences. Section 5 presents our conclusions.
2 Proposed Method Based on Multi-class AdaBoost

In this section, we first explain the construction of weak classifiers that are effective for image set-based recognition using multiple cameras. Then, we explain the recognition of multiple-view images based on the MSM. Finally, we propose a method for selecting valid weak MSM-based classifiers from all the possible classifiers using multi-class AdaBoost.

2.1 Generating Weak Classifiers

Figure 2 shows the concept of the proposed method for generating weak classifiers from combinations of five cameras. First, the hand shape images are captured by the multiple-camera system. Next, we construct sets of combined images from the
Fig. 2. Conceptual diagram of proposed method: Various weak classifiers are generated by changing the combinations of the cameras used for recognition and the dimensions of reference subspaces. Valid weak classifiers are selected using multi-class AdaBoost learning.
five cameras. Since we employ five cameras, the number of camera combinations is 31 (= 2^5 − 1). Finally, weak classifiers are constructed by employing MSM to classify the sets of the combined images with various combinations of subspace dimensions. As not all of the weak classifiers are useful for constructing a strong classifier, we use multi-class AdaBoost to select the valid ones.

2.2 Mutual Subspace Method

In MSM, the distributions of reference patterns and input patterns are represented by linear subspaces, which are generated by principal component analysis (PCA). Then, the canonical angles between the two subspaces are used as a measure of the similarity between the distributions.

Definition of canonical angles between two subspaces. The canonical angles can be calculated as follows. Given the M_1-dimensional subspace P_1 and the M_2-dimensional subspace P_2 in a D-dimensional feature space, the M_1 canonical angles {0 ≤ θ_1, ..., θ_{M_1} ≤ π/2} between P_1 and P_2 (for convenience M_1 ≤ M_2) are uniquely defined by

    cos^2 θ_i = max_{u_i ⊥ u_j, v_i ⊥ v_j, 1 ≤ i, j ≤ M_1, i ≠ j} (u_i · v_i)^2 / (||u_i||^2 ||v_i||^2),   (1)
where (·) denotes inner product and || · || denotes the norm of a vector.
Algorithm 1. Selection of weak MSM classifiers based on multi-class AdaBoost
1: Given example input subspaces P_1, ..., P_N and class labels c_1, ..., c_N, where c_n ∈ {1, ..., K}. F^(l) denotes the l-th weak classifier, which outputs a value in {1, ..., K}. I(·) denotes the indicator function.
2: Initialize the weights w_n = 1/N, n = 1, 2, ..., N.
3: for m = 1 to M do
4:   (a) Compute the weighted error of each weak classifier:
         err^(l) = Σ_{n=1}^{N} w_n I(c_n ≠ F^(l)(P_n)), l = 1, ..., L.
5:   (b) Select the weak classifier with the minimum error as the m-th weak classifier T^(m):
         T^(m) ← F^(arg min_l err^(l)).
6:   (c) Compute the reliability α^(m) from the weighted error err^(m) of the m-th weak classifier T^(m):
         α^(m) = log((1 − err^(m)) / err^(m)) + log(K − 1).
7:   (d) Update the weights:
         w_n = w_n exp(α^(m) I(c_n ≠ T^(m)(P_n))), n = 1, ..., N.
8:   (e) Normalize the weights:
         w_n ← w_n / Σ_i w_i.
9: end for
10: Output the strong classifier:
         C(P) = arg max_k Σ_{m=1}^{M} α^(m) I(T^(m)(P) = k).
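As an illustration, Algorithm 1 can be realized along the following lines in Python (a minimal sketch; the weak-classifier interface, the zero-based class labels, and the small safeguard against log(0) are our assumptions rather than details given in the paper):

    import numpy as np

    def select_weak_classifiers(weak_classifiers, subspaces, labels, K, M):
        # weak_classifiers: list of callables F[l](P) -> class label in {0, ..., K-1}
        # subspaces: list of N example input subspaces, labels: array of N class labels
        N = len(subspaces)
        w = np.full(N, 1.0 / N)                                   # step 2: uniform weights
        preds = np.array([[F(P) for P in subspaces] for F in weak_classifiers])
        selected, alphas = [], []
        for _ in range(M):
            miss = preds != np.asarray(labels)[None, :]           # indicator of misclassification
            errs = miss.astype(float) @ w                         # step (a): weighted errors
            best = int(np.argmin(errs))                           # step (b): minimum-error classifier
            err = min(max(errs[best], 1e-12), 1 - 1e-12)          # safeguard against log(0)
            alpha = np.log((1.0 - err) / err) + np.log(K - 1)     # step (c): reliability
            w = w * np.exp(alpha * miss[best])                    # step (d): reweight misclassified examples
            w = w / w.sum()                                       # step (e): normalize
            selected.append(best)
            alphas.append(alpha)
        return selected, alphas

    def strong_classify(P, weak_classifiers, selected, alphas, K):
        # Final output: C(P) = argmax_k sum_m alpha_m * I(T_m(P) = k)
        votes = np.zeros(K)
        for idx, a in zip(selected, alphas):
            votes[weak_classifiers[idx](P)] += a
        return int(np.argmax(votes))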
A practical method of finding the canonical angles is by computing the M_1 × M_2 matrix

    C = V_1^T V_2,   (2)

where V_1 = [v_{11}, ..., v_{1M_1}], V_2 = [v_{21}, ..., v_{2M_2}], and v_{1s} and v_{2s} denote the s-th D-dimensional orthonormal basis vectors of the subspaces P_1 and P_2, respectively. The canonical angles {θ_1, ..., θ_{M_1}} are the arccosines {arccos(κ_1), ..., arccos(κ_{M_1})} of the singular values {κ_1, ..., κ_{M_1}} of the matrix C.

Similarity between two subspaces. From the canonical angles, we calculate the similarity between two subspaces as S = (1/M_1) Σ_{m=1}^{M_1} cos^2 θ_m. If the two subspaces coincide completely, S is 1.0, since all canonical angles are 0. The similarity S becomes smaller as the two subspaces separate. Finally, the similarity S is zero when the two subspaces are orthogonal to each other.

2.3 Selection of Valid MSM-Based Weak Classifiers by Multi-class AdaBoost

Five cameras were used for the recognition. Therefore, the number of multiple-camera combinations is 31 (= 2^5 − 1). Among those combinations, unnecessary weak classifiers are discarded and valid weak classifiers are selected by multi-class AdaBoost to generate the strong classifier. Algorithm 1 shows the detailed process of MSM-based weak classifier selection based on multi-class AdaBoost.
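A minimal sketch of this similarity computation is given below (assuming the basis vectors are stored as matrix columns and that the subspaces are obtained by PCA without mean subtraction, as is common in subspace methods; neither detail is fixed by the text):

    import numpy as np

    def subspace_basis(patterns, dim):
        # patterns: (num_samples, D) array of feature vectors; returns a D x dim orthonormal basis.
        _, _, Vt = np.linalg.svd(np.asarray(patterns, dtype=float), full_matrices=False)
        return Vt[:dim].T

    def msm_similarity(V1, V2):
        # Canonical angles via Eq. (2): the singular values of C = V1^T V2 are cos(theta_i);
        # the similarity is S = (1/M1) * sum_m cos^2(theta_m).
        C = V1.T @ V2
        kappa = np.clip(np.linalg.svd(C, compute_uv=False), 0.0, 1.0)
        return float(np.mean(kappa ** 2))

For example, msm_similarity(subspace_basis(X_ref, 45), subspace_basis(X_in, 2)) would compare a 45-dimensional reference subspace with a 2-dimensional input subspace (the array names here are purely illustrative).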
Fig. 3. The flow of the hand shape recognition process based on the proposed method.
3 Flow of Hand Shape Recognition Based on the Proposed Method

Figure 3 shows the flow of the recognition process based on the proposed framework. The whole process is divided into a learning phase and a recognition phase.

Learning phase
L-1: Collect n image sequences of each hand shape using an n-camera system, where n is the number of cameras installed and the number of classes is C.
L-2: Generate N (= (2^n − 1)·d) weak classifiers while changing both the combination of cameras used for inputting the image sequence and the dimensions of the input subspace and reference subspaces, where d is the number of combinations of the dimensions (a sketch of this enumeration is given after this list).
L-3: Select the M (≪ N) weak MSM classifiers and determine their weights by using the multi-class AdaBoost procedure shown in Algorithm 1.

Recognition phase
R-1: Input n image sequences of an unknown hand shape using the n-camera system.
R-2: Generate M (≪ N) input subspaces with the M combinations of the cameras and the dimensions of input and reference subspaces which correspond to the weak classifiers selected in L-3.
R-3: Calculate the similarities between the input subspace and all class subspaces using a weak MSM classifier for each combination.
R-4: Vote the weight for the class with the highest similarity among all those obtained. Do this for all the combinations.
R-5: Classify the input set into the class with the highest total vote.
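A possible enumeration of the N = (2^n − 1)·d weak-classifier configurations of step L-2 is sketched below; the dictionary layout and parameter names are illustrative assumptions, and the dimension values are those used later in Experiment-II:

    from itertools import combinations

    def weak_classifier_configs(num_cameras, input_dims, ref_dims):
        # Every non-empty camera subset combined with every (input-dim, reference-dim) pair.
        configs = []
        for r in range(1, num_cameras + 1):
            for subset in combinations(range(1, num_cameras + 1), r):
                for d_in in input_dims:
                    for d_ref in ref_dims:
                        configs.append({"cameras": subset, "input_dim": d_in, "ref_dim": d_ref})
        return configs

    # Five cameras with input dimensions 1-3 and reference dimensions 5, 10, ..., 90
    # give 31 x 3 x 18 = 1674 configurations, as in Experiment-II.
    configs = weak_classifier_configs(5, input_dims=(1, 2, 3), ref_dims=range(5, 95, 5))
    assert len(configs) == 1674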
Fig. 4. Multiple-camera system
Fig. 5. 27 types of hand shapes
4 Experiments

4.1 Evaluation Data

We constructed a multiple-camera system to collect the evaluation images from seventeen subjects. The multiple-camera system consists of five IEEE1394 Point Grey Flea 2 cameras, as shown in Fig. 4. The position of each camera is adjusted in such a manner that various viewpoints of the hand shape can be captured. The angle between the optical axes of the center camera and the other cameras was set to 18 degrees. The distance between the center camera and the other cameras was set to 21 cm, and the distance from the center camera to the hand of a subject was about 40 cm. To obtain more heterogeneous views of the hand shape, the subjects rotated their hands to the left and right at a constant slow speed during the capture process. Using this multiple-camera setup, we collected the 27 types of hand shapes shown in Fig. 5. The total number of collected images is 123000 (=90 frames × 5 cameras ×
Fig. 6. Images of the same hand shape collected from 17 subjects

Table 1. Results of Experiment-I

Methods      ER [%]   EER [%]
Naive MSM    10.54     4.91
Voting-5      9.47     3.24
Voting-31     8.82     2.42
27 shapes × 17 subjects). Figure 6 shows the various appearances of the same type of hand shape collected from the 17 subjects. Figure 7 shows the sequential images captured by the five cameras. We cropped the hand regions using skin color information and reduced the size to 32 × 32 pixels. Next, we extracted 140-dimensional feature vectors using higher-order local auto-correlation [20] from the four-level pyramid structure of the input image. Sixteen subjects were used for learning, and one subject was used for evaluation. We repeated the experiment 17 times (once for each of the 17 subjects) and the average was taken as the experimental result. We divided the 90 test images into 15 sets, each containing 6 images. Classification is done 6885 (=15 sets × 27 shapes × 17 subjects) times for each experiment. We adopt the error rate (ER) and equal error rate (EER) for performance evaluation.

4.2 Experiment-I

This experiment evaluated the effectiveness of the multiple-camera selection by using naive MSM and voting classification. The dimensions of the reference subspaces are set from 1 to 15, and those of the input subspaces from 1 to 5. Since each method in this experiment requires a different optimized subspace dimension, we use the optimized subspace dimension in the recognition phase. The experimental results are shown in Table 1. In the Voting-5 method, the voting is done using the five individual cameras only, while the Voting-31 method uses all 31 combinations of the five cameras. The experimental results show that by utilizing all of the possible camera combinations, the recognition performance is notably improved.
Fig. 7. Example of sequential images captured by the five cameras (Camera 1 to Camera 5)

Table 2. Results of Experiment-III

Num. Patterns   Method             ER [%]   EER [%]   Recog. Time [ms]
194400          5-Camera Ranking    9.47     3.24        20
                Proposed            8.73     2.16       132
                KMSM                7.7      4.48      2044
                KOMSM               7.33     2.11      2728
648000          5-Camera Ranking    9.21     2.74        21
                Proposed            7.79     2.03       134
                KMSM                 -        -          -
                KOMSM                -        -          -
4.3 Experiment-II

In this experiment, we evaluated the relationship between the number of weak classifiers and the classification performance and computational cost of the recognition process. Various weak classifiers are generated not only by changing the camera selection, but also by changing the dimensions of input and reference subspaces. The dimension of the input subspace is set to 1, 2, or 3. The dimensions of reference subspaces are set from 5 to 90 in increments of 5. Thus, the total number of weak classifiers is 1674 (= 3 × 18 × 31). The experimental results are shown in Fig. 8. The figure shows that the performance is notably improved by increasing the number of multi-class AdaBoost weak classifiers. When the number of weak classifiers reaches 30, both the ER and EER converge. The number of weak classifiers and the recognition time are linearly related.

4.4 Experiment-III

In this experiment, we compare the proposed method with the MSM, KMSM, and KOMSM classifiers. Since KMSM and KOMSM use the kernel trick, calculation
Fig. 8. Results of Experiment-II: ER and EER [%] and recognition time [ms] plotted against the number of weak classifiers
becomes impossible when the number of learning patterns is substantially increased, due to the complexity and the large memory requirement of the kernel trick computation. In fact, in our experimental setup, we are unable to add more learning patterns for KOMSM on a PC with 16 GB of memory. On the other hand, the proposed method does not have this limitation on the number of learning patterns. To show the advantages of the proposed method, we performed another experiment in which the number of learning patterns is substantially increased. We collected 481950 new hand shape images (= 210 frames × 5 cameras × 27 shapes × 17 subjects) to be used as additional learning patterns. The experimental results are shown in Table 2. As with Experiment-I, the 5-Camera Ranking methods shown in Table 2 do not use combinations of the five cameras. The proposed method uses all of the possible five-camera combinations as weak classifiers and employs multi-class AdaBoost to generate a strong classifier from them. This experiment demonstrates that the proposed method is about 20 times faster than KOMSM while having comparable performance. In the experiment in which the number of learning patterns is substantially increased, the EER of the proposed method is better than that of KOMSM, while the recognition time is still much shorter than that of KOMSM. Next, we show the details of the kinds of weak classifiers that are selected by multi-class AdaBoost. As explained previously, the weak classifiers are generated from all possible camera combinations and various input and reference subspace dimensions. Figure 9 shows the top eight selected weak classifiers, arranged from the highest weight, 5.4 (leftmost), to the lowest, 1.1 (rightmost). As an example from the figure, the first weak classifier selected by multi-class AdaBoost chooses the upper, left, and right cameras with reference subspace dimension 45 and input dimension 2. It is worth noting that among the total of 510 selections, weak classifiers using all five cameras are never selected by multi-class AdaBoost. Another interesting fact is that the center camera is less likely to be selected.
Camera Combination:  (selected/unselected-camera icons not reproduced)
Weight:      5.4   2.8   2.4   2.0   1.8   1.7   1.4   1.1
Ref. Dim:     45    85     5    80    60    10    25    45
Input Dim:     2     2     1     1     1     1     1     1
Fig. 9. Top eight weak classifiers selected by multi-class AdaBoost
5 Conclusion

This paper proposes an image set-based hand shape recognition method using camera selection driven by multi-class AdaBoost. In the proposed method, we consider a simple linear mutual subspace method as a weak classifier, and construct a strong classifier by integrating these weak classifiers. The obtained strong classifier could outperform one of the state-of-the-art nonlinear kernel methods, KOMSM, without using the kernel trick and at a smaller computational cost.

Acknowledgment. This work was supported by KAKENHI (22300195).
References 1. Erol, A., Bebis, G., Nicolescu, M., Boyle, R., Twombly, X.: Vision-based hand pose estimation: A review. Computer Vision and Image Understanding 108, 52–73 (2007) 2. Stenger, B., Thayananthan, A., Torr, P., Cipolla, R.: Model-based hand tracking using a hierarchical bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1372–1384 (2006) 3. Imai, A., Shimada, N., Shirai, Y.: Hand posture estimation in complex backgrounds by considering mis-match of model. In: Asian Conference on Computer Vision, pp. 596–607 (2007) 4. Martin, J., Crowley, J.: An appearance-based approach to gesture-recognition. Image Analysis and Processing, 340–347 (1997) 5. Birk, H., Moeslund, T., Madsen, C.: Real-time recognition of hand alphabet gestures using principal component analysis. In: Scandinavian Conference on Image Analysis, vol. 1, pp. 261–268 (1997) 6. Cui, Y., Weng, J.: Appearance-based hand sign recognition from intensity image sequences. Computer Vision and Image Understanding 78, 157–176 (2000) 7. Ohkawa, Y., Fukui, K.: Hand shape recognition based on kernel orthogonal mutual subspace method. In: IAPR Conference on Machine Vision Applications, pp. 222–225 (2009) 8. Yamaguchi, O., Fukui, K., Maeda, K.: Face recognition using temporal image sequence. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 318–323 (1998) 9. Sakano, H., Mukawa, N., Nakamura, T.: Kernel mutual subspace method and its application for object recognition. Electronics and Communications in Japan 88, 45–53 (2005) 10. Wolf, L., Shashua, A.: Learning over sets using kernel principal angles. The Journal of Machine Learning Research 4, 913–931 (2003) 11. Fukui, K., Yamaguchi, O.: The kernel orthogonal mutual subspace method and its application to 3d object recognition. In: Asian Conference on Computer Vision, pp. 467–476 (2007)
12. Fukui, K., Yamaguchi, O.: Face recognition using multi-viewpoint patterns for robot vision. In: 11th International Symposium of Robotics Research, pp. 192–201 (2003) 13. Kawahara, T., Nishiyama, M., Kozakaya, T., Yamaguchi, O.: Face recognition based on whitening transformation of distribution of subspaces. In: Workshop on Asian Conference on Computer Vision, Subspace 2007, pp. 97–103 (2007) 14. Li, X., Fukui, K., Zheng, N.: Image-set based face recognition using boosted global and local principal angles. In: Asian Conference on Computer Vision, pp. 323–332 (2009) 15. Kim, T., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1005–1018 (2007) 16. Kim, T., Arandjelovic, O., Cipolla, R.: Boosted manifold principal angles for image set-based recognition. Pattern Recognition 40, 2475–2484 (2007) 17. Chin, T., Suter, D.: Incremental kernel principal component analysis. IEEE Transactions on Image Processing 16, 1662–1674 (2007) 18. Bishop, C.: Pattern recognition and machine learning. Springer, New York (2006) 19. Zhu, J., Rosset, S., Zou, H., Hastie, T.: Multi-class adaboost. Technical report, Department of Statistics, University of Michigan 1001 (2006) 20. Otsu, N., Kurita, T.: A new scheme for practical flexible and intelligent vision systems. In: IAPR Workshop on Computer Vision, pp. 467–476 (1988)
Image Segmentation Based on k-Means Clustering and Energy-Transfer Proximity

Jan Gaura, Eduard Sojka, and Michal Krumnikl

VŠB – Technical University of Ostrava, Faculty of Electrical Engineering and Computer Science, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic
{jan.gaura,eduard.sojka,michal.krumnikl}@vsb.cz
Abstract. In image segmentation, measuring the distances is an important problem. The distance should tell whether two image points belong to a single or, respectively, to two different image segments. Although the Euclidean distance is often used, the disadvantage is that it does not take into account anything that happens between the points whose distance is measured. In this paper, we introduce a new quantity called the energy-transfer proximity that reflects the distances between the points on the image manifold and that can be used in image-segmentation algorithms. In the paper, we focus especially on its use in the algorithm that is based on k-means clustering. The needed theory as well as some experimental results are presented.
1 Introduction
Image segmentation plays an important role in many image-processing applications. Many image-segmentation algorithms have been developed to date. Despite this fact, new algorithms are constantly being sought, since image segmentation is a difficult problem whose successful solution is necessary for the subsequent image-processing steps. In this paper, we are concerned with the clustering algorithm called k-means [1,2,3]. Although this algorithm was not originally developed specifically for image processing, it has been adopted by the computer vision community and is still in use today [4]. The k-means algorithm requires a priori knowledge of the number of clusters (k) into which the image pixels should be grouped. Each pixel of the image is repeatedly and iteratively assigned to the cluster whose centroid is closest to the pixel. The centroid of each cluster is determined on the basis of the pixels that have been assigned to that cluster. Both deciding the membership of pixels in the clusters and computing the centroids are based on computing distances. The Euclidean distance is used most frequently since its computation is simple. The problem is that the use of the Euclidean distance may result in mistakes in the final image segmentation, as will be explained in the next section. Therefore, other metrics should also be considered. Asgharbeygi and Maleki [5] proposed the use of the geodesic distance, Kashima et al. [6] used the L1 distance. Naturally, the use of different distances
(other than Euclidean) is not limited to k-means clustering. Alternative metrics can be used in a variety of image-segmentation algorithms. Hosni et al. [11], for example, use the geodesic distance in the mean-shift algorithm. In this paper, we propose the use of a new energy-transfer proximity (proximity is the opposite of distance; a long distance is the same as a small proximity) that was introduced in [7]. The energy-transfer proximity has been developed to overcome some shortcomings of the Euclidean, geodesic, and resistance distances. It does not only consider the geometrical distance between two points on the image manifold, but also takes into account the area of the image segments in which the points are lying. Another advantage is that it makes it possible to measure the distances not only between two points (pixels), but also between a set of points and a point, or between two point sets. This property can be advantageously exploited in some algorithms, including the k-means algorithm, in which it can be used for measuring the distances between the clusters and other pixels in the image. The paper is organized as follows. In Section 2, we discuss the problems connected with measuring the distances in image-segmentation algorithms. In Section 3, the new energy-transfer proximity is introduced and the differences in comparison with the known diffusion distance are explained. The classic k-means image-segmentation algorithm is briefly summarised in Section 4. Our modification of the algorithm based on the use of the energy-transfer proximity is presented in Section 5. Experimental results are shown in Section 6.
2 On Measuring Distances in Image-Segmentation Algorithms
In image segmentation, measuring the distances is an important problem. The distance should tell whether two image points belong to a single or, respectively, to two different image segments. Although the Euclidean distance is often used, the disadvantage is that it does not take into account anything that happens between the points whose distance is measured (Figs. 1a, 1b). Since the input data may define certain manifolds in some space (e.g., the brightness function creates such a manifold), the distance measured along this manifold can be used instead of the straight-line Euclidean distance. Examining the manifolds that are defined by data now seems to be a recognised direction in data analysis. The geodesic distance is a well-known metric that is usually reported in this context. Unfortunately, its use is not problem-free since it only takes into account one shortest path and does not consider the “width” of the connection. False narrow paths, occurring due to erroneous data, may be detected (Fig. 1c). Recently, a resistance distance was introduced by Klein and Randic [16] and further developed, e.g., by Babic et al. [12]. It is defined as the resistance between two points in the resistor network that corresponds to the image grid. The resistance distance measures, in some way, the strength of the connection between the nodes. If many conductive paths exist between two points (convincing connection), the conductivity between them is high. Unfortunately, the resistance
Fig. 1. On some problems arising when measuring the distances between two points (A, B) in images: Various situations (a – d) are discussed in text
distance may introduce an unwanted ordering of distance values, e.g., it may report the distance from Fig. 1a as the shortest, then the distance from Fig. 1c, and, finally, the distance from Fig. 1d. It follows that none of the metrics discussed so far is fully satisfactory. Intuitively, it seems that a tool for measuring the sizes is missing in them. The width of the channel from Fig. 1c, for example, should be measured relatively with respect to the sizes of the areas on both its ends. The problem that different values of the resistance distance are reported for the situations in Figs. 1a, 1d can also be viewed as a problem whose roots lie in the absence of measuring the sizes. Generally speaking, the capability of measuring the sizes may be achieved by taking another dimension into account, which may be the time. The use of the diffusion equation is known in this regard. By introducing the time, a certain transition state is solved instead of the steady state. The quantity that is found by making use of the diffusion equation (e.g., concentration, amount of heat, charge, etc.) depends on time. The time may be seen as a necessary parameter that specifies the properties of the desired segmentation. A longer time indicates that only areas of bigger sizes are expected (i.e., the smaller areas should be regarded as unwanted details). Several metrics based on solving the diffusion equation have been proposed. The diffusion distance (see the next section) may be regarded as a basic approach from which several further distances have been derived, e.g., the commute-time distance (Fouss et al. [14]), biharmonic distance (Lipman et al. [18]), and kernel distance (Joshi et al. [15]). The common feature of all these methods is that they are based on the diffusion equation. The equation can be solved either by iteratively integrating the equation itself or via spectral decomposition of the Laplacian matrix. In the latter case, a map can also be created in such a way that the Euclidean distance in the map corresponds to a distance measured on the manifold. We note that the use of the diffusion equation is not new to digital image processing; it stands behind, e.g., the Perona-Malik [19] image filtering method. The energy-transfer proximity we propose to use in this paper falls into the family of quantities that are based on the diffusion equation. It differs, however, substantially from the distances that were mentioned above. The motivation for introducing a new distance was twofold: (i) It seems that the distances mentioned above are not the most meaningful for segmenting images.
(ii) In some algorithms, measuring the distances between two areas (i.e., not only between two points) would be helpful. Although the above distances can also be generalised to such cases, the corresponding generalisations do not give values that are meaningful in the context of image-segmentation algorithms. The details can be found in the next section.
3 Diffusion Distance versus Energy-Transfer Proximity
In this section, we introduce a new quantity, called the energy-transfer proximity, that is based on the diffusion equation. For convenience of the reader, we start from the overview of the diffusion distance. The energy-transfer proximity will be introduced afterwards. Its properties will also be discussed. Given a compact Riemannian manifold M. Let f(x, t) be the amount of heat (generally, other substances can be considered instead of heat) at a point x of M and at a time t. Moreover, some initial distribution f(x, t = 0) is known. The function f(x, t) satisfies the diffusion equation ∂f(x, t)/∂t = −Δ_M f(x, t), where Δ_M stands for the Laplace-Beltrami operator of M. For the discrete case (the manifold is approximated by a mesh), the equation can be rewritten as

    (∂/∂t + L) f(t) = 0,   (1)

where L is a Laplacian matrix of the mesh and the heat-amount vector f(t) = (f_1(t), ..., f_n(t)) is indexed by the nodes of the mesh. We note that the entries of L may be chosen as l_{i,j} = e^{−c_{ij}^2/(2σ^2)}, where c_{ij} stands for the contrast between the pixels (greyscale or colour) and σ is a parameter; the neighbouring pixels are considered. The solution of Eq. (1) can be written in the form of

    f(t) = H(t) f(0),   (2)
where H(t) is a diffusion operator (a matrix). The entries of H(t) will be denoted by h_t(·, ·). It can be shown that the following is required for h_t(p, q) to satisfy Eq. (1):

    h_t(p, q) = Σ_{k=1}^{n} e^{−t λ_k} u_{pk} u_{qk},   (3)

where u_{pk} stands for the p-th entry of the k-th eigenvector of L and λ_k is the k-th eigenvalue (the smallest eigenvalue of L is λ_1 = 0). The entry h_t(p, q) expresses the amount of heat that is transported, by the diffusion process, from the q-th node into the p-th node (or vice versa, since h_t(p, q) = h_t(q, p)) during the time interval [0, t]. The square of the diffusion distance between the nodes p, q is defined as the sum of the squared differences of the heat transported from p and q, respectively, to all other nodes (i.e., image pixels; A stands for the set of all nodes) by the equation

    D_t^2(p, q) = Σ_{r∈A} [h_t(p, r) − h_t(q, r)]^2.   (4)
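To make Eqs. (1)–(4) concrete, the following sketch builds the Laplacian of a 4-connected greyscale image with the weights given above and evaluates the diffusion distance through a dense eigendecomposition; this is only a small-scale illustration and assumes the standard L = D − W construction, which the text does not state explicitly:

    import numpy as np

    def image_laplacian(img, sigma):
        # Graph Laplacian L = D - W with w_ij = exp(-c_ij^2 / (2 sigma^2)) on 4-neighbour edges.
        img = np.asarray(img, dtype=float)
        h, w = img.shape
        idx = np.arange(h * w).reshape(h, w)
        W = np.zeros((h * w, h * w))
        for a, b in ((idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])):
            i, j = a.ravel(), b.ravel()
            c = img.ravel()[i] - img.ravel()[j]                  # contrast between neighbouring pixels
            W[i, j] = W[j, i] = np.exp(-c ** 2 / (2.0 * sigma ** 2))
        return np.diag(W.sum(axis=1)) - W

    def diffusion_distance(L, p, q, t):
        # Eq. (3): h_t(p, q) = sum_k exp(-t lambda_k) u_pk u_qk; Eq. (4): D_t(p, q).
        lam, U = np.linalg.eigh(L)                               # L is symmetric positive semi-definite
        H = (U * np.exp(-t * lam)) @ U.T                         # full heat-kernel matrix h_t(., .)
        return float(np.sqrt(np.sum((H[p] - H[q]) ** 2)))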
After some effort, the following equivalent expression can be deduced by making use of Eqs. (3), (4):

    D_t^2(p, q) = h_{2t}(p, p) + h_{2t}(q, q) − 2 h_{2t}(p, q).   (5)
From the notation introduced before, it should be clear that the first term on the right-hand side has the meaning of the amount of heat that will remain, after the time of 2t, in p providing that the diffusion proceeds from p (autodiffusion on p). The second term has a similar meaning, but for q. The third term is the amount of heat that is transported from p to q or vice versa. The distance between two sets of nodes, denoted by P and Q, respectively, may be simply introduced as follows:

    D_t^2(P, Q) = Σ_{r∈A} [ Σ_{p∈P} h_t(p, r) − Σ_{q∈Q} h_t(q, r) ]^2
                = Σ_{p,p′∈P} h_{2t}(p, p′) + Σ_{q,q′∈Q} h_{2t}(q, q′) − 2 Σ_{p∈P, q∈Q} h_{2t}(p, q),   (6)
where the first sum in the square brackets stands for the amount of heat that was transported from P to r; similarly for the second sum, which is the heat transported from Q to r. The second equivalent expression can be deduced again after some effort, similarly as in the previous case. We claim that the diffusion distances introduced in Eqs. (4) and (6) do not meet all the requirements on how distances should be measured in image segmentation. Let us mention at least the following two problems. Say that we have computed a certain value of the distance. In the final values obtained from Eqs. (4) and (6), there is no clue for judging whether any heat at all was transferred between p and q; if nothing was transferred, the value of the distance cannot express anything about whether or not p and q should belong to one image segment. Instead, only the conductivities of the neighbourhoods of p and q are measured in this case, and a small value of distance only indicates that p and q both lie in areas of more or less constant brightness (high conductivity); not necessarily that the brightness of the areas is the same and that the points lie in one image segment. One could say that this can be influenced by the choice of the time interval. In real-life images, however, segments of various sizes must be expected. Therefore, it is not possible to choose one value of time that would solve the problem. Another problem is with measuring the distance between two sets of nodes by making use of Eq. (6). The value of zero is obtained only in the case that P = Q. The value of distance increases (among others) with the increasing difference of the sizes of the sets. If the sizes are different, a nonzero value is obtained even for t = ∞. In the image-segmentation algorithms that are based on merging, for example, such a property is undesirable. The distance should indicate whether or not two sets are to be merged together despite the fact that their sizes are different. Moreover, say that we have an algorithm that is based on determining
the membership of the pixels to the successively growing areas. The use of the distance from Eq. (6) would lead to unwanted behaviour of the algorithm: the membership to the small areas would be preferred due to the similarity in size. In order to overcome the problems mentioned above, we introduce another quantity, called the energy-transfer proximity, that is based on the diffusion equation. Say that the proximity/distance between p and q is to be measured. The main changes (compared with the diffusion distance) are the following: (i) The unit temperature level is applied to p and held there for the whole time of the measurement. (ii) The value of the oriented proximity from p to q is measured as the amount of energy that is transferred from p to q during a certain chosen time. For showing the connections and differences of the new quantity with the diffusion distance, we start from Eq. (1), which must now be written in the more general form

    (C ∂/∂t + L) f(t) = 0,   (7)

where C is a diagonal matrix of capacities, which was not needed before because it was possible to consider unit capacities in all nodes. In this case, by making use of the capacities, we can simulate the fact that the unit value should be applied not only at the beginning, as in the case of the diffusion distance, but also during the whole time of measurement. This can be simply done in such a way that, at the points at which the unit level is to be applied and held, we assume an infinite capacity. The infinite capacity makes it possible for the energy to flow out from the node without decreasing the level at the node. Since the matrix of capacities can be inverted, the problem with capacities can be transformed to the previous one that was described by Eq. (1). The difference is now that we have C^{-1}L instead of L. Since C^{-1}L is not symmetric (in contrast with L), we obtain an H(t) that is non-symmetric. It follows that the proximity from p to q generally differs from the proximity in the opposite direction. It can also be seen that different positions of the energy source lead to the need for solving the eigenvalue problem for different matrices. On one hand, this can be regarded as a sign that a qualitatively different value is computed. On the other hand, it shows that the computation via matrix spectral decomposition will hardly be usable in practice, i.e., only the iterative integration of the diffusion equation itself remains as a practical computational method. For the image-segmentation algorithms, the energy-transfer proximity has the following advantages: (i) If there is no energy transfer between p and q (or P, Q), a zero proximity (infinite distance) is reported. In this way, the algorithm is informed that the value should not be used for making decisions. Another measurement is required that can be carried out later on, after the conditions are more favourable. (ii) As the size of the area that is a source of energy (say P) increases, the proximity from P to Q increases. In the algorithms that are based on the membership decisions for particular pixels, this seems to be a kind of an acceptable a priori
expectation; the pixels that have not been decided yet will more often fall into the big existing areas than to the areas that are small.
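A possible discrete realization of the energy-transfer proximity is sketched below; the explicit Euler scheme, the time step, and the reading of the accumulated level at a node as the energy transferred to it are our simplifying assumptions, not the authors' implementation:

    import numpy as np

    def energy_transfer_proximity(L, source_nodes, t_end, dt):
        # Integrates (C d/dt + L) f = 0 with the source nodes clamped at the unit level
        # (the 'infinite capacity' of Eq. (7)); dt must be small enough for stability.
        n = L.shape[0]
        f = np.zeros(n)
        clamped = np.zeros(n, dtype=bool)
        clamped[source_nodes] = True
        f[clamped] = 1.0
        for _ in range(int(np.ceil(t_end / dt))):
            f = f - dt * (L @ f)          # explicit Euler step of df/dt = -L f for the free nodes
            f[clamped] = 1.0              # the source level never drops
        # The level accumulated at a node q is taken here as the energy transferred from the
        # source set to q during [0, t_end]; entries at the source nodes remain 1 by construction
        # and are ignored by the caller.
        return f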
4 k-Means Clustering
To evaluate its efficiency, we have used the energy-transfer proximity in the k-means clustering algorithm. In this section, we first summarise the principles of the usual version of the algorithm. Our modification will be described in the next section. In k-means clustering, the following objective function is minimised:

    J = Σ_{k=1}^{K} Σ_{i=1}^{N^{(k)}} || x_i^{(k)} − c_k ||^2,   (8)

where N^{(k)} is the number of data points in the k-th cluster, x_i^{(k)} stands for the i-th data point that was assigned to the k-th cluster, and c_k is the centroid of the k-th cluster. The k-means algorithm works in the following steps:
1. The number of clusters (K) is chosen.
2. The initial positions of the cluster centroids are selected randomly.
3. The distances from the centroids to all other data points are computed.
4. The membership of each pixel to a cluster is determined. The pixel is assigned to the cluster whose centroid is closest to the pixel.
5. The new centroids of the clusters are computed. This step is described below in more detail.
6. The process is repeated iteratively from Step 3 until the centroid positions do not vary.
Step 5 in the previously described algorithm can be carried out in two possible ways: (i) using simple position averaging; (ii) using the median position from the pixels assigned to the cluster (centroid). The first approach is a standard way described in almost all literature [1,2,10]. In the second approach, the centroid position is restricted to one of the data points, which is necessary if only the distances on the image manifold can be measured. The point that minimises the sum of squared distances to all remaining points in the cluster is chosen as the centroid. This principle can be formalised by the expression

    c_k = arg min_{x_i^{(k)}} Σ_{j=1}^{N^{(k)}} || x_i^{(k)} − x_j^{(k)} ||^2,   (9)
where x_i^{(k)} is the i-th data point in that cluster. Unfortunately, O(N^2) time is required to find the centroids in each iterative step. As will be shown in the next section, the use of the energy-transfer proximity offers a more elegant way of computation in which determining the centroid positions is completely avoided.
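A direct realization of Eq. (9), with the stated O(N^2) cost, might look as follows (a sketch; the array-based interface is an assumption):

    import numpy as np

    def cluster_medoid(X):
        # X: (N, D) array of the data points assigned to one cluster.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
        return X[int(np.argmin(d2.sum(axis=1)))]                  # point minimizing the total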
5 Computing k-Means Using Energy-Transfer Proximity
The k-means algorithm with energy-transfer proximity works as follows:
1. The seeds of the image segments are defined manually (i.e., so-called seeded segmentation is carried out); for one segment, the seeds may contain one or more pixels. The seeds are taken as an initial approximation of the clusters. It implies that the desired number (K) of clusters is also determined in this way.
2. The energy-transfer proximities from the cluster approximations to all image points that have not been assigned to the clusters yet are computed.
3. The membership of pixels to the clusters (segments) is decided. A pixel is decided to be a member of a certain cluster if the energy-transfer proximity from that cluster to the pixel being decided is greater than the proximity from any other cluster and, at the same time, greater than a chosen threshold. The use of the threshold is important since the membership should not be decided on the basis of values that are small and unreliable. It follows that in one iteration cycle, the membership is not decided for all pixels. It also follows that, if the algorithm ran in this way, the segments could only grow since once the membership has been decided, it cannot be changed. The following modification is useful in this regard. The membership of the pixels on the boundary (near the boundary) of each cluster is considered uncertain and is computed repeatedly despite the fact that it has been decided before. The reliable inner part of each cluster is determined by erosion before each iteration; only this part is taken into the computation of proximity. The membership of boundary pixels may change during the computation since more reliable values of proximity may be obtained later as the segments evolve. In this way, the segments need not only grow; they may also become smaller or move during the iterations.
4. The process from Step 2 is repeated until the cluster areas do not vary (a simplified sketch of Steps 1-4 is given at the end of this section).
As can be seen, the fifth step of the original k-means algorithm (computing the centroids) is not present here. This is due to the fact that the centroids are not needed in this case since we are able to compute the proximity between a set of points and a point directly, i.e., the centroids, which are representatives of the sets of points introduced only for computing the distances, are not needed. This can be regarded as a positive feature. We note that the energy transfers are computed by directly and iteratively integrating the diffusion equation, since the computation via the spectral decomposition would be extremely computationally expensive, as was pointed out in Section 3.
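The following sketch illustrates Steps 1-4; the proximity threshold, the erosion used to obtain the reliable inner part, and the re-evaluation of all non-seed pixels in every iteration are simplifying assumptions on our part:

    import numpy as np
    from scipy.ndimage import binary_erosion

    def proximity_kmeans(img, seeds, proximity_fn, threshold, max_iter=20):
        # seeds: list of boolean masks, one initial seed region per segment (Step 1).
        # proximity_fn(img, mask) -> per-pixel energy-transfer proximity from the region 'mask'.
        labels = np.full(img.shape, -1, dtype=int)                # -1 = not assigned yet
        for k, s in enumerate(seeds):
            labels[s] = k
        for _ in range(max_iter):
            prox = []
            for k, s in enumerate(seeds):
                core = binary_erosion(labels == k) | s            # reliable inner part plus the seed
                prox.append(proximity_fn(img, core))              # Step 2
            prox = np.stack(prox)
            best = prox.max(axis=0)
            new_labels = np.where(best > threshold, prox.argmax(axis=0), -1)   # Step 3
            for k, s in enumerate(seeds):
                new_labels[s] = k                                 # seed pixels keep their segment
            if np.array_equal(new_labels, labels):                # Step 4: cluster areas do not vary
                break
            labels = new_labels
        return labels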
6 Experiments
In this section, we present an evaluation of the energy-transfer proximity measurement used in the k-means algorithm. Real-life images from The Berkeley
Fig. 2. Various images (rows a-d) segmented using the k-means algorithm. From left to right: source image, initial seeds provided by the operator, image gradient provided by the dataset authors, the result of k-means segmentation with the Euclidean distance, the result of k-means segmentation using the energy-transfer proximity, and the segmentation available in [8] whose number of segments is close to the number of seeds in the second image from the left
Segmentation Dataset and Benchmark [8] are used. The initial seeds and their positions have been set manually (see the previous section). We present the results of segmentation as well as a comparison with the hand-made segmentation presented in [8]. Since several hand-made segmentations are usually available in [8], we have chosen the one whose number of segments is close to the number of seeds we used. In Fig. 2, the images are segmented using the modified k-means segmentation algorithm that was described in the previous section. For comparison, the results of the standard k-means segmentation using the Euclidean distance are provided too. In Fig. 2a, it can be seen that there is almost no difference between the results obtained by making use of the Euclidean distance and the energy-transfer proximity. This is due to the fact that the image is very simple. The Euclidean distance is sufficient to carry out the correct segmentation in this case. The source image in Fig. 2c contains more noise than the previous source images. This can especially be seen on the water surface. The Euclidean-distance based segmentation is more sensitive to such noise. In contrast, the proximity-based segmentation correctly preserves the local contrast, e.g., the edges between the water and the sail. Fig. 2d is another example of an image with noisy areas, in this case
grass. The Euclidean-distance based segmentation is almost unable to distinguish between the buffaloes and the grass. The proximity-based segmentation preserves the grass segments without mixing them with other ones.
7 Conclusion
Measuring the distance/proximity is an important part of the image segmentation process. We have proposed a new method of measuring the proximity along the image manifold. We have shown that the Euclidean, geodesic, and resistance distances have some disadvantages; especially, they omit the influence of the size of the areas in which the measurement is taking place. The energy-transfer proximity we propose has the following two main advantages: (i) it takes into account the size of the areas; (ii) it measures the proximity between two points, between a point and a set of points, and between two point sets (i.e., not only between two points). The energy-transfer proximity falls into the family of quantities that are based on the diffusion equation. We claim that the usual diffusion distance between two points need not be suitable for image segmentation purposes in the situation when the time is set so that the energy (substance) does not reach the target point from the starting point. The energy-transfer proximity, in contrast, detects the same situation correctly. The same conclusion applies when measuring the proximity between a point and a set of points and between two point sets. To utilize the energy-transfer proximity, we have incorporated it into the k-means algorithm that may be used for image segmentation. The algorithm can take advantage of the above stated properties. We have presented an experimental evaluation of the energy-transfer proximity in this context. In our opinion, the energy-transfer proximity has the potential to reveal image properties that cannot be detected by using the Euclidean, geodesic, diffusion, and resistance distances. It is also possible to use this proximity with other image-segmentation, machine-learning, and other computer-vision algorithms.

Acknowledgements. This work was partially supported by the grant SP 2011/163 of VŠB – Technical University of Ostrava, Faculty of Electrical Engineering and Computer Science.
References 1. MacQueen, J.B.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, Berkeley (1967) 2. Lloyd, S.P.: Least Squares Quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982) 3. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28, 100–108 (1979)
4. Ravichandran, K.S., Ananthi, B.: Color Skin Segmentation Using K-Means Cluster. International Journal of Computational and Applied Mathematics 4(2), 153–157 (2009) 5. Asgharbeygi, N., Maleki, A.: Geodesic K-means Clustering. In: Proceedings of 19th International Conference on Pattern Recognition, pp. 1–4 (2008) 6. Kashima, H., Hu, J., Ray, B., Singh, M.: K-Means Clustering of Proportional Data Using L1 Distance. In: Proceedings of 19th International Conference on Pattern Recognition, pp. 1–4 (2008) 7. Gaura, J., Sojka, E., Krumnikl, M.: Image Segmentation Based on Electrical Proximity in a Resistor-Capacitor Network. In: Proceedings of Advanced Concepts for Intelligent Vision Systems, pp. 216–227 (2011) 8. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In: Proceedings of 8th International Conference of Computer Vision, pp. 416–423 (2001) 9. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley and Sons, New York (1973) 10. Luo, M., Yu-Fei, M., Zhang, H.J.: A Spatial Constrained K-Means Approach to Image Segmentation. Information, Communications and Signal Processing, 738– 742 (2003) 11. Hosni, A., Bleyer, M., Gelautz, M.: Image Segmentation Via Iterative Geodesic Averaging. In: Proceedings of the 5th International Conference on Image and Graphics, pp. 250–255 (2009) 12. Babic, D., Klein, D.J., Lukovits, I., Nikolic, S., Trinajstic, N.: Resistance-Distance Matrix: A Computational Algorithm and its Applications. International Journal of Quantum Chemistry 90, 166–176 (2002) 13. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 1–18 (2002) 14. Fouss, F., Pirotte, A., Renders, J.M., Saerens, M.: Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Transactions on Knowledge and Data Engineering 19, 355–369 (2007) 15. Joshi, S., Kommaraju, R.V., Phillips, J.M., Venkatasubramanian, S.: Comparing Distributions and Shapes Using the Kernel Distance. In: Proceedings of 27th ACM Symposium on Computational Geometry, pp. 47–56 (2011) 16. Klein, D.J., Randic, M.: Resistance Distance. Journal of Mathematical Chemistry 12, 81–95 (1993) 17. Ling, H.: Diffusion Distance for Histogram Comparison. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 246–253 (2006) 18. Lipman, Y., Rustamov, R.M., Funkhouser, T.A.: Biharmonic Distance. ACM Transactions on Graphics 29, 1–11 (2010) 19. Perona, P., Malik, J.: Scale-Space and Edge Detection Using Anisotropic Diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(7), 629–639 (1990)
SERP: SURF Enhancer for Repeated Pattern

Seung Jun Mok1, Kyungboo Jung1, Dong Wook Ko2, Sang Hwa Lee3, and Byung-Uk Choi4

1 Dept. of Electronics and Computer Eng., Hanyang University
2 Dept. of Intelligent Robot Eng., Hanyang University
3 Dept. of Electrical Eng., BK21 Information Technology, Seoul National University
4 Division of Computer Science & Eng., Hanyang University
Abstract. This paper proposes an object-matching method for repetitive patterns. Mismatching problems occur when descriptor-based features like SURF or SIFT are applied to repeated image patterns due to the use of the usual distance-ratio test. To overcome this, we first classify SURF descriptors in the image using mean-shift clustering. The repetitive features are grouped into a single cluster, and each non-repetitive feature has its own cluster. We then evaluate the similarity between the converged modes (descriptors) resulting from mean-shift clustering. We thus generate a new descriptor space that has a distinct and reliable descriptor for each cluster, and we use these to find correlations between images. We also calculate the homography between two images using the descriptors to guarantee correctness of the match. Experiments with repeated patterns show that this method improves recognition rates. This paper shows the results of applying this method to building recognition; the technique can be extended to matching various repeated patterns in textiles and geometric patterns.
1 Introduction
Object recognition is one of the most important topics in image processing. Recognizing an object in an image requires describing features using local image descriptors [1] such as SIFT (Scale-Invariant Feature Transform) [2], PCA-SIFT, SURF (Speeded-Up Robust Features) [3], GLOH (Gradient Location and Orientation Histogram), shape context, and steerable filters. Histograms of local features are also used [4, 5]. These descriptors have been used in many applications including object recognition and tracking [6], texture matching [7], image retrieval [8], robot navigation [9], and scene understanding [10]. Local image descriptors should be invariant under geometric transforms such as image rotation and changes in scale and viewpoint, and photometric transforms such as illumination change, image blur, and image noise. Of all these approaches, SIFT and SURF descriptors are most widely used for object recognition because they can describe any object in an image. Using these techniques when all objects are unique makes matching with a template image easy. If patterns in the object repeat, however, object recognition is much more difficult. Recognizing an object with repeated patterns is one of the most difficult challenges in image processing. Previous approaches were based on comparing the similarity to the nearest neighbor when matching descriptors. However, this method is incapable of matching
the descriptors if continuously repeating patterns exist in an image. Similarly shaped windows make the recognition of buildings particularly difficult. We propose a method of recognizing buildings and repeated textures that would otherwise be difficult to recognize due to their repeated patterns. Vectors that describe features of repeated patterns have similar values. This results in mismatching that leads to a reduction in recognition performance. For this reason, we classify repeated patterns that have similar vectors and recognize an object by comparing the repeated patterns that have been classified. We classify these repeated patterns using mean-shift clustering [11], which shifts to the cluster whose density is the highest, making it suitable for real-world data analysis. This approach can also handle arbitrary feature spaces. Several applications using this approach in object tracking [12] and image segmentation [13] are currently under study. However, unlike the SIFT or SURF approaches, we find repeated patterns by grouping extracted descriptors with similar ones. Repeated patterns grouped with similar ones converge to one mode, and we can recognize an object by comparing the similarity of the converged modes of the repeated patterns. Distance functions such as the Euclidean, Mahalanobis, and Hamming distances can be used to evaluate the similarity between the modes. We use the Euclidean distance because it is simple and widely used. Throughout the process above, determining whether an object is present can be done by classifying and comparing repeated patterns, even though some SIFT or SURF feature points remain mismatched. We propose a modification of the SURF matching process to overcome this drawback. We remove mismatched descriptors with repeated patterns and calculate a homography matrix using reliable feature points. By using the homography to transform a database image, we can compare the real coordinates of the descriptors extracted from query images with those from the database image. This overcomes the problem of the most similar vector being selected at another location, because the matching vector is selected only from within a fixed range around the extracted descriptor.
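One possible realization of this homography-based coordinate check is sketched below using OpenCV; the RANSAC reprojection threshold and the search radius are assumptions, and the exact procedure used in the paper may differ:

    import numpy as np
    import cv2

    def homography_from_matches(query_pts, db_pts):
        # query_pts, db_pts: (N, 2) arrays of matched keypoint coordinates
        # (reliable matches only; at least four pairs are needed).
        H, _ = cv2.findHomography(np.float32(db_pts), np.float32(query_pts), cv2.RANSAC, 5.0)
        return H

    def coordinate_consistent(H, db_pt, query_pt, radius=10.0):
        # Project a database keypoint into the query image and accept the match only if the
        # matched query keypoint lies within 'radius' pixels of the projected position.
        src = np.float32([[db_pt]])                               # shape (1, 1, 2) as cv2 expects
        projected = cv2.perspectiveTransform(src, H)[0, 0]
        return float(np.linalg.norm(projected - np.float32(query_pt))) <= radius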
2 Extraction of Repeated Pattern
We use mean-shift clustering to classify feature points with similar vectors in images of buildings and repeated textures. The classified repeated patterns converge to modes, and we are able to recognize objects by comparing the similarity between the modes.

2.1 Repeated-Pattern Matching Problem
When descriptor-based features such as SURF or SIFT are applied to images with repetitive patterns such as buildings, mismatching usually occurs because the descriptor vectors are too similar, resulting in mismatching parts as shown in Fig. 1. One feature point corresponds to multiple points, and the distances of the first- and second-nearest neighbors are too close to be distinct. This mismatching of similar repeated patterns makes finding exact correlations and geometric transformations between images difficult. We propose a method of exploiting the repeated features to solve this problem for image and object recognition. We first use mean-shift clustering to classify repeated patterns in the initial descriptor vectors. Multiple repeated features are grouped in a cluster represented by a converged mode vector, and each
Fig. 1. SURF matching results of repeated patterns in buildings
non-repeating feature becomes a cluster with only one element. Each mode vector in mean-shift clustering is the SURF feature descriptor to be matched for recognition. Repeated patterns can be recognized by comparing the similarity of the classified modes from different images with converged modes from original images. We also improve the mismatched SURF descriptors by modifying the SURF matching process. 2.2
2.2 SURF Descriptor
We extract descriptors from images using SURF. SIFT could have been used instead, but the Laplacian sign provided by SURF simplifies and speeds up the matching. Descriptors from outside the windows of buildings have positive Laplacian signs, whereas those from inside the windows have negative Laplacian signs (Fig. 2). Therefore, we can separate the two repeated patterns, inside and outside windows, by comparing the Laplacian signs obtained by SURF.
2.3 Mean-Shift Clustering
We performed mean-shift clustering on the extracted descriptors obtained using SURF. This approach relies on precise analysis of feature spaces, which have various
Fig. 2. Building image (left) and extracted SURF descriptors (right)
classes of shapes. Kernel density estimation is often used to estimate the density in feature spaces where various shape classes exist. A mode of the estimated density can be used as a class center, since it occupies the same position. The modes can be found from the gradient of the kernel density estimate via the mean shift. The kernel density estimate is defined as

$$\hat{f}_{h,K}(x) = \frac{c_{k,d}}{n h^{d}} \sum_{i=1}^{n} k\!\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right), \qquad (1)$$

where $c_{k,d}$ is the normalization constant, $n$ is the number of data points, $h$ is the kernel window size, $k(x)$ is the kernel profile, and $x$ is the 64-dimensional SURF descriptor vector. We only need the gradient of this density, not the probability density function itself. The density gradient estimate is

$$\hat{\nabla} f_{h,K}(x) = \frac{c_{g,d}}{n h^{d}}\left[\sum_{i=1}^{n} g\!\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)\right]
\left[\frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)} - x\right]
= \hat{f}_{h,G}(x)\,\big[m_{h,G}(x) - x\big], \qquad (2)$$
where $g(x)$ is the profile derived from the kernel profile $k(x)$, i.e., $g(x) = -k'(x)$. In Eq. (2), if $m_{h,G}(x)$ and $x$ coincide, the gradient is 0 and $x$ is a mode, because $m_{h,G}(x) - x$ represents the mean shift. No trajectory information is needed. Our experiments use the uniform kernel, which converges in the shortest time. If the mean shift of Eq. (2) is performed on every descriptor extracted by SURF, we can find the mode to which each descriptor converges; descriptors that converge to the same mode are then clustered together. Mean-shift clustering is, however, greatly affected by $h$: a poor choice causes either the repeated patterns not to end up in a single cluster or all descriptors to end up in separate clusters, since the number of clusters depends on the kernel size $h$ in Eq. (2). We experimentally determined $h$ to be 0.18.
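As a concrete illustration of this clustering step, the following minimal Python sketch groups SURF descriptors with mean shift. It is not the authors' code: it assumes the descriptors are available as an (N, 64) NumPy array and uses scikit-learn's MeanShift, whose flat kernel and bandwidth play the roles of the uniform kernel and the window size h above.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_repeated_patterns(descriptors, bandwidth=0.18):
    """Group SURF descriptors (N x 64 array) into repeated-pattern clusters.

    bandwidth corresponds to the kernel size h (0.18 in the paper's experiments).
    """
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(descriptors)   # cluster index for each descriptor
    modes = ms.cluster_centers_            # converged mode vectors
    counts = np.bincount(labels)           # large clusters = repeated patterns,
                                           # singletons = non-repeating features
    return modes, labels, counts
```

Modes whose clusters contain many descriptors are the repeated patterns compared in the next subsection; singleton clusters correspond to non-repeating features.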
Fig. 3. Clustered result of SURF descriptors for database (top) and query images (bottom)
2.4 Similarity Computation
SURF descriptors converge to each mode during mean-shift clustering. Points superimposed on the image show that repeated patterns converge to each mode (Fig. 3); points with the same color are the descriptors that converge to the same mode. The modes to which repeated patterns converge include many descriptors, whereas the others include only a few descriptors. Therefore, we used the modes that included the majority of the descriptors and did not compare all modes.

Table 1. Euclidean distance between modes

Database \ Query    Image 1    Image 2    Image 3    Image 4
Image 1             0.32688    1.15187    2.81242    1.57717
Image 2             1.23622    0.22592    2.15689    1.17471
Image 3             2.21882    2.18314    0.33874    2.27516
Image 4             1.24873    1.60032    2.13738    0.21871
As previously mentioned, the similarity was determined by measuring the Euclidean distance between modes with repeated features. Table 1 shows a distinct difference between the similarities of images of the same building and those of different buildings. Matching is also made easier by SURF because modes with different Laplacian signs are never compared.
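A minimal sketch of this similarity test is shown below, assuming the dominant modes of each image (those containing the majority of descriptors) have already been collected into NumPy arrays; the 0.5 decision threshold is taken from the experimental observation reported with Table 1 and Fig. 7, not from an explicit rule in the paper.

```python
import numpy as np

def mode_similarity(modes_db, modes_query, match_threshold=0.5):
    """Smallest Euclidean distance between the dominant modes of two images.

    modes_db, modes_query: (A, 64) and (B, 64) arrays of converged mode vectors.
    Returns the best distance and whether it falls below the match threshold.
    """
    dists = np.linalg.norm(modes_db[:, None, :] - modes_query[None, :, :], axis=2)
    best = dists.min()
    return best, best < match_threshold
```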
3 Feature Point Matching in Repeated Patterns
Homography transformation was used to solve the mismatching problem among SURF descriptors that frequently occurs with repeated patterns. Transforming an image makes it possible to compare the real coordinates of SURF descriptors extracted from a database image with those from a query image. We can thus modify the mismatching of descriptors.
Fig. 4. Database image (left), transformed image (middle) and query image (right)
3.1 Homography Transformation
We classified repeated patterns in an image and compared the similarity between the modes using mean-shift clustering. This process enabled recognition of an object with repeated patterns in an image. Even so, the problem of mismatched SURF descriptors within repeated patterns remained; it is addressed using a homography transformation, which requires at least four accurate feature points [14]. Mismatched feature points therefore had to be removed first to obtain more reliable ones. Because the feature points that are usually mismatched in repeated patterns are one-to-many matches, we first removed descriptors for which many SURF descriptors matched a single one. We then removed the extracted repeated patterns using our proposed method. Because some mismatched descriptors could still remain, we calculated the homography transformation using RANSAC and transformed the database image accordingly (Fig. 4). We could then compare the real coordinates of the descriptors extracted from the transformed database image with those extracted from the query image, which permitted correction of the mismatched descriptors.
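The following hedged OpenCV sketch shows one way to carry out this step; the reliable, pre-filtered point pairs, the reprojection threshold of 3 pixels, and the use of the database image's own size for the warp are assumptions for illustration rather than details stated in the paper.

```python
import cv2
import numpy as np

def warp_database_image(db_img, db_pts, query_pts, reproj_thresh=3.0):
    """Estimate a database-to-query homography with RANSAC and warp the
    database image into the query frame.

    db_pts / query_pts: (N, 2) arrays of matched keypoint coordinates that
    survived the one-to-many and repeated-pattern filtering described above
    (N >= 4 is required for a homography).
    """
    H, inlier_mask = cv2.findHomography(np.float32(db_pts).reshape(-1, 1, 2),
                                        np.float32(query_pts).reshape(-1, 1, 2),
                                        cv2.RANSAC, reproj_thresh)
    h, w = db_img.shape[:2]
    warped = cv2.warpPerspective(db_img, H, (w, h))
    return warped, H, inlier_mask.ravel().astype(bool)
```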
3.2 SURF Matching Modification
Matching is performed using the descriptors that have the most similar vectors in an image. With repeated patterns, many descriptors have similar vectors, which causes the mismatching shown in Fig. 5. We therefore choose the most similar vector only within a limited spatial range, to avoid matching descriptors at inappropriate locations.
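A small sketch of this spatially constrained matching is given below, assuming the homography H from the previous step; the 20-pixel search radius is an illustrative assumption, not a value given in the paper.

```python
import numpy as np
import cv2

def constrained_match(query_desc, query_xy, db_desc, db_xy, H, radius=20.0):
    """Match each query descriptor only against database descriptors whose
    homography-projected coordinates lie within `radius` pixels of it."""
    proj = cv2.perspectiveTransform(np.float32(db_xy).reshape(-1, 1, 2), H)
    proj = proj.reshape(-1, 2)
    matches = []
    for i, (d, xy) in enumerate(zip(query_desc, query_xy)):
        near = np.where(np.linalg.norm(proj - xy, axis=1) < radius)[0]
        if near.size == 0:
            continue                                  # no candidate in range
        j = near[np.argmin(np.linalg.norm(db_desc[near] - d, axis=1))]
        matches.append((i, int(j)))
    return matches
```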
Fig. 5. Usual SURF matching (left) and proposed SERP matching (right)
Fig. 6. Database images used in the experiment
4 Experimental Results
We used buildings and textures with repeated features to classify and recognize repeated patterns (Fig. 6). An experiment was performed to compare the similarity between modes in order to recognize the building and texture images. Each query image was paired with each database image, mean-shift clustering was applied to the pair (Fig. 7), and the Euclidean distance between the modes computed from the two images was calculated. We calculated the average Euclidean distance between modes when comparing the query image with the non-corresponding database images, and the same distance for the corresponding image (Fig. 7). The distance for corresponding images was less than 0.5, whereas that for the remainder was greater than 1. This distinct difference permitted recognition of repeated patterns.
[Figure 7 plot: Euclidean distance (y-axis) versus database image index (x-axis) for the curves (a) and (b) described in the caption below.]
Fig. 7. Euclidean distances between (a) non-corresponding query images and the database image, and (b) corresponding query image and the database image
We compared the accuracy of the modified SURF (SERP) and the existing SURF matching methods on a recognized image using the same parameters and thresholds. The total number of matched descriptors and the number of correctly matched descriptors were calculated for both methods. Table 2 shows an accuracy improvement on the order of 20–30 percentage points using the proposed method for repeated patterns. SURF suffered from mismatching when repeated patterns produced similar vectors. The proposed method improves on the standard SURF method because the positions of similar descriptors become predictable once the images are aligned, which addresses the problem that the existing SURF method considers only vector similarity.
Table 2. Comparison of the accuracy of SURF and SERP

Image        SURF                                      SERP
index        Total      Correct    Accuracy (%)        Total      Correct    Accuracy (%)
1            476        340        71.47               363        343        94.49
2            1057       753        71.25               878        821        93.51
3            528        382        72.27               457        428        93.65
4            954        671        70.37               851        811        95.30
5            167        118        70.45               145        137        94.48
6            168        124        73.57               148        142        95.95
7            303        216        71.24               256        243        94.92
8            440        303        68.86               393        378        96.18
9            230        156        67.61               189        181        95.77
10           466        314        67.32               386        374        96.89
11           305        222        72.70               264        256        96.97
12           310        214        68.87               267        257        96.25
13           892        382        42.87               392        377        96.17
14           407        225        55.20               223        218        97.76
15           163        100        61.26               94         91         96.81
16           155        98         63.39               89         85         95.51

(Total = total number of matches, Correct = number of correct matches.)
5 Conclusion
This paper has proposed an object-matching method for repetitive patterns. We first classified SURF descriptors in the image using mean-shift clustering. The repetitive features were grouped into one cluster, whereas each non-repetitive feature constituted its own cluster. Once the SURF descriptors were classified, we evaluated the similarity between the converged modes (descriptors) resulting from mean-shift clustering. We thus generated a new descriptor space with a distinct and reliable descriptor for each cluster. We used these descriptor vectors to find correlations between images. We also calculated the homography between two images using the descriptors to guarantee the correctness of the match. Experiments with repeated patterns demonstrated improved recognition rates for images and objects. The proposed method is applicable to matching repeated patterns such as those found in buildings, textiles, and geometric patterns.
References
1. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1615–1630 (2005)
2. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 91–110 (2004)
3. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. Computer Vision and Image Understanding 110, 346–359 (2008)
4. Grauman, K., Darrell, T.: Pyramid Match Kernels: Discriminative Classification with Sets of Image Features. In: ICCV 2005, vol. 2, pp. 1458–1465 (2005)
5. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: CVPR 2006, New York, pp. 2169–2176 (2006)
6. Tuytelaars, T., Van Gool, L.: Matching Widely Separated Views Based on Affine Invariant Regions. International Journal of Computer Vision, 61–85 (2004)
7. Lazebnik, S., Schmid, C., Ponce, J.: Sparse Texture Representation using Affine-Invariant Neighborhoods. In: CVPR 2003, Madison, Wisconsin, USA, pp. 319–324 (2003)
8. Schmid, C., Mohr, R.: Local Gray-Value Invariants for Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 530–534 (1997)
9. Franz, M.O., Mallot, H.A.: Biomimetic Robot Navigation. Robotics and Autonomous Systems 30, 133–153 (2000)
10. Xiao, J., Cheng, H., Han, F., Sawhney, H.: Geo-Spatial Aerial Video Processing for Scene Understanding and Object Tracking. In: CVPR 2008, Anchorage, Alaska, USA, pp. 1–8 (2008)
11. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 603–619 (2002)
12. Comaniciu, D., Ramesh, V., Meer, P.: Real-time Tracking of Non-rigid Objects using Mean Shift. In: CVPR 2000, Hilton Head, pp. 142–149 (2000)
13. Carreira-Perpiñán, M.Á.: Acceleration Strategies for Gaussian Mean-Shift Image Segmentation. In: CVPR 2006, New York, pp. 1160–1167 (2006)
14. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003)
Shape Abstraction through Multiple Optimal Solutions Marlen Akimaliev and M. Fatih Demirci TOBB University of Economics and Technology, Computer Engineering Department, Sogutozu Cad. No:43 , 06560 Ankara, Turkey {makimaliev,mfdemirci}@etu.edu.tr
Abstract. Shape abstraction is an important problem facing researchers in many fields such as pattern recognition, computer vision, and industrial design. Given a set of shapes, a recently developed shape abstraction framework generates an abstracted shape through the correspondences between their features. The correspondences are obtained based on the optimal solution of a well known transportation problem. Considering the case where multiple optimal solutions exist for one problem, this paper ranks all optimal solutions based on how much they preserve the local neighborhood relations and creates the abstracted shape using the solution with the highest rank. Experimental evaluation of the framework demonstrates that the proposed approach compares favorably with the previous shape abstraction technique. Keywords: shape abstraction, transportation problem, multiple optimal solutions.
1 Introduction
The problem of object categorization can be defined as the process of locating and identifying instances of an object category within an image. One of the most important and challenging problems facing researchers in this area is shape abstraction (sometimes referred to as shape simplification or shape averaging). Although this problem was extensively studied by early categorization techniques whose models captured shapes at high levels of abstraction, these techniques were not able to effectively recover shape abstractions from images of real objects [9]. Recently, the abstraction problem has been studied in many different contexts in which important contributions have been made. Closed surfaces, shape grammars, functional object descriptions, structural descriptions, active shape models, and graphs exemplify approaches addressing this problem. In the domain of closed surfaces [4], the shape abstraction process is performed on 3-D objects represented by a set of planar polygons at the 2D level. The approach consists of two steps: establishing a one-to-one correspondence between
Fig. 1. An example of a transportation problem where more than one optimal solution exists. The first point set, P = {p1, . . . , pn}, is shown in the middle of the second point set, Q = {q1, . . . , qn, . . . , q2n}. Assuming that d(pi, qi) = d(pi, qi+n) and w(pi) = w(qj) for i = {1, . . . , n}, 2n optimal solutions exist.
the feature sets of a group of objects and performing averaging based on these correspondences. The averages are simply computed as the mean, the median, or the mode of the feature correspondences. Zhu and Mumford address structural variability within an object class through shape grammars [17]. A grammar represents hierarchical object-part decompositions encoded in And-Or graphs. Using the idea that a grammar can model a coarse-to-fine appearance of an object, effective structural abstraction techniques have been proposed in some recent work, e.g., [13]. In the context of structural description, one of the earliest works was a concept learning system [16]. Here, functional object concept descriptions were employed for both shape abstraction and recognition of object instances, which were given in terms of various visual and structural features of their parts, e.g., shape, orientation, and relative placement. Active shape models [6] use principal component analysis (PCA) of landmark coordinates to model shape variability. Although this approach ignores the nonlinear geometry of shape space by using its vector space approximation, simplicity and efficiency are its strengths. In the context of graph algorithms for shape averaging, Jiang et al. [12] define the median and the set median graphs. While the set median graph is drawn from an input graph set such that its sum distance to the other members is minimized, the median graph is a more general concept and thus is not constrained to come from this set. The main drawback of this approach is its restrictive assumption that graphs belonging to the same class are structurally similar, which limits the framework's ability to accommodate many-to-many feature correspondences that reflect within-class shape variation.
Fig. 2. Parts (a) and (b) show two different point correspondences obtained by two optimal solutions for the same input sets shown in Figure 1. Although the optimal solutions have the same optimal value, their correspondences are different.
Recently, Demirci et al. [7] proposed a new technique for learning an abstract shape prototype from a set of exemplars whose features are in many-to-many correspondence. After representing input exemplars as graphs, the framework computes the node correspondences between a pair of graphs based on the matching framework of [8]. Using these correspondences, the approach generates an abstracted graph whose nodes represent abstractions of corresponding feature sets and whose edges encode properties between the abstractions. Although powerful, the efficacy of this approach critically depends on the optimal solution and its point correspondences, which are in turn obtained from the well-known transportation problem [1]. Given large sets of point distributions in a high-dimensional space, the transportation problem between these distributions may have more than one globally optimal solution. While all optimal solutions have the same optimal value, their point correspondences may be quite different. Figure 1 presents an example, where the first point set, P = {p1, . . . , pn}, is shown in the middle of the second point set, Q = {q1, . . . , qn, . . . , q2n}. Assuming that each point of the first set is equidistant from its corresponding point of the second set and each point has a uniform weight, i.e., d(pi, qi) = d(pi, qi+n) and w(pi) = w(qj) for i = {1, . . . , n} and j = {1, . . . , 2n}, there exist 2n optimal solutions with the same optimal value. Two different point correspondences obtained by two such solutions are presented in Figure 2. In previous work [7], the abstraction of a shape pair is always computed based on the first optimal solution found by the matching algorithm proposed in [8]. In this paper, we extend the previous shape abstraction algorithm by taking all optimal solutions into consideration. We rank the optimal solutions by how much they preserve local neighborhood relations between points and their correspondences. We then proceed with the shape abstraction procedure using the
Fig. 3. A clock silhouette and its shock points. Each shock point is associated with a weight (radius of the maximal bi-tangent circle) and a position.
point correspondences obtained by the one with the highest rank. Experimental evaluation shows that the recognition performance of the proposed approach compares favorably with that of the previous shape abstraction technique. The rest of the paper is organized as follows. We describe the proposed shape abstraction technique in Section 2 and present our algorithm for ranking all optimal solutions used for the shape abstraction procedure in Section 3. After experimentally evaluating both shape abstraction frameworks in the context of 2D object recognition in Section 4, we present our future work and conclude the paper in Section 5.
2 Shape Abstraction Procedure
Our framework begins by representing shapes (silhouettes) as undirected, rooted, weighted graphs in which nodes represent shocks [15] and edges connect adjacent shock points. In [3], the medial axis is defined as the locus of centers of circles inside the region that are bitangent to the boundary in at least two places. The center of a maximal circle is called a shock point [15], and each shock point is associated with its position and radius. Specifically, every shock point p is associated with a 3-dimensional vector v(p) = (x, y, r), where (x, y) are the Euclidean coordinates of p and r is the radius of the maximal bi-tangent circle centered at p. To convert shock graphs to shock trees, we compute the minimum spanning tree of the weighted shock graph. Each node in the tree is weighted proportionally to its radius, with the total tree weight being 1. Figure 3 shows a silhouette and its shock points. Having defined the shock tree representation for a pair of input shapes, the algorithm proceeds with computing correspondences between their nodes using the matching approach presented in [8]. In that approach, the nodes of two input graphs are first embedded into a fixed-dimension Euclidean space using the low-distortion embedding technique of [10]. This transformation reformulates the problem of graph matching as one of point matching. The embedded nodes (points) are then matched using the Earth Mover's Distance (EMD) [14] algorithm. Since the average graph is computed based on these correspondences, obtaining a good correspondence plays an important role in the effectiveness of the algorithm. Different sets of correspondences, in turn, yield different shape abstractions.
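The shock-tree construction just described can be sketched as follows in Python; the use of Euclidean edge weights between adjacent shock points is an assumption made for illustration, since the paper only states that the minimum spanning tree of the weighted shock graph is taken and that node weights are radii normalized to sum to 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def shock_graph_to_tree(points, radii, edges):
    """Build a weighted shock tree from shock points.

    points: (N, 2) array of shock-point coordinates
    radii:  (N,) array of maximal bi-tangent circle radii
    edges:  list of (i, j) index pairs connecting adjacent shock points
    """
    n = len(points)
    w = np.zeros((n, n))
    for i, j in edges:
        w[i, j] = w[j, i] = np.linalg.norm(points[i] - points[j])
    tree = minimum_spanning_tree(csr_matrix(w))   # sparse (n, n) tree adjacency
    node_weights = radii / radii.sum()            # total tree weight equals 1
    return tree, node_weights
```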
We will first explain how different correspondences may be obtained for the same input point sets in the EMD algorithm. The EMD is based on the well-known transportation problem [1] and computes the dissimilarity between two distributions. Assume that each element in the first distribution indicates supplies and each element in the second distribution indicates demands at their positions. The EMD then finds the minimum amount of work required to transform one distribution into the other. Formally, let P = {(p_1, w_{p_1}), . . . , (p_n, w_{p_n})} be the first distribution and Q = {(q_1, w_{q_1}), . . . , (q_m, w_{q_m})} be the second distribution, where p_i (or q_j) is the position of the ith element and w_{p_i} (or w_{q_j}) is its weight, and let d_{ij} denote the ground distance between points p_i and q_j. The objective of this problem is to find a flow matrix F = [f_{ij}], with f_{ij} being the flow between p_i and q_j, which minimizes the overall cost

$$EMD(P, Q) = \sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij}\, d_{ij}$$

such that

$$f_{ij} \ge 0, \quad 1 \le i \le n,\; 1 \le j \le m,$$
$$\sum_{j=1}^{m} f_{ij} \le w_{p_i}, \quad 1 \le i \le n, \qquad \sum_{i=1}^{n} f_{ij} \le w_{q_j}, \quad 1 \le j \le m,$$
$$\sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij} = \min\Big(\sum_{i=1}^{n} w_{p_i}, \sum_{j=1}^{m} w_{q_j}\Big).$$
The EMD is formulated as a linear programming problem and its solution provides the point correspondences and a dissimilarity score between the input sets. The EMD can be solved using some techniques such as transportation-simplex methods, interior-point algorithms, and minimum cost network flow techniques. The previous shape abstraction framework obtains the solution for the EMD using the transportation-simplex method [11], which stops after one optimal solution is found. However, a problem can have more than one optimal solution. Since the efficacy of the shape abstraction procedure directly depends on the correspondences found by the optimal solution, alternative optimal solutions should be considered to make the abstraction process more powerful.
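To make the formulation concrete, the sketch below solves the transportation LP with SciPy's generic linear-programming routine; this is only an illustration of the problem structure, not the transportation-simplex implementation used in the paper, and like any LP solver it returns just one optimal vertex even when several optimal flows exist.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd_flow(P, wp, Q, wq):
    """Return one optimal flow matrix F and its cost for the EMD LP above.

    P: (n, d) positions with weights wp; Q: (m, d) positions with weights wq.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    wp, wq = np.asarray(wp, float), np.asarray(wq, float)
    n, m = len(P), len(Q)
    D = cdist(P, Q)                       # ground distances d_ij
    c = D.ravel()                         # cost over the flow variables f_ij
    # supply rows: sum_j f_ij <= wp_i ; demand rows: sum_i f_ij <= wq_j
    A_ub = np.zeros((n + m, n * m))
    for i in range(n):
        A_ub[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_ub[n + j, j::m] = 1.0
    b_ub = np.concatenate([wp, wq])
    # total flow equals min(total supply, total demand)
    A_eq = np.ones((1, n * m))
    b_eq = [min(wp.sum(), wq.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun
```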
3 Multiple Optimal Solutions for Shape Abstraction
The problem of finding all optimal solutions is not new and has been studied before, e.g., [5]. However, to the best of our knowledge, this paper presents the first framework in which multiple optimal solutions are evaluated for shape abstraction. Given an optimal solution, Christofides and Valls [5] obtain an alternative optimal solution by circulating flow through elementary circuits; an efficient algorithm to compute all optimal solutions is also presented. Using the transportation-simplex method, whether a problem has multiple optimal solutions can be detected after the first optimal solution is computed. More specifically, if a problem has more than one optimal solution, at least one of the nonbasic variables is known to have a coefficient of zero [11]. When a problem with multiple optimal solutions is found, alternative optimal solutions can be identified by performing additional iterations of the simplex method, e.g., each
time choosing a nonbasic variable with a zero coefficient as the entering basic variable. Details of the transportation-simplex method are beyond the scope of this paper; the reader is referred to recent work such as [2]. Let S = {s_1, . . . , s_r} be the set of optimal solutions for the same problem. We order the solutions by how much they preserve the local neighborhood relations between the input sets. One may expect that a good optimal solution should map neighboring points in the first set to neighboring points in the second set. More precisely, let P = {(p_1, w_{p_1}), . . . , (p_n, w_{p_n})} and Q = {(q_1, w_{q_1}), . . . , (q_m, w_{q_m})} be the input distributions as described above. Let N_p = {n_1, . . . , n_k} be the k nearest neighbors of some point p ∈ P (N_p ⊆ P). Assume that optimal solution s' ∈ S maps N_p to some set M_p = {m_1, . . . , m_l} (M_p ⊆ Q). The absolute value of the difference between the average distances within N_p and M_p is used to compute the quality of the mapping. Formally, the quality of the mapping between N_p and M_p is computed as

$$D_{N_p} = \frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} d_{n_i, n_j}, \qquad
D_{M_p} = \frac{1}{l^2} \sum_{i=1}^{l} \sum_{j=1}^{l} d_{m_i, m_j},$$
$$D_{N_p M_p} = |D_{N_p} - D_{M_p}|, \qquad
Q_{N_p M_p} = \frac{1}{1 + D_{N_p M_p}},$$

where $d_{n_i,n_j}$ (or $d_{m_i,m_j}$) is the ground distance between the points. Overall, the quality associated with optimal solution $s' \in S$ is computed as

$$Q(s') = \sum_{p \in P} Q_{N_p M_p}.$$
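A hedged NumPy sketch of this ranking criterion is shown below; the neighborhood size k and the use of each point's dominant (largest-flow) correspondent to form M_p are assumptions introduced for illustration, since the paper does not fix them.

```python
import numpy as np
from scipy.spatial.distance import cdist

def solution_quality(P, Q, F, k=5):
    """Score one optimal flow F (n x m) by how well it preserves local
    neighborhood relations, following Q(s') above."""
    Dp, Dq = cdist(P, P), cdist(Q, Q)
    assign = F.argmax(axis=1)                 # dominant correspondent of each p_i
    score = 0.0
    for i in range(len(P)):
        Np = np.argsort(Dp[i])[1:k + 1]       # k nearest neighbors of p_i
        Mp = assign[Np]                       # their images under the solution
        d_np = Dp[np.ix_(Np, Np)].mean()      # average distance within N_p
        d_mp = Dq[np.ix_(Mp, Mp)].mean()      # average distance within M_p
        score += 1.0 / (1.0 + abs(d_np - d_mp))
    return score

# Ranking: evaluate solution_quality for every optimal flow in S and keep the
# flow with the highest score for the abstraction step that follows.
```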
After we rank the optimal solutions, we start the shape abstraction procedure based on the correspondences found by the highest-ranked optimal solution. Based on these correspondences (mapping), an abstraction is formed using an oriented averaging procedure. Specifically, for each correspondence a point is added whose attributes are the normalized, weighted averages of the attributes of the points involved in the mapping. More formally, assume that P' = {p'_1, . . . , p'_k} denotes a subset of points in P that are matched to a single point q' ∈ Q. To compute the average ordered set, for each correspondence (P', q') we add one point a to the abstraction such that its attributes, radius (r_a) and x and y coordinates (x_a, y_a), are calculated as

$$r_a = \frac{1}{2 w_a}\Big(\sum_{p' \in P'} r_{p'}\, f_{p',q'} + r_{q'}\Big), \qquad
x_a = \frac{1}{2 w_a}\Big(\sum_{p' \in P'} x_{p'}\, f_{p',q'} + x_{q'}\Big), \qquad
y_a = \frac{1}{2 w_a}\Big(\sum_{p' \in P'} y_{p'}\, f_{p',q'} + y_{q'}\Big),$$

where $w_a = \sum_{p' \in P'} f_{p',q'}$ and $f_{p',q'}$ is the flow sent from $p'$ to $q'$ computed by the EMD.
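The oriented averaging step can be written compactly as below; note that the formulas above were reconstructed from a garbled rendering of the original, so this sketch simply mirrors that reconstruction and should be read with the same caveat.

```python
import numpy as np

def abstract_points(attrs_P, F, attrs_Q):
    """One abstracted point per matched q', combining the flow-weighted
    attributes of its matched set P' with q' itself.

    attrs_P: (n, 3) rows (x, y, r); attrs_Q: (m, 3); F: (n, m) EMD flow matrix.
    """
    abstraction = []
    for j in range(attrs_Q.shape[0]):
        flows = F[:, j]
        w_a = flows.sum()
        if w_a <= 0:                       # q'_j received no flow: skip it
            continue
        a = (flows @ attrs_P + attrs_Q[j]) / (2.0 * w_a)
        abstraction.append(a)
    return np.array(abstraction)
```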
Fig. 4. Top two rows show sample silhouettes for each object from the dataset, while the bottom row shows different views of the same object
3.1 Final Algorithm
Our algorithm for shape abstraction combines the previous routines and is summarized below.
Algorithm 1. Shape abstraction
1: Compute the shock tree representations T1 and T2 for input silhouettes S1 and S2, respectively.
2: Find the optimal solution between T1 and T2 using the matching algorithm of [8] and compute its rank based on the criteria described in Section 3.
3: Obtain the ranks associated with the other alternative optimal solutions, if any.
4: Create the shape abstraction using the correspondence found by the highest-ranked optimal solution.
4 Experiments
To evaluate the proposed approach and compare it against the previous shape abstraction algorithm, we perform a set of view-based object recognition experiments. Our dataset consists of 1620 silhouettes of 9 objects with 180 views per object (Figure 4). Every other view is removed to form a set of 810 query views and a remaining database of 810 views. For the database views, we select a subset of maximally similar pairs using the distance computed by the matching algorithm and obtain their abstracted views. For comparison, the abstracted views are computed using both the proposed and the previous approaches. While the previous approach obtains the abstracted view based on the first correspondence found by the transportation-simplex algorithm in the EMD, the proposed approach uses the correspondence returned by the optimal solution with the highest rank computed according to the criteria of Section 3.
We run two types of experiments. In the first experiment, we compute the distance between an abstracted view and each database view. One may expect that a good abstracted view should have a smaller distance to the original views used to create it than to the other views. According to our results, the abstracted views are closer to their original views in 81.7% of the experiments using the previous approach and in 88.4% using the proposed approach, demonstrating the effectiveness of our approach over the previous method. To provide a more comprehensive evaluation, in the second experiment each of the 810 query views is compared to each abstracted view. If the query and the original views of the abstracted view belong to the same object category, the trial is successful. The results show that while the previous work achieves a successful trial in 91.5% of the experiments, the proposed approach increases this rate to 96.3%. Overall, these results indicate an improved shape abstraction process offered by the proposed algorithm. We should note that in the object recognition experiments, the improvement obtained by the proposed method is directly related to the number of times more than one optimal solution is found; when only one optimal solution exists, the previous and proposed algorithms produce the same abstraction. Recall that we created abstracted views for maximally similar pairs in the database. Out of 810 database views, we generated 550 abstracted views, and we observed that multiple optimal solutions were found in 418 of these cases. We expect that the number of multiple optimal solutions, and thus the improvement offered by the proposed work, would increase on a larger dataset.
5 Conclusions
Shape abstraction is an important topic addressed by researchers in different fields. Recently, this problem was formulated based on feature correspondences obtained from the transportation problem. Assuming that input shapes are represented as skeleton graphs, the previous approach computed an abstracted graph based on the vertex correspondences given by the optimal solution of a well-known transportation problem. Considering the case where more than one optimal solution exists, this paper ranks all optimal solutions based on how much they preserve the local neighborhood relations between the corresponding sets. Unlike the previous approach, which computes the abstracted graph using the first optimal solution, the proposed approach obtains the abstracted graph using the optimal solution with the highest rank. Experimental evaluation of the framework yields more effective results than the previous work. Our future plans include performing a more comprehensive experimental test using a larger dataset, including comparisons against other shape-based (not only medial-axis) recognition algorithms, and studying various criteria for evaluating the optimal solutions, e.g., complexity and execution time.
Acknowledgements. Fatih Demirci gratefully acknowledges the support of TÜBİTAK Career grant 109E183.
References 1. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications, pp. 4–7. Prentice Hall, Englewood Cliffs (1993) 2. Bajalinov, E.B.: Linear-Fractional Programming: Theory, Methods, Applications and Software. Kluwer Academic Publishers, Dordrecht (2003) 3. Blum, H.: Biological shape and visual science (part i). Journal of Theoretical Biology 38(2), 205–287 (1973) 4. Chen, E., Parent, R.: Shape averaging and its applications to industrial design. IEEE Computer Graphics and Applications 9(1), 47–54 (1989) 5. Christofides, N., Valls, V.: Finding all optimal solutions to the network flow problem. Mathematical Programming Studies 26, 209–212 (1986) 6. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995) 7. Demirci, F., Shokoufandeh, A., Dickinson, S.: Skeletal shape abstraction from examples. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 944–952 (2009) 8. Demirci, F., Shokoufandeh, A., Keselman, Y., Bretzner, L., Dickinson, S.: Object recognition as many-to-many feature matching. International Journal of Computer Vision 69(2), 203–222 (2006) 9. Dickinson, S.: The Evolution of Object Categorization and the Challenge of Image Abstraction, pp. 1–37. Cambridge University Press, Cambridge (2009) 10. Gupta, A.: Embedding tree metrics into low-dimensional euclidean spaces. Discrete & Computational Geometry 24(1), 105–116 (2000) 11. Hillier, F., Lieberman, G.: Introduction to mathematical programming. McGrawHill, New York (1990) 12. Jiang, X., Munger, A., Bunke, H.: On median graphs: properties, algorithms, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(10) (2001) 13. Levinshtein, A., Sminchisescu, C., Dickinson, S.: Learning hierarchical shape models from examples. In: Proceedings, International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 251–267 (2005) 14. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000) 15. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.: Shock graphs and shape matching. International Journal of Computer Vision 35(1), 13–32 (1999) 16. Winston, P., Binford, T., Katz, B., Lowry, M.: Learning physical description from functional descriptions, examples, and precedents. In: Proceedings, AAAI, Palo Alto, CA, pp. 433–439 (August 1983) 17. Zhu, S.-C., Mumford, D.: A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision 2, 259–362 (2006)
Evaluating Feature Combination in Object Classification Jian Hou1 , Bo-Ping Zhang1 , Nai-Ming Qi2 , and Yong Yang2 1 2
School of Computer Science and Technology, Xuchang University, China, 461000 School of Astronautics, Harbin Institute of Technology, Harbin, China, 150001
Abstract. Feature combination is used in object classification to combine the strength of multiple complementary features and yield a more powerful feature. While some work can be found in literature to calculate the weights of features, the selection of features used in combination is rarely touched. Different researchers usually use different sets of features in combination and obtain different results. It’s not clear to which degree the superior combination results should be attributed to the combination methods and not the carefully selected feature sets. In this paper we evaluate the impact of various feature-related factors on feature combination performance. Specifically, we studied the combination of various popular descriptors, kernels and spatial pyramid levels through extensive experiments on four datasets of diverse object types. As a result, we provide some empirical guidelines on designing experimental setups and combination algorithms in object classification.
1 Introduction
Although the computer vision community has put much effort into object classification in past decades, designing a practical object classification system remains a challenging task, even with datasets of relatively simple images. The main reason is that some images of the same class differ greatly while some of different classes are very similar to each other. While some powerful feature detectors and descriptors, e.g., SIFT [1], MSER [2], HOG [3], etc., have been proposed to tackle this large intra-class diversity and inter-class correlation [4], it's clear that none is sufficient to deal with all object classes. Therefore it's natural to combine the strength of many complementary features to produce a more powerful final feature. In the case of SVM classification, feature combination translates into kernel combination. The basic kernel combination method is to use the average of all participating kernels as the final kernel, i.e., $k^*(x, y) = \frac{1}{n} \sum_{i=1}^{n} k_i(x, y)$. Unlike the averaging method, which assigns the same weight to all participating kernels, multiple kernel learning (MKL) aims to optimize the weights $w_i$ on individual kernels in $k^*(x, y) = \sum_{i=1}^{n} w_i k_i(x, y)$ together with the SVM parameters α and b [5–8, 4]. Different from MKL, which optimizes all parameters jointly, [9] propose to use the LP-Boost
method to train the parameters in two steps. The SVM parameters are trained separately on individual kernels, and then the weights of all kernels are optimized. While research on weighting schemes for feature combination has become a trend in object classification, little work has been done on the selection of the features used in kernel combination. Researchers usually work on different sets of features, and the selection of features may be empirical or based on personal experience. In this case, although the optimization-based weighting schemes seem theoretically sound and practically effective, it's not clear to which degree we should attribute the improvement in performance to the combination method and not to the powerful feature set. As noticed in [9], with a carefully designed set of powerful features, the sophisticated optimization methods show no advantage over the simple averaging method. The strength of optimization methods in weighting lies in their ability to reduce the effect of weak features. This implies that if we can identify a set of strong features, we can use the simple averaging method to obtain performance comparable to that of optimization methods. This observation is supported by the LP-Boost method in [9], where it's shown that not all participating kernels are adopted in the final combination and some are simply discarded with zero weights. In this paper we study some feature-related issues in kernel combination. Firstly, we intend to find out whether more features in combination definitely lead to better classification performance. Secondly, as we can build more than one kernel from one feature, we are interested to know whether more kernels help to improve classification results. Thirdly, since the spatial pyramid has become a popular representation for both local and global features, we investigate how to use spatial pyramid representations in combination in order to obtain the best classification performance. Finally, based on the work on these three problems, we arrive at some guidelines on feature selection. These guidelines may be helpful in designing experimental setups for kernel combination algorithms. The remainder of this paper is organized as follows. Section 2 presents the experimental setup used in all experiments. In Sections 3, 4 and 5 we study feature selection, kernel selection and spatial pyramid selection, respectively, through experiments, and draw some conclusions in each aspect. In Section 6 we conclude the paper with some guidelines and discussion.
2 Experimental Setup
Since the averaging combination method is simple yet powerful, we use $k^*(x, y) = \frac{1}{n} \sum_{i=1}^{n} k_i(x, y)$ to combine kernels in all our experiments on four datasets: Caltech-101 [10], Scene-15 [11–13], Event-8 [14] and Oxford Flower-17 [15]. In all cases the multi-class SVM is trained in a one-versus-all setup and the regularization parameter C is fixed at 1000. With the Caltech-101 dataset [10], we randomly selected 30 images per class for training and up to 50 of the remaining images per class for testing. Different from some of the literature, which uses only 101 object categories, here we adopt all 102 classes in the experiments.
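A minimal sketch of this baseline setup is given below, assuming the per-feature kernel matrices have already been computed; the function and variable names are illustrative, and scikit-learn's precomputed-kernel SVM with a one-vs-rest wrapper stands in for whatever SVM package the authors used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def average_kernel(kernels):
    """Baseline combination: k*(x, y) = (1/n) * sum_i k_i(x, y)."""
    return np.mean(np.stack(kernels), axis=0)

def train_and_score(K_train_list, y_train, K_test_list, y_test):
    """K_train_list: per-feature (n_train, n_train) kernels;
    K_test_list: per-feature (n_test, n_train) kernels."""
    K_train = average_kernel(K_train_list)
    K_test = average_kernel(K_test_list)
    clf = OneVsRestClassifier(SVC(kernel="precomputed", C=1000))
    clf.fit(K_train, y_train)
    return clf.score(K_test, y_test)
```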
Fig. 1. Sample images of the event dataset. Two images per category are displayed with four categories in one row. From left to right and top to bottom, the categories are badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding.
The Event-8 dataset [14] consists of images from 8 sports event categories: badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. Each category has 130 to 250 images. Besides classifying events from static images, the dataset presents some other challenges for classification, including cluttered and diverse backgrounds and various poses, sizes and views of the foreground objects. See Figure 1 for sample images. Following the setup in [14], we randomly select 70 images per class for training and another 60 for testing, and report the 8-class overall recognition rate. The Scene-15 dataset [11–13] contains images from 15 categories with 200 to 400 images per category. Figure 2 shows some example images. We follow the experimental setup in [13], i.e., 100 randomly selected images per class for training and all the others for testing, and report the mean recognition rate per class. The Oxford Flower-17 dataset [15] is composed of flower images of 17 categories with 80 images in each category. See Figure 3 for example images. We use 20 images per class for training and another 20 for testing.
3 Selection of Features
3.1 Features for Evaluation
Since feature combination uses a set of features to improve classification performance, the first question of interest is whether more features definitely produce better results. Many powerful local and global descriptors have been adopted in kernel combination [4, 9]. In this paper we use the following popular descriptors in the evaluation. While some of these features may represent similar information and lead to redundancy, we also note that this problem is inevitable in practice with the existence of more and more features. Therefore we leave these features as they are, without analyzing the possible interactions among them. Note that all features are built in a spatial pyramid. At level 0 the descriptor is extracted from the whole image, and at level l the descriptors are extracted from 2^l × 2^l evenly segmented windows and then concatenated into one final descriptor.
Fig. 2. Sample images of the scene dataset. Two images per category are displayed with five categories in one row. From left to right and top to bottom, the categories are bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office and store.
PHOG Shape Descriptor. We construct oriented (20 bins) and unoriented (40 bins) PHOG [16] descriptors from level 0 to 3, and refer to them as hog180 and hog360, respectively.
Bag-of-SIFT Descriptor. SIFT descriptors [1] are extracted on regular grids with a spacing of 10 pixels and with patches of radii r = 4, 8, 12, 16 to allow for scalability. The descriptors are then quantized into a 500-bin vocabulary. We extract the descriptors in gray space for the Scene dataset (containing only gray images), and in gray, HSV and CIE-Lab spaces for the other 3 datasets. The visual-word histograms are built from level 0 to 2. The descriptors extracted in gray, HSV and CIE-Lab space are denoted by gvw, hvw and lvw, respectively.
Local Binary Patterns. The 256-bin histograms of the basic local binary patterns (LBP) [17] are constructed from level 0 to 2 and referred to as lbp.
Gist Descriptor. We also adopt the gist descriptor [11], extracted from level 0 to 1, and denote it by gist.
Self-similarity Descriptor. The self-similarity descriptors [18] of 30 dimensions (10 orientations and 3 radial bins) on 5 × 5 patches with a spacing of 5 pixels are quantized to build a 500-bin vocabulary. The histograms are then built from level 0 to 2 and denoted by ssm.
Gabor and RFS Filters. Two texture features, Gabor and RFS filters [19], are used to build histograms (500 bins) from level 0 to 2. We refer to them as gab and txn, respectively.
Gray Value Histogram. Finally, we use simple 64-bin gray value histograms from level 0 to 3, denoted by hoi.
Fig. 3. Sample images of the Flower-17 dataset. Two images per category are displayed with four categories per row in the top three rows and five categories in bottom row.
The 11 descriptors cover some of the popular features currently available for object classification and are all of acceptable dimension (< 10000). With features selected in this way, we believe our conclusions from the experiments are meaningful for practical applications.
3.2 Evaluation Procedures
For all descriptors, we build the kernel matrices by $k(x, y) = \exp(-d(x, y)/d_0)$, where $d$ is the pairwise $\chi^2$ distance and $d_0$ is the mean of the pairwise distances. As stated in the last subsection, we use 11 features for Caltech-101, Event-8 and Flower-17, and 9 features for Scene-15. Since there is more than one kernel for each feature, we use the combination of these kernels to represent each feature in this paper. For example, by gvw we mean the combination of the 3 kernels belonging to this feature. We evaluate the effect of the adopted features on combination performance as follows. Firstly, we use each feature in classification separately with 10 training-testing splits. The mean recognition rates reflect the discriminative power of each feature. In the second step we perform the feature combination in three modes: descending, ascending and bottom-top. In the descending mode, we sort the features in descending order of discriminative power, then add the features into the combination one by one and test the combination performance. The ascending mode is executed in a similar manner with ascending order. In the bottom-top mode, we still sort the features in descending order of discriminative power but alternately add features into the combination from the beginning and the end, e.g., in the order 1, 11, 2, 10, · · ·. The performance of the three modes is shown in Figure 4.
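For concreteness, the kernel construction just described can be sketched as follows; the 0.5 factor in the chi-square distance is one common convention and the handling of d0 for test-versus-training kernels is an assumption, since the paper only states that d0 is the mean of the pairwise distances.

```python
import numpy as np

def chi2_kernel_matrix(X, Y=None, d0=None, eps=1e-10):
    """k(x, y) = exp(-d(x, y) / d0) from pairwise chi-square distances.

    X, Y: histogram feature matrices (one image per row). If d0 is None it is
    set to the mean of the computed distances; for test kernels one would
    normally reuse the d0 estimated on the training set.
    """
    Y = X if Y is None else Y
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :] + eps
    D = 0.5 * np.sum(diff ** 2 / summ, axis=2)   # chi-square distances
    d0 = D.mean() if d0 is None else d0
    return np.exp(-D / d0)
```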
602
J. Hou et al.
[Figure 4 plots: recognition rate versus number of combined features for (a) Event-8, (b) Scene-15, (c) Flower-17 and (d) Caltech-101, with curves for the descending, ascending and bottom-top modes.]
Fig. 4. Recognition rates from combination of features in different orders. Only the combination of several most powerful features performs better than the best single feature.
rates reach a peak with about the 4 most powerful features, and more (less powerful) features do not improve or even decrease the performance (see the descending mode, the blue line). Secondly, in the bottom-top mode, the addition of weak features always decreases the recognition rates. Thirdly, in the ascending mode, the addition of stronger features always improves the combination performance, until all 11 (9) features produce the best results of this mode. Although intuitively one may think that the strongest and weakest features are complementary and that their combination is powerful, our experiments with averaging combination do not support this argument. Instead, only the combination of the most powerful features improves the classification performance evidently, and the addition of more weak features into the combination does not produce better results. This is somewhat similar to the sparse solutions in MKL or LP-β combination [9], where only part of the features are selected in the final combination. These observations also indicate a direction for optimization-based combination methods, i.e., extracting better performance than averaging combination from a large set of strong and weak features, since the latter is only effective with strong features. From the separate feature classification we find that for all 4 datasets, gvw and ssm act as strong features and gab and hoi act as weak ones. For color
[Figure 5 plots: recognition rate versus number of combined kernels for (a) Event-8, (b) Scene-15, (c) Flower-17 and (d) Caltech-101, with one curve per descriptor (gab, gist, gvw, hog180, hog360, hoi, hvw, lbp, lvw, ssm, txn).]
Fig. 5. Recognition rates from combination of different kernels. It’s evident that combining multiple kernels rarely improves classification performance.
datasets, i.e., Event-8, Flower-17 and Caltech-101, hvw and lvw also behave as strong features. hog180, hog360, gist and txn always perform ordinarily. lbp performs rather differently on different datasets: very well for Scene-15, ordinarily for Event-8, and rather badly for Flower-17 and Caltech-101.
4 Selection of Kernels
Although much effort has been put into research on effective image representations, the number of powerful features is still rather limited. On the other hand, researchers are working on designing novel kernels that exploit the potential of given descriptors and improve classification performance. Since different kernels capture different aspects of the similarity information and produce different classification results, it's natural to expect the combination of multiple kernels to improve classification, just as in the case of features. In fact, [20] and [21] have used tens of kernels in medical image classification. As in the case of features, here we are interested to know whether multiple kernels help to improve classification.
[Figure 6 bar chart: recognition rates for individual pyramid levels 0–3 and the combined levels 01, 012 and 0123 on Event-8, Flower-17, Scene-15 and Caltech-101.]
Fig. 6. Recognition rates from combination of different spatial pyramid levels. It can be seen that combining multiple levels usually produces the best performance.
For each feature, we use 6 kernels: linear, Gaussian, histogram intersection, and 3 kernels built from the χ2, Euclidean and l1 distance measures. The 3 distance-based kernels have the form $k(x, y) = \exp(-d(x, y)/d_0)$. These 6 kernels cover the commonly used ones and those shown to be discriminative, and we consider them representative enough for our evaluation. We then test whether the combination of multiple kernels improves classification performance. Since in Section 3 we found that only the combination of strong features may produce an improvement in performance, here we test only the descending mode. As in Section 3, we sort the different kernels of each feature in descending order and include kernels in the combination one by one. The results are shown in Figure 5. Unlike the combination of different features, we found that the combination of different kernels rarely improves the classification performance. Although different kernels capture different aspects of feature similarity, their combination with the averaging method does not show better performance than the best single kernel. While this is a discouraging result, it also presents a direction in which optimization-based combination methods can show their advantage over the baseline averaging method, i.e., exploiting the combination potential of multiple kernels. It's interesting to observe the performance of the different kernels. χ2 behaves as the most powerful kernel in all experiments. It is followed by the histogram intersection kernel (HIK) and the l1 distance-based kernel. While HIK has been shown to be an effective kernel [22], the simple l1 distance-based kernel performs almost identically to HIK on all 4 datasets in our experiments. Among the remaining kernels, the Gaussian kernel performs better than the Euclidean distance-based kernel, and the linear kernel is the least powerful one.
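The six kernels can be built along the following lines; the Gaussian bandwidth and the precomputation of the distance matrices are illustrative assumptions, since the paper does not give those parameters.

```python
import numpy as np

def dist_kernel(D):
    """k(x, y) = exp(-d(x, y) / d0), with d0 the mean pairwise distance."""
    return np.exp(-D / D.mean())

def six_kernels(X, D_chi2, D_l2, D_l1, gamma=None):
    """X: feature matrix (one image per row); D_*: pairwise distance matrices."""
    lin = X @ X.T                                           # linear
    hik = np.minimum(X[:, None, :], X[None, :, :]).sum(2)   # histogram intersection
    if gamma is None:
        gamma = 1.0 / (D_l2 ** 2).mean()                    # assumed bandwidth
    gauss = np.exp(-gamma * D_l2 ** 2)                      # Gaussian (RBF)
    return {"linear": lin, "hik": hik, "gauss": gauss,
            "chi2": dist_kernel(D_chi2), "l2": dist_kernel(D_l2),
            "l1": dist_kernel(D_l1)}
```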
5 Selection of Spatial Pyramid Levels
Spatial pyramid has become a popular feature representation to make use of the spatial information among features. In this section we evaluate the effect of pyramid levels on feature combination. Firstly we combine different features
at each individual level. Then we combine the features from multiple levels to see whether better performance can be obtained. In Figure 6 we show the results for the individual levels 0, 1, 2 and 3 and the combined levels 01, 012 and 0123. It's evident from the comparison that although higher levels do not necessarily lead to better performance, the combination of multiple levels does perform better than single levels. Therefore we suggest using multiple pyramid levels in combination for better performance.
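As a sketch of the per-level representation being combined here, the function below builds the level-l bag-of-words histogram over a 2^l x 2^l grid; the L1 normalization is an assumption, and combining levels then amounts to averaging the kernels built from each level, following the same baseline rule used throughout.

```python
import numpy as np

def pyramid_histogram(word_ids, xy, img_w, img_h, level, n_words):
    """Concatenated per-cell bag-of-words histogram at one pyramid level.

    word_ids: (K,) visual-word index of each local descriptor
    xy:       (K, 2) descriptor coordinates in pixels
    """
    g = 2 ** level
    cx = np.minimum((xy[:, 0] * g / img_w).astype(int), g - 1)
    cy = np.minimum((xy[:, 1] * g / img_h).astype(int), g - 1)
    cell = cy * g + cx
    hist = np.zeros((g * g, n_words))
    np.add.at(hist, (cell, word_ids), 1.0)     # accumulate word counts per cell
    hist /= max(hist.sum(), 1.0)               # L1-normalize (an assumption)
    return hist.ravel()
```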
6 Conclusion
We investigated the impact of features, kernels and pyramid levels on feature combination through extensive classification experiments. With the baseline averaging combination, we arrived at some interesting conclusions. Firstly, the combination of strong features does perform better than the best single feature; however, the addition of more (weaker) features tends to decrease the combination performance. Secondly, the combination of multiple kernels shows no advantage over the best single kernel (the χ2-based kernel). Thirdly, although higher levels in the spatial pyramid do not necessarily perform better than lower levels, the combination of multiple levels usually produces better results than single levels. We believe these conclusions are helpful for designing experimental setups and algorithms for feature combination, i.e., using multiple pyramid levels to obtain superior performance, and exploring better performance from multiple kernels and a large set of strong and weak features.
References
1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
2. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference, vol. 1, pp. 384–393 (2002)
3. Dalal, N., Triggs, B.: Histogram of oriented gradients for human detection. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893 (2005)
4. Yang, J.J., Li, Y.N., Tian, Y.H., Duan, L.Y., Gao, W.: Group-sensitive multiple kernel learning for object categorization. In: IEEE International Conference on Computer Vision, pp. 436–443 (2009)
5. Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., Jordan, M.: Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5, 27–72 (2004)
6. Kumar, A., Sminchisescu, C.: Support kernel machines for object recognition. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007)
7. Lin, Y.Y., Liu, T.L., Fuh, C.S.: Local ensemble kernel learning for object category recognition. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
8. Varma, M., Ray, D.: Learning the discriminative power-invariance trade-off. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007)
9. Gehler, P., Nowozin, S.: On feature combination for multiclass object classification. In: IEEE International Conference on Computer Vision, pp. 221–228 (2009)
10. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: CVPR, Workshop on Generative-Model Based Vision, p. 178 (2004)
11. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42, 145–175 (2001)
12. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 524–531 (2005)
13. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178 (2006)
14. Li, L.J., Fei-Fei, L.: What, where and who? classifying event by scene and object recognition. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007)
15. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: IEEE International Conference on Computer Vision, pp. 1447–1454 (2006)
16. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: ACM International Conference on Image and Video Retrieval, pp. 401–408 (2007)
17. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 971–987 (2002)
18. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
19. Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. International Journal of Computer Vision 62, 61–81 (2005)
20. Schuffler, P., Fuchs, T., Ong, C., Roth, V., Buhmann, J.: Computational tma analysis and cell nucleus classification of renal cell carcinoma. In: 32nd Annual Symposium of the German Pattern Recognition Society, pp. 202–211 (2010)
21. Ulas, A., Duin, R., Castellani, U., Loog, M., Bicego, M., Murino, V., Bellani, M., Cerruti, S., Tansella, M., Brambilla, P.: Dissimilarity-based detection of schizophrenia. In: ICPR Workshop on Pattern Recognition Challenges in FMRI Neuroimaging, pp. 32–35 (2010)
22. Barla, A., Odone, F., Verri, A.: Histogram intersection kernel for image classification. In: International Conference on Image Processing, pp. 513–516 (2003)
Solving Geometric Co-registration Problem of Multi-spectral Remote Sensing Imagery Using SIFT-Based Features toward Precise Change Detection Mostafa Abdelrahman, Asem Ali, Shireen Elhabian, and Aly A. Farag Computer Vision & Image Processing Lab. (CVIP), Univ. of Louisville, Louisville, KY, 40292 {Maabde01,aafara01}@louisville.edu, [email protected]
Abstract. This paper proposes a robust, fully automated method for geometric co-registration and an accurate statistics-based change detection technique for multi-temporal high-resolution satellite imagery. The proposed algorithm is based on four main steps. First, a multi-spectral scale-invariant feature transform (M-SIFT) is used to extract a set of correspondence points in a pair, or multiple pairs, of images that are taken at different times and under different circumstances; Random Sample Consensus (RANSAC) is then used to remove the outlier set. To ensure accurate matching, a uniqueness constraint on the correspondences is assumed. Second, the resulting inlier matched points are used to register the given images. Third, changes in the registered images are identified using statistical analysis of image differences. Finally, a Markov-Gibbs Random Field (MGRF) is used to model the spatial-contextual information contained in the resulting change mask. Experiments with generated synthetic multiband images and LANDSAT5 images confirm the validity of the proposed algorithm.
1 Introduction Detecting changes in images of a given scene at different times is very important due to the increasing interest in environmental protection and homeland security. The increasing demand for satellite imagery has led to improvements in sensor resolution, from 30 m in Landsat 5 and 15 m in Landsat 7 to 1 m and 0.6 m in IKONOS and QuickBird; the resolution is even better for commercial satellites such as WorldView-1/2 and GeoEye-1. High-resolution satellite imagery often has different viewing geometries, sensor resolutions, solar illumination angles, and atmospheric conditions. Therefore, automatic registration of two images requires advanced techniques. On the other hand, the high-resolution data can be used toward accurate change detection. Change detection from high-resolution satellite imagery provides detailed information about the Earth's surface for national security, environmental monitoring, land-use management, disaster assessment, etc., which led to the recognition of the fundamental role played by change-detection techniques in monitoring the Earth's surface [1][2][3]. G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 607–616, 2011. © Springer-Verlag Berlin Heidelberg 2011
Applications of change detection include damage assessment from natural hazards (floods, forest fires, hurricanes) and large-scale accidents (e.g., oil spills), as well as keeping watch on unusual and suspicious activities in or around strategic infrastructure elements (dams, waterways, reservoirs, power stations, nuclear and chemical enterprises, camping in unusual places, improvised landing strips, etc.). Most of the well-known supervised and unsupervised methods for detecting changes in remotely sensed images [2][5][6] sequentially perform image preprocessing and image comparison to obtain a "difference" image, and then analyze this difference image. In the unsupervised change detection case, the preprocessing steps are necessary to make the two images comparable in both the spatial and spectral domains. The most critical step is to co-register the images with sub-pixel accuracy so that corresponding pixels within the images relate to the same ground area. Inaccurate co-registration makes change detection unreliable [7], so special techniques are used to reduce the registration errors [3][8]. A number of automated geometric co-registration approaches have been proposed for remote sensing applications based on geometric matching of areas and/or features. Area-based techniques work with image intensity values and are limited by the window size as well as by the similarity of the image pairs [14][15]. Typical features, such as regions, lines, local extrema points (EPs), corners, local maxima of wavelet coefficients, etc., are converted to point features to estimate the parameters of a geometric transformation function from the input image to the reference image [14]-[16]. Correspondence is a fundamental process in multiband registration between two images. It involves two main stages: (1) detecting interest points in the images at hand, followed by (2) building an interest point descriptor for point matching to provide correspondences. It is crucial that the local features extracted from the detected interest points are robust to various deformations due to scale, noise, illumination, and local geometric distortion. Mikolajczyk and Schmid [9] evaluated a variety of approaches for local descriptors and concluded that the SIFT (Scale Invariant Feature Transform) algorithm [10] is the most resistant to such deformations; Mukherjee et al. [18] use SIFT-based features to reduce the dimensionality of hyperspectral data sets. In this work, the SIFT algorithm is used to localize and match corresponding interest points; these points are then used to compute an affine transformation that maps one image to the spatial coordinates of the other. However, this process can be misled by erroneous correspondences; hence RANSAC [17] is used to reject the outlier correspondences and maximize the registration accuracy. The contributions of this paper over the work of Hasan et al. [19] are: first, change detection is performed using statistical analysis of image differences rather than image differencing alone; second, Markov-Gibbs Random Field modeling [12] is then used to extract major changed regions while ignoring small intra-region variations. MGRF-based post-processing allows incorporating spatial interaction between adjacent or nearby pixel signals. Fig. 1 illustrates our framework. This paper is organized as follows: Sec. 2 formulates the change detection problem statement, Sec. 3 introduces SIFT-based multi-spectral registration, Sec. 4 discusses the MGRF-based change detection algorithm, Sec. 5 provides the experimental results, and Sec. 6 concludes the paper.
2 Problem Statement The change detection problem can be formulated as follows. Let $I_1, I_2, \ldots, I_N$ be an image sequence of one scene at different times $t_n$, $n = 1, 2, \ldots, N$, where each $I_n(\mathbf{x})$ is a pixel vector of a multiband image with $B$ spectral bands and $\mathbf{x} \in \Omega = \{(i, j) : 1 \le i \le H, 1 \le j \le W\}$ is a finite spatial grid supporting the image. The main objective of a change detection algorithm is to generate a change mask $M : \Omega \to \{0, 1\}$ from two given consecutive time instances of a scene of interest, $I_n$ and $I_{n+1}$, such that $M(\mathbf{x}) = 1$ if there is a significant change at point $\mathbf{x}$.
Fig. 1. Proposed framework. Given two images of the same scene at different times, Step 1: the SIFT algorithm is used to localize and match interest points in both images; Step 2: RANSAC is used to filter out bad matches which affect the result of registration; Step 3: the images are registered; Step 4: changes are detected using statistical analysis of image differences, and an MGRF is used to model the spatial-contextual information contained in the resulting change mask
3 SIFT-Based Multi-spectral Image Registration As detailed in [11], SIFT consists of four main steps: (1) scale-space peak selection, (2) keypoint localization, (3) orientation assignment, and (4) keypoint descriptor construction. In the first step, potential interest points are detected using a continuous scale-space function $L(\mathbf{x}, \sigma)$, which can be constructed by convolving the multiband image with a cylindrical Gaussian kernel $G(\mathbf{x}, \sigma)$ that can be viewed as a stack of 2D Gaussians, one for each band. According to Lowe [11], the scale is discretized as $\sigma_s = k^s \sigma_0$, where $k = 2^{1/S}$ and $s = -1, 0, 1, 2, \ldots$
Scale-space extrema detection searches over all scales $\sigma$ and image locations $\mathbf{x} = (x, y)$ to identify potential interest points that are invariant to scale and orientation. This can be implemented efficiently using the Difference-of-Gaussians $D(\mathbf{x}, \sigma) = L(\mathbf{x}, k\sigma) - L(\mathbf{x}, \sigma)$, which takes the difference between consecutive scales. For a spectral band $b$, a point $\mathbf{x}$ is selected as a candidate interest point if it is larger or smaller than all of its neighbors in the $3 \times 3 \times 3$ neighborhood system defined on $D(\mathbf{x}, \sigma; b)$, $D(\mathbf{x}, k\sigma; b)$, and $D(\mathbf{x}, \sigma/k; b)$, where $\sigma$ is marked as the scale of the point $\mathbf{x}$. This process leads to too many points, some of which are unstable; hence points with low contrast and points localized along edges are removed. In order to obtain a point descriptor that is invariant to orientation, a consistent orientation should be assigned to each detected interest point based on the gradient of its local image patch. Considering a small window surrounding $\mathbf{x}$, the gradient magnitude and orientation can be computed using finite differences. The local image patch orientation is then weighted by the corresponding magnitude and a Gaussian window. Eventually the orientation is selected as the peak of the weighted orientation histogram. Building a point descriptor is similar to orientation assignment. A 16x16 image window surrounding the interest point is divided into sixteen 4x4 sub-windows. Then an 8-bin weighted orientation histogram is computed for each sub-window; hence we end up with 16x8 = 128 descriptor entries for each interest point. Thus each detected interest point can now be defined by a location, a specific scale, a certain orientation, and a descriptor vector, i.e., $\mathbf{p} = (\mathbf{x}, \sigma, \theta, \mathbf{d})$. The SIFT operator can thus be viewed as mapping a multiband image to an interest point space, with $P_b$ interest points detected from band $b$ and a total of $P = \sum_b P_b$ points detected from all the bands. Given two consecutive time instances of a scene of interest, $I_n$ and $I_{n+1}$, we use the SIFT operator to obtain the point sets $\mathcal{P}_n$ and $\mathcal{P}_{n+1}$. Interest point matching is performed to provide correspondences between the given images. Two points $\mathbf{p}_i \in \mathcal{P}_n$ and $\mathbf{p}_j \in \mathcal{P}_{n+1}$ with SIFT descriptors $\mathbf{d}_i$ and $\mathbf{d}_j$, respectively, are said to be in correspondence if their descriptors match in the $L_2$-norm sense, i.e., $j = \arg\min_{l} \lVert \mathbf{d}_i - \mathbf{d}_l \rVert_2 = \arg\min_{l} \sqrt{\sum_{m=1}^{128} (d_{i,m} - d_{l,m})^2}$.
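As a concrete illustration of the per-band detection and the L2 matching described above, the following sketch applies OpenCV's stock SIFT detector independently to each spectral band and matches descriptors with a mutual (cross-checked) nearest-neighbour search, which mirrors the uniqueness constraint mentioned in the abstract. This is an illustrative approximation, not the authors' M-SIFT implementation; the function and variable names are placeholders.

```python
import numpy as np
import cv2  # OpenCV >= 4.4, which ships SIFT in the main module


def multispectral_sift(image):
    """Run SIFT independently on every band of an H x W x B image and
    pool the resulting keypoints and 128-D descriptors."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = [], []
    for b in range(image.shape[2]):
        band = cv2.normalize(image[:, :, b], None, 0, 255,
                             cv2.NORM_MINMAX).astype(np.uint8)
        kps, desc = sift.detectAndCompute(band, None)
        if desc is not None:
            keypoints.extend(kps)
            descriptors.append(desc)
    return keypoints, (np.vstack(descriptors) if descriptors else None)


def match_l2(desc1, desc2):
    """L2 nearest-neighbour matching; crossCheck enforces one-to-one
    (unique) correspondences between the two descriptor sets."""
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    return matcher.match(desc1, desc2)
```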
Rigid registration solves for a transformation $T(\cdot)$, using the iterative closest point (ICP) algorithm, such that the term $\lVert T(I_n) - I_{n+1} \rVert$ is minimized. The accuracy of this transformation depends on the correspondences between interest points; however, the matched points resulting from the SIFT algorithm might include incorrectly matched points. We use RANSAC [17] to filter out the bad correspondence points. Two randomly chosen correspondences are selected as the minimal subset of the interest points sufficient to determine the transformation. The transformation $T(\cdot)$ is estimated to minimize the aforementioned cost function, all interest points of $I_n$ are then transformed, and points which deviate from the current transformation model by more than a specified threshold are considered outliers; hence the support of the model is measured by the ratio of the inliers to the total number of correspondences. This procedure is repeated $K$ times, where $K = \binom{n_c}{2} = \frac{n_c!}{2!(n_c - 2)!}$ and $n_c$ denotes the total number of
correspondences. Finally, the best-fit transformation model is the one with the maximum support, and the correspondences marked as outliers are excluded. To validate the registration step, the correlation coefficient is used as a measure of the similarity between the source and target images before and after registration:
$\rho(I_1, I_2) = \frac{E[(I_1 - \mu_1)(I_2 - \mu_2)]}{\sigma_1 \sigma_2}$, where $I_k$ represents an image, $\mu_k$ and $\sigma_k$ stand for its mean and standard deviation, respectively, and $E$ stands for the expectation. A marked improvement in the correlation coefficient is observed after registration, which indicates accurate alignment.
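To make the RANSAC filtering and the correlation-coefficient check concrete, the sketch below leans on OpenCV's estimateAffine2D, which runs RANSAC internally over the matched point sets, and computes the correlation coefficient with NumPy. It is a simplified stand-in for the procedure above (the paper enumerates candidate pairs explicitly); the point arrays, threshold, and names are illustrative, and the two images are assumed to share the same size.

```python
import numpy as np
import cv2


def register_and_validate(img1, img2, pts1, pts2, reproj_thresh=3.0):
    """Fit an affine map img1 -> img2 from matched points (RANSAC inside)
    and report the correlation coefficient before and after warping.

    pts1, pts2: (N, 2) float32 arrays of corresponding point locations.
    """
    A, inlier_mask = cv2.estimateAffine2D(pts1, pts2,
                                          method=cv2.RANSAC,
                                          ransacReprojThreshold=reproj_thresh)
    h, w = img2.shape[:2]
    warped = cv2.warpAffine(img1, A, (w, h))

    def corr(a, b):
        # rho = E[(I1 - mu1)(I2 - mu2)] / (sigma1 * sigma2)
        return np.corrcoef(a.astype(np.float64).ravel(),
                           b.astype(np.float64).ravel())[0, 1]

    return warped, corr(img1, img2), corr(warped, img2), inlier_mask
```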
4 MGRF-Based Change Detection Significant changes are always detected in localized groups of pixels; hence, the change decision at a given point $\mathbf{x} \in \Omega$ is based on its neighborhood in each image. Let $N_n(\mathbf{x}) \subset I_n$ be the neighborhood of $\mathbf{x}$ in $I_n$, defined as $N_n(\mathbf{x}) = \{(i, j) : |i - i_{\mathbf{x}}| \le \Delta i, \ |j - j_{\mathbf{x}}| \le \Delta j\}$, where $\Delta i$ and $\Delta j$ govern the neighborhood size; a similar definition applies to the neighborhood $N_{n+1}(\mathbf{x})$ of the same point in the second image. To measure the degree of change locally at the point $\mathbf{x}$, the difference between the neighborhood blocks is computed, i.e., $d(\mathbf{x}) = \lVert N_n(\mathbf{x}) - N_{n+1}(\mathbf{x}) \rVert$, and the change decision at this point is calculated as $M(\mathbf{x}) = 1$ if $d(\mathbf{x})$ exceeds a threshold and $M(\mathbf{x}) = 0$ otherwise. In order to incorporate the spatial interaction in the resulting change mask $M(\cdot)$, we use a Gibbs random field to provide a global model of the change mask in terms of the joint distribution of the change decisions $M(\mathbf{x}) \in \{0, 1\}$ for all image points $\mathbf{x}$. Adding the Markov property allows the establishment of a local model for each point based on its neighborhood. Let $\mathcal{R} = \{1, 2, \ldots, R\}$ be the set of image points, $\mathcal{D} = \{0, 1\}$ the set of decisions taken in the change mask, $\mathcal{L} = \{1, 2, \ldots, L\}$ a set of class labels, $m = (m_1, m_2, \ldots, m_R)$ the labeled change mask such that $m_r \in \mathcal{L}$, and $X = (X_1, X_2, \ldots, X_R)$ a set of random variables defined on $\mathcal{R}$, so that $m$ is a realization of the field $X$. The given change mask and the desired segmented change mask are described by a joint Markov-Gibbs random field (MGRF) which is fitted within the Bayesian framework of maximum a posteriori estimation to estimate $m^* = \arg\max_m P(m \mid M)$. Ali et al. [12] use the pair-wise interaction model to define the neighborhood system of each point in $\mathcal{R}$. The image is represented by an MGRF with joint distribution $P(m) = \frac{1}{Z} \exp\big(\sum_{(r, r')} V(m_r, m_{r'})\big)$, where $Z$ is a normalizing constant called the partition function, $V(\cdot)$ is a potential/neighborhood function (Gibbs energy), and the potentials are the model parameters. The distribution $P(M \mid m)$ is an MRF; by assuming that the noise at each point is independent [12], we have $P(M \mid m) = \prod_r P(M_r \mid m_r)$ (see [12] for more details).
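A simplified sketch of this pipeline is shown below: a block-wise mean absolute difference stands in for the neighborhood comparison, and a few iterations of iterated conditional modes (ICM) with a pair-wise Potts prior stand in for the full MGRF/MAP estimation of [12]. The block size, threshold, and interaction weight are placeholders, not the authors' fitted parameters.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def change_mask(img1, img2, block=5, tau=0.1):
    """M(x) = 1 where the local mean absolute difference between the
    registered images exceeds tau (block-wise decision)."""
    d = uniform_filter(np.abs(img1.astype(float) - img2.astype(float)),
                       size=block)
    return (d > tau).astype(np.uint8)


def icm_smooth(mask, beta=1.5, iters=5):
    """ICM relaxation with a pair-wise Potts prior: each pixel takes the
    label that agrees with the initial decision and with most of its
    4-neighbours, removing isolated (unstable) changes."""
    m = mask.astype(np.int8)
    for _ in range(iters):
        # number of 4-neighbours currently labelled 1 (with wrap-around borders)
        nb1 = (np.roll(m, 1, 0) + np.roll(m, -1, 0) +
               np.roll(m, 1, 1) + np.roll(m, -1, 1))
        # label 1 wins iff beta*(nb1 - nb0) plus the data term is positive
        m = (beta * (2 * nb1 - 4) + 2 * mask.astype(np.int8) - 1 > 0).astype(np.int8)
    return m.astype(np.uint8)
```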
5 Experimental Results The robustness of the proposed method for change detection is tested against simulated texture images with known ground-truth changes. They are generated from a mixture of Gaussians to yield 7 bands (Fig. 2). It can be observed that the changes detected in Fig. 2(g) are close to the ground-truth changes shown in Fig. 2(h), while the MGRF-based post-processing removes unstable changes which have no significant spatial support.
Fig. 2. Results on simulated world map image (a) first image, (b) second image, (c) matched points using SIFT, (d) matched points after RANSAC, (e) the two images after registration, (f) the change mask, (g) the change mask after the MGRF, (h) the ground truth
Hurricane Katrina made landfall on Aug. 29, 2005 in south Plaquemines Parish, La., near the towns of Empire, Buras, and Boothville. It caused widespread destruction in Louisiana, Mississippi, and Alabama and turned out to be the most expensive hurricane in the history of the United States. The coastlines of those states were forever changed. NASA, using an Atlantic Global Research contract aircraft and the agency's own advanced technology, made it possible to see how much and what type of damage Katrina caused when it came ashore. Some results using images before and
Fig. 3. Sample of results (a) Matched points using SIFT, (b) Matched points after RANSAC, (c) The two images after registration, (d) The change mask, (e) Changes in first image, (f) Changes in second image
after the hurricane with the proposed algorithm are shown in Fig. 3 for NASA Photojournal images to illustrate how much damage this hurricane caused; the change boundaries are drawn in green. The algorithm is also tested on real remote sensing LANDSAT5 and LANDSAT7 images. A sample of the results is shown in Fig. 4. The first image is path 177, row 038, captured in 1998, while the second image is of the same scene but captured in 2001. These images were taken of the River Nile delta in Egypt, whose coast is vulnerable to erosion, and the change detection results confirm this issue. The total number of matched points is 443, and the total number of matched points after RANSAC is 364. The validation of the registration step using the correlation coefficient shows a large improvement after registration: the correlation coefficient is more than 90%, while it was very small before registration.
Fig. 4. Result on a sample LANDSAT5 image (a) first image, (b) second image, (c) matched points using SIFT, (d) matched points after RANSAC, (e) the two images before registration, (f) the two images after registration, (g) the change mask, (h) the change mask after the MGRF
6 Conclusion This paper introduced a robust change detection algorithm which is based on the SIFT operator to localize and match interest points in time sequence multi-spectral images taken for the same scene. The change detection was performed by statistical analysis of
image differences after co-registration, followed by an MGRF-based post-processing step that incorporates spatial interaction in order to remove unstable changes. Results were shown using simulated multi-band images and real multi-spectral images. Validation of the registration step using the correlation coefficient confirms the high accuracy of the extracted features and of the registration step. Visual inspection confirms the validity of our system.
References [1] Carlotto, M.J.: Detection and analysis of change in remotely sensed imagery with application to wide area surveillance. IEEE Trans. Image Processing 6, 189–202 (1997) [2] Bruzzone, L., Serpico, S.B.: An iterative technique for the detection of land-cover transitions in multitemporal remote-sensing images. IEEE Trans. Geoscience Remote Sensing 35, 858–867 (1997) [3] Bruzzone, L., Serpico, S.B.: Detection of changes in remotely sensed images by the selective use of multi-spectral information. Int. J. Remote Sensing 18, 3883–3888 (1997) [4] Sırmaçek, B.: Object Detection in Aerial and Satellite Images, Ph.D. Dissertation, Yeditepe University, Istanbul, Turkey (2009) [5] Wiemker, D.: An iterative spectral-spatial Bayesian labeling approach for unsupervised robust change detection on remotely sensed multispectral imagery. In: Proc. 7th Int. Conf. Computer Analysis of Images and Patterns, Kiel, Germany (September 1997) [6] Nielsen, A.A., Conradsen, K., Simpson, J.J.: Multivariate alteration detection (MAD) and MAF processing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sensing Environ. 64(1), 1–19 (1998) [7] Townshend, J.R.G., Justice, C.O., Gurney, C.: The impact of misregistration on change detection. IEEE Trans. Geoscience Remote Sensing 30, 1054–1060 (1992) [8] Bruzzone, L., Prieto, D.F.: An adaptive semiparametric and context-based approach to unsupervised change-detection in multitemporal remote sensing images. IEEE Trans. Image Process. 11(4), 452–466 (2002) [9] Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. In: CVPR, vol. 2, pp. 257–263 (2003) [10] Lowe, D.: Object recognition from local scale-invariant features. In: ICCV (1999) [11] Lowe, D.: Distinctive image features from scale-invariant keypoints, cascade filtering approach. IJCV 60, 91–110 (2004) [12] Ali, A.M., Farag, A.A., Gimel’farb, G.: Analytical method for MGRF Potts model parameter estimation. In: 19th International Conference on Pattern Recognition, ICPR 2008, December 8-11, pp. 1–4 (2008) [13] Zitova, B., Flusser, J.: Image registration methods: A survey. Image Vis. Comput. 21, 977–1000 (2003) [14] Cole-Rhodes, A., Johnson, K.L., Le Moigne, J., Zavorin, I.: Multiresolution registration of remote sensing imagery by optimization of mutual information using a stochastic gradient. IEEE Trans. Image Process. 12, 1495–1511 (2003) [15] Brown, L.G.: A survey of image registration techniques. ACM Comput. Surv. 24(4), 325–376 (1992) [16] Fonseca, L.M.G., Costa, M.H.M.: Automatic registration of satellite images. In: Proc. Brazilian Symp. Computer Graphics and Image Processing, pp. 219–226 (1997)
[17] Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24, 381–395 (1981) [18] Mukherjee, A., Velez-Reyes, M., Roysam, B.: Interest points for hyperspectral image data. IEEE Trans. Geosci. Remote Sens. 47 (2009) [19] Hasan, M., et al.: Multi-spectral remote sensing image registration via spatial relationship analysis on SIFT keypoints. In: 2010 IEEE Geoscience and Remote Sensing Symposium (IGARSS), pp. 1011–1014 (2010)
Color Compensation Using Nonlinear Luminance-RGB Component Curve of a Camera Sejung Yang, Yoon-Ah Kim, Chaerin Kang, and Byung-Uk Lee Dept. of Electronics Engineering, Ewha Womans University, 11-1 Daehuyn Dong, Seoul 120-750, Korea Tel.: +82-2-3277-3452; Fax: +82-2-3277-3494 {sejungyang,kgiraffe927}@gmail.com, [email protected], [email protected]
Abstract. Many color image processing methods employ gray scale image algorithms first and then apply color mapping afterwards. Most popular gamut mapping techniques are hue-preserving methods such as shifting or scaling of color components. However, those methods result in unnatural modification of color saturation. In this paper, we propose a novel color mapping method based on nonlinear luminance-RGB component characteristic curves of a camera taking the image. We obtain an accurate color value after luminance enhancement using luminance-RGB curves of the camera. All experiments demonstrate that our algorithm is effective and suitable to be embedded in a camera because of its simplicity and accuracy in color enhancement of digital images.
1 Introduction
The need for color compensation algorithms is increasing these days due to the widespread use of digital imaging equipment such as digital cameras and digital televisions. It is well known that humans perceive color with brightness, hue, and saturation components and that one major requirement for high-quality images is high contrast. Image enhancement techniques, such as contrast stretching, slicing, and histogram equalization, have been successfully applied to digital images for their effectiveness in improving subjective image quality [1]. These techniques are suitable for the enhancement of gray images. However, application of these techniques to color images is not straightforward. Unlike gray scale images, color images have three attributes, brightness, hue, and saturation, all of which are interrelated. For example, a brightness change influences hue and color saturation. Therefore, we need to preserve hue and correct for saturation after luminance modification. Popular linear methods of preserving hue after luminance change are shifting and scaling [2]. Shifting translates the red, green, and blue components by the same amount. It is employed in many color histogram equalization methods [3][4]. However, shifting cannot always achieve an assigned luminance value due to the limited range of the gamut. Naik and Murthy developed an algorithm that applies scaling to the color or complementary color components depending on the increase or decrease of the luminance to avoid gamut clipping [5]. This method preserves hue, but it reduces the color saturation component, as described in Section 2.2. Hue-based color G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 617–626, 2011. © Springer-Verlag Berlin Heidelberg 2011
saturation compensation has been proposed using a saturation model as a function of hue [6]. While this algorithm does result in reasonable color compensation, it does not provide enough validation for the modeling. The primary objective of the proposed color compensation method is to obtain the color value that would be produced by the same camera under a modified gray level. Lee et al. proposed a color correction method based on the color characteristics of the camera used in taking the picture to implement realistic color compensation of an image [7]. Since a luminance change in a camera can be achieved by varying the exposure time, the method involves calculating a modified color value by interpolating camera color data taken at various exposure levels. However, this method has a drawback: the accuracy depends on the size of the database. In this paper, we overcome this shortcoming by modeling the nonlinear luminance-RGB component characteristic curves of a camera. Since the proposed method employs characteristic curves instead of a color database of the camera, it enables efficient and effective color compensation using relatively little memory, thereby reducing calculation time. The algorithm is suitable to be embedded in a consumer digital camera in that the processing is fast due to the simplicity of the algorithm, and the accuracy is sufficient, as we will observe in the experiments. This paper is organized as follows. An overview of hue preserving color compensation methods is given in Section 2. In Section 3, we present a new color compensation method employing the nonlinear characteristic curves of a camera and explain how to apply them to color image enhancement. Comparison and analysis of experimental results are presented in Section 4. Finally, concluding remarks are provided in Section 5.
2 A Review of Hue Preserving Gamut Mapping
In this section, we briefly describe hue preserving color compensation algorithms. First, the color shifting and scaling operations are explained. Then, we describe an improved hue preserving color compensation method that avoids the gamut problem.

2.1 Scaling and Shifting
It is necessary to preserve hue in order to avoid unwanted color changes. In general, it is known that shifting and scaling of color signal are hue preserving operations [2]. First, we denote the normalized color values for R, G, and B component of a pixel of an image by r, g, and b, where 0 ≤ r , g , b ≤ 1 . The processed RGB vector, denoted
as ( r ′, g ′, b′) , can be easily computed by scaling the original RGB values to match the modified luminance:
(r ′, g ′, b′) = (α r , α g , α b )
(1)
where α is the ratio of the modified luminance to the original. Arici et al. used scaling transformation for image contrast enhancement [8]. Shifting moves the R, G, and B components by the same amount β , i.e,
( r ′, g ′, b′) = ( r + β , g + β , b + β )
(2)
Geometric interpretation of shifting represents a translation of the color along (1, 1, 1) direction in RGB color space. Shifting is adopted in [3] and [4]. Scaling and shifting can be combined:
( r ′, g ′, b′) = (α r + β , α g + β , α b + β )
(3)
where r′, g′, and b′ are linear in r, g, and b. The parameters α and β are constants independent of r, g, and b. We adopt the definition of the HSI color system in [1] for a simple analysis of hue and saturation.
$V_1 = \frac{1}{\sqrt{6}}(-r - g + 2b), \quad V_2 = \frac{1}{\sqrt{6}}(r - g),$
$S = \sqrt{V_1^2 + V_2^2}, \quad H = \tan^{-1}(V_2 / V_1)$
(4)
Scaling and shifting transformations given in (3) preserve the hue defined in (4).
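As a quick numerical check of this claim, the sketch below evaluates the hue and saturation of Eq. (4) before and after the combined scaling-and-shifting map of Eq. (3); the test color and the constants α and β are arbitrary, and atan2 is used instead of a plain arctangent for quadrant safety.

```python
import numpy as np


def hue_saturation(rgb):
    """Hue and saturation following Eq. (4)."""
    r, g, b = rgb
    v1 = (-r - g + 2.0 * b) / np.sqrt(6.0)
    v2 = (r - g) / np.sqrt(6.0)
    return np.arctan2(v2, v1), np.hypot(v1, v2)


rgb = np.array([0.7, 0.4, 0.2])          # arbitrary test color
alpha, beta = 0.8, 0.05                  # arbitrary scale/shift constants
mapped = alpha * rgb + beta              # Eq. (3)

h0, s0 = hue_saturation(rgb)
h1, s1 = hue_saturation(mapped)
print(np.isclose(h0, h1))                # True: hue is unchanged
print(np.isclose(s1 / s0, alpha))        # True: saturation scales by alpha
```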
2.2 Hue Preserving Color Compensation Method without Gamut Problem
Naik and Murthy proposed an efficient color compensation method preserving hue without gamut problems [5]. This algorithm obtains a new color value using a ratio of enhanced gray scale image to the input. Our proposed method is not sensitive to the definitions of luminance. Therefore, we define the luminance as l = r + g + b for simplicity and f(l) represents the enhanced luminance. The maximum value of color components is normalized to 1. To avoid a gamut problem where the resulting color falls outside of the realizable RGB space, the color modification depends on the luminance ratio α (l ) = f (l ) / l ; if α ≤ 1 , the modified color is obtained by multiplying α to the original color and if α > 1 , the same procedure is applied to a complementary color [5]. Naik and Murthy’s algorithm preserves the hue using scaling and shifting, but it reduces the saturation. Let S1 be the saturation of the original pixel and S2 be saturation of the compensated color. According to equation (3), if α ≤ 1 , S2 = α S1 and if
α > 1, S2 = α′S1; the saturation values are multiplied by α or α′, which are less than 1; hence saturation reduces after color compensation. When α > 1, the following equation can be derived between r, g, b and r′, g′, b′:
$(r'\ g'\ b')^{T} = (1 - \alpha')(1\ 1\ 1)^{T} + \alpha'(r\ g\ b)^{T}$
(5)
Equation (5) implies that the compensated color is a mixture of the original color and 100 % white at the ratio of α ′ : (1 − α ′) , thus reducing the color saturation.
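The sketch below illustrates the mapping just summarized: the color is scaled directly when α ≤ 1, and the complementary color is scaled otherwise so that the target luminance f(l) is reached without gamut clipping. It follows the description given here (with l = r + g + b and a maximum luminance of 3), not the reference implementation of [5]; names are illustrative.

```python
import numpy as np


def hue_preserving_map(rgb, f_l):
    """Map a normalized RGB color to the enhanced luminance f_l while
    preserving hue and staying inside the gamut.

    rgb : components in [0, 1]; luminance is l = r + g + b.
    """
    l = rgb.sum()
    alpha = f_l / l
    if alpha <= 1.0:
        return alpha * rgb                       # scale the color itself
    # Otherwise scale the complementary color so the luminance reaches f_l.
    l_max = 3.0                                  # luminance of pure white
    alpha_c = (l_max - f_l) / (l_max - l)
    return 1.0 - alpha_c * (1.0 - rgb)           # equals Eq. (5) with alpha' = alpha_c
```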
3 Color Compensation Based on Camera Characteristics
In this section, we describe the proposed color compensation method based on the characteristics of the camera used to capture images. In many cases, a gray image enhancement algorithm is applied to a luminance component, and RGB component is
Fig. 1. (a) original image (b) a series of images taken at various exposure times (c) result image after color histogram equalization using the proposed method
modified depending on the amount of the luminance change. Most algorithms employ color shifting or scaling, as shown in the previous section [2, 3, 4]. However, these methods ignore the nonlinear color characteristics of the camera, which results in imprecise color. Modifying an image for inadequate or excessive exposure time results in changes in luminance that necessitate corresponding adjustments in RGB. In the proposed method, the luminance-RGB component curves of the camera are applied to restore the color components. Fig. 1 describes the overall concept of the proposed color compensation method. Fig. 1 (a) is the original image, and Fig. 1 (b) is a series of images of the same scene taken at various exposures. In an ideal case, we can take images of the same static scene with variable exposures. Through a stack of the images, we obtain a series of color data of the same pixel at various luminance levels. First, we apply grayscale image processing to the luminance component of an original image, thereby obtaining an enhanced grayscale image. Then, we obtain new color components from Fig. 1 (b) by finding the color values corresponding to the enhanced gray level of each pixel. Therefore, the resulting picture is equivalent to changing the exposure for each pixel to achieve the desired luminance. Since we cannot take many pictures for a given scene, we build a characteristic curve of the RGB component as a function of the luminance. We generate the nonlinear characteristic curves of a camera for each of the R, G, and B channels based on the method shown by Debevec et al. [9]. Nonlinear camera response curves have been successfully applied to high dynamic range imaging [9, 10], since they are precise and easy to construct [10, 11]. Fig. 2 shows the characteristic curves of the Canon 40D camera and the measured data. Photographs were taken at F/4.6 with exposure times ranging from 1/2500 to 1 second. The characteristic curves exhibit an accurate relationship with all the measured data points. Each curve is utilized for color mapping as shown in Fig. 3. The figure illustrates the RGB component change caused by a luminance modification from l1 to l2, where l1 is the luminance of the original image, while l2 is the luminance after image enhancement. If the luminance is increased from l1 to l2, then we can find the log exposure change from the exposure curve log(Φ) = fY(l), where Φ is the exposure
Fig. 2. Comparison between camera characteristic curves and measured data. (a) R component (b) G component (c) B component.
Fig. 3. Nonlinear characteristic curves of a camera and shift in exposure after luminance change. (left: the log exposure of Y component, right: the characteristic curves of R, G, and B components).
defined as the product of the sensor irradiance and exposure time. The log exposure change Δb with respect to the luminance change from l1 to l2 is given as
Δb = log(Φ 2 ) − log(Φ1 ) = fY (l2 ) − fY (l1 )
(6)
We obtain the modified color values by applying the same exposure change to the camera RGB component-luminance curves. We perform inverse mapping after applying the exposure change to r1, g1, and b1, which yields r2, g2, and b2, respectively. The compensated values are acquired by shifting the log exposure of the original image for each R, G, and B channel, i.e.
r2 = f R−1 ( f R (r1 ) + Δb), g 2 = fG−1 ( fG ( g1 ) + Δb), b2 = f B−1 ( f B (b1 ) + Δb),
(7)
Fig. 4. (a) Curves describing the mapping of the output color for each color compensation method in green-blue space. The plots are 2D projections of curves in 3D RGB color space for succinct representation. The input color for this example is (R,G,B) = (0.7843, 0.7843, 0.3922). (b) Saturation in green-blue space.
where r2, g2, and b2 are compensated values, and fR(r), fG(g), and fB(b) denote the log exposure of the R, G, and B components. Therefore, we obtain the modified color values using the luminance-RGB component data set based on the camera characteristic and perform color compensation by utilizing the curve. The resultant image of the proposed algorithm is formed by setting the R, G, B values of pixels to the modified luminance of the pixels. The proposed method has low computational cost. Equations (6) and (7) are implemented using a look up table and linear interpolation of neighboring levels, which is simple and fast. Fig. 4 (a) compares the mapping of green and blue components of the original image after luminance change. It is a projection of a curve in three-dimensional RGB space to two-dimensional G-B plane for succinct explanation. The values of R, G, and B components of the original image are 0.784, 0.784 and 0.392, respectively for this example. The dashed line shows a gray scale line, the solid line represents the result of shifting technique, and the dash-dot line shows the result of Naik and Murthy’s method. The three lines with a marker are the result of our proposed method for three consumer cameras from three different manufacturers. Fig. 4 (b) shows the saturation of the example. We observe that the saturation of the grey line is zero, and the saturation is high for colors away from the gray line. From Fig. 4 (a) we observe that shifting cannot follow dark or bright regions due to cutoff or saturation. Naik and Murthy’s method can have 0% black or 100% white level; however it reduces color saturation due to scaling. In contrast, our proposed method is able to respond to the brightness change while maintaining high saturation, which results in vivid color rendition. We also observe different color characteristics of cameras A, B, and C: camera B shows high saturation in bright and dark regions, while camera C results in the lower saturation in both of the areas. Camera A is positioned in the middle of the camera B and C.
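A minimal sketch of this lookup-table mapping is shown below, assuming the characteristic curves fY, fR, fG, fB are available as monotonically sampled arrays (for example, recovered with the Debevec-Malik method [9]); np.interp supplies the linear interpolation between neighboring levels. The curve names and data layout are assumptions for illustration.

```python
import numpy as np


def compensate_pixel(r1, g1, b1, l1, l2, curves):
    """Shift each channel's log exposure by db = fY(l2) - fY(l1) (Eq. (6))
    and invert the per-channel curves to get r2, g2, b2 (Eq. (7)).

    curves: dict mapping 'Y', 'R', 'G', 'B' to (values, log_exposures),
            each sampled in increasing order (curves are monotonic).
    """
    def fwd(name, v):            # value -> log exposure
        vals, log_e = curves[name]
        return np.interp(v, vals, log_e)

    def inv(name, e):            # log exposure -> value
        vals, log_e = curves[name]
        return np.interp(e, log_e, vals)

    db = fwd('Y', l2) - fwd('Y', l1)                  # Eq. (6)
    return (inv('R', fwd('R', r1) + db),              # Eq. (7)
            inv('G', fwd('G', g1) + db),
            inv('B', fwd('B', b1) + db))
```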
4 Experimental Results
We show the following experimental results to compare the existing methods to our proposed method. Fig. 5 and Fig. 6 show the results using images taken by camera A under overexposed and underexposed outdoor conditions. Fig. 5(a) and Fig. 6(a) are the original images. Gray scale histogram equalization is applied first, and then three different color compensation methods are applied: (b) is the result of the shifting technique, (c) is the result of image enhancement by Naik and Murthy, and (d) is the result of our proposed method using the nonlinear characteristic curves of camera A. While the shifting method (b) tends to show excessive saturation, Naik's method (c) shows a faded image because of low saturation. The result of the proposed method (d) shows natural-looking color compensation. The performances of the three methods are compared quantitatively using a simulation of exposure change, as shown in Fig. 7 and Table 1. Fig. 7 shows simulation results of an exposure change from 1/200 sec to 1/100 sec under daylight illumination conditions (D65). The images of column (a) are taken at a shutter speed of 1/200 of a second by three cameras from different manufacturers. First we calculate a gray image at an exposure time of 1/100 of a second and then calculate the color values using shifting, Naik's method, and the proposed method, which are shown in columns (c), (d), and (e), respectively. The ground truth images taken at 1/100 of a second are shown in column (b),
Fig. 5. Compensated results of an overexposed input image. (a) original overexposed image (b) Shifting (c) Naik (d) the proposed method.
Fig. 6. Compensated results of an underexposed input image. (a) original underexposed image (b) Shifting (c) Naik (d) the proposed method.
Fig. 7. Simulation results of exposure change from 1/200 sec to 1/100 sec under daylight illumination conditions (D65). Each row shows a different camera model. Column (a) is the original image captured at shutter speed 1/200 sec and (b) is the ground truth image captured at shutter speed 1/100 sec. The images of columns (c), (d), and (e) are simulations of a 1/100 sec image from the 1/200 sec image, using shifting, Naik, and the proposed method, respectively.
Table 1. Comparison of the rms error of the color compensation methods. Column (b) of Fig. 7 contains the ground truth images, and columns (c), (d), and (e) are images simulated using the color compensation methods of shifting, Naik, and the proposed method, respectively. The maximum of the color components is normalized to 1.
           (c) Shifting   (d) Naik   (e) proposed
camera A      0.011        0.015        0.004
camera B      0.013        0.016        0.010
camera C      0.017        0.021        0.010
and they are compared with the simulated images. The ground truth images (b) and the resultant images of the proposed method (e) show visually similar color rendition. Table 1 shows the root mean square (rms) errors between the simulated images and the ground truth images (b). The range of each RGB component is from 0 to 1. The error of the proposed method is the smallest, which confirms our visual observation. We also note that the error of shifting is smaller than that of Naik's method. Since camera A has the highest accuracy, we infer that the fitting error of its luminance-RGB component curve is smaller than that of the other cameras. As shown in Table 1, the color values estimated by our algorithm from the luminance-RGB component curve match the ground truth accurately.
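For reference, the rms figures in Table 1 correspond to the following straightforward computation over normalized images; the array names are placeholders.

```python
import numpy as np


def rms_error(simulated, ground_truth):
    """Root-mean-square error over all pixels and channels, with the
    color components normalized to [0, 1]."""
    diff = simulated.astype(np.float64) - ground_truth.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))
```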
5 Conclusion
We presented a new efficient color compensation method using camera characteristic curves for fast and accurate color compensation. The proposed method achieves accurate performance, since it utilizes the nonlinear relationship between luminance and color of a camera. We employed camera characteristic curves instead of resorting to a huge database; thus, the proposed algorithm is fast and accurate. We compared the performance of existing shifting technique and hue preserving algorithm to our proposed method, and verified that the proposed method results in better color quality. Quantitative measurements also confirm our subjective image quality assessment. This work has a wide variety of applications in color image processing and is especially suitable to be embedded in a camera since the algorithm is simple and accurate, and it is based on the sensor characteristics of the camera capturing the image. Acknowledgement. This research was partly supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0010378). It was also supported by the Human Resource Training Project for Strategic Technology through the Korea Institute for Advancement of Technology (KIAT) funded by the Ministry of Knowledge Economy, the Republic of Korea.
References 1. Pratt, W.K.: Digital Image Processing, 3rd edn. Wiley Interscience, Hoboken (2001) 2. Yang, C.C., Rodriguez, J.J.: Efficient luminance and saturation processing techniques for bypassing color coordinate transformations. In: Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, vol. 1, pp. 667–672 (1995)
3. Trahanias, P.E., Venetsanopoulos, A.N.: Color image enhancement through 3-D histogram equalization. In: Proc. 15th IAPR Int. Conf. Pattern Recognition, vol. 1, pp. 545–548 (August-September 1992) 4. Menotti, D., Najman, L., de Albuquerque, A., Facon, J.: A Fast Hue-Preserving Histogram Equalization Method for Color Image Enhancement using a Bayesian Framework. In: Proc. 14th International Workshop on Systems, Signal & Image Processing (IWSSIP), pp. 414–417 (June 2007) 5. Naik, S., Murthy, C.: Hue-preserving color image enhancement without gamut problem. IEEE Trans. Image Processing 12(12), 1591–1598 (2003) 6. Huang, Y., Hui, L., Goh, K.H.: Hue-based color saturation compensation. In: IEEE International Conference on Consumer Electronics, pp. 160–164 (September 2004) 7. Lee, H.-W., Yang, S., Lee, B.-U.: Color compensation of histogram equalized images. In: IS&T/SPIE Electronic Imaging, SPIE, vol. 7241, pp. 724111-1–9 (January 2009) 8. Arici, T., Dikbas, S., Altunbasak, Y.: A Histogram modification framework and its application for image contrast enhancement. IEEE Trans. Image Processing 18(9), 1921–1935 (2009) 9. Debevec, P.E., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: Proc. the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 369–378 (1997) 10. Hasinoff, S.W., Durand, F., Freeman, W.: Noise-Optimal Capture for High Dynamic Range Photography. In: Proc. Computer Vision and Pattern Recognition, pp. 553–560 (2010) 11. Matusik, W., Pfister, H., Ngan, A., Beardsley, P., Ziegler, R., McMillan, L.: Image-Based 3D Photography Using Opacity Hulls. In: SIGGRAPH 2002, pp. 427–437 (2002)
Augmenting Heteronanostructure Visualization with Haptic Feedback Michel Abdul-Massih, Bedřich Beneš, Tong Zhang, Christopher Platzer, William Leavenworth, Huilong Zhuo, Edwin R. García, and Zhiwen Liang Purdue University
Abstract. We address the need of researchers in nanotechnology who desire an increased level of perceptualization of their simulation data by adding haptic feedback to existing multidimensional volumetric visualizations. Our approach uses volumetric data from simulation of an LED heteronanostructure, and it translates projected values of amplitude of an electromagnetic field into a force that is delivered interactively to the user. The user can vary the types of forces, and they are then applied to a haptic feedback device with three degrees of freedom. We describe our methods to simulate the heteronanostructure, volume rendering, and generating adequate forces for feedback. A thirty one subject study was performed. Users were asked to identify key areas of the heteronanostructure with only visualization, and then with visualization and the haptic device. Our results favor the usage of haptic devices as a complement to 3-D visualizations of the volumetric data. Test subjects responded that haptic feedback helped them to understand the data. Also, the shape of the structure was better recognized with the use of visuohaptic feedback than with visualization only.
1 Introduction
Multidimensional data visualization has been addressed by the computer graphics community for a very long time. The traditional way to visualize such data is by limiting the rendering into the 3-D space and by substituting higher dimensions with visual clues, such as color, glyphs, or other geometry. However, some data sets are suitable for different approaches. Those from physics simulations are one kind, where electromagnetic fields or forces come into play. Rendering the forces directly as forces using haptic devices is intuitive. Haptic rendering offers a physical level of interaction between the real and virtual realms. It requires special devices (such as the Falcon device in Figure 1) that translate the data into mechanical energy to provide tactile interaction [1]. Although visualization is one-directional, the use of haptic feedback-capable devices enables a bidirectional interaction to be achieved [2]. LED technology has been growing in popularity in recent years, but several roadblocks have prevented its advancement in efficiency and spectral emission. They include nucleation, phase segregation, and piezoelectric and spontaneous G. Bebis et al. (Eds.): ISVC 2011, Part II, LNCS 6939, pp. 627–636, 2011. © Springer-Verlag Berlin Heidelberg 2011
polarization fields. Nanotechnology researchers are trying to engineer the quantum well geometry of these heteronanostructures to suppress their built-in electric fields. We hypothesize that using haptics for heteronanostructure visualization will significantly increase the user's perception and understanding of complex structures the human eye might miss. We assume that users will be able to more easily identify electric field intensities in the heteronanostructure with the addition of a haptic device's force feedback. The focus of our research is to create an application that will provide visuohaptic feedback and to test whether users will find visual feedback complemented with haptic feedback more useful than purely visual feedback in the task of finding key areas of an LED heteronanostructure's electric field. Our application addresses both experts in the field and novice users, such as students of material science.
Fig. 1. The Falcon haptic device used in our testing
2 Previous Work
The need for broadening perceptual experience beyond visual clues has been identified in [3], and the same work addresses several challenges that must be overcome, such as real-time processing and much higher update rates (1 kHz). Existing haptic devices allow humans to receive information from three translational directions, but certain applications require six degrees of freedom (DOF) [4]. However, 6-DOF haptic rendering is significantly harder [5] because all contact surfaces must be detected instead of stopping at the first detection. Calculating a reaction force and torque at every point is expensive at the high refresh rate, and the same may be said about controlling geometry-driven haptic instability, such as forcing the object into a narrow cavity [6]. Beyond the common problems of haptic devices, there are specific issues when simulating objects at the micro- and nanoscales. Pacoret et al. [7] allowed the user to feel Brownian motion when doing micromanipulation with optical tweezers. Marliere et al. [8] added multisensorial interaction (seeing, hearing, and feeling) to an Atomic Force Microscope's (AFM) manipulation of nano objects. A 1-DOF haptic device was used for the manipulation, and simulated forces were returned to it, closing the loop between human movements and the actual physical forces being generated. Sitti et al. [9] concentrated on nanomanipulation with AFM for 1-D and 2-D manipulations. They proposed new strategies for 3-D handling, e.g., picking and placing atoms. They later refined the method for giving force feedback based on nanoscale objects in [10]. Because the purpose is to remotely manipulate atomic objects, forces that are incorrectly rendered can damage the experimental samples. The above-mentioned research seeks to simulate real physical interactions. In our work, we focus on using force feedback to convey information that exists inside volumetric data.
Different approaches are available for using haptics with volumetric data. Laehyun et al. [11] focused on a haptic rendering technique that uses hybrid surface representation. The geometric data provides the basic 3-D structure of the object, and the implicit data are mapped around the exterior of the object to provide a simulated tactile response. McNeely et al. proposed a method to manipulate a complex rigid object within an arbitrarily complex environment of rigid objects [6]. Their approach uses collision detection based on probing a voxelized environment with surface point samples. Durbeck et al. [12] integrated a haptic device with general-purpose scientific visualization software. Using their Sensable Phantom Classic hardware, they were able to receive force feedback on a 3-D vector field. Ikits et al. used directional constraints for guiding the user through volumetric data [13]. They proposed constraining the tool proxy along streamlines in a vector field and adding force to represent magnitude and tick marks to represent speed. The main contribution of our work is the actual evaluation and user study performed, which attempts to quantify the effect of haptics on the final perception. In the next section we briefly describe the nanorod simulation, the data structures used, its visual rendering, and the haptic feedback. Section 5 describes our implementation and results of the tests. Section 6 concludes the paper and discusses some potential issues for future work.
3 Method

3.1 Nanorod Simulation
(In, Ga)N nanostructures show great promise as the basis for next-generation LED lighting technology because they offer the possibility of directly converting electrical energy into light of any visible wavelength without the use of down-converting phosphors. An LED (Light Emitting Diode) is a solid-state device that transforms current into light by confining electrons and holes into a well where they recombine into photons. A heterostructure is a system made of two or more dissimilar semiconducting materials that, when put into contact, create a transition region (a depletion zone) where electrons have to perform electrical work to cross in one direction and are favored to cross when injected in the opposite direction. In this context, an LED heteronanostructure is a system composed of two semiconducting materials whose dimensions are on the order of 10s of nanometers in size. In this work, 3-D calculations of the mechanical and electrical equilibrium states in out-of-the-box nanoheterostructures with pyramidal caps, a geometry not very commonly assembled, demonstrate that by tuning the quantum well to cladding layer thickness ratio, hw/hc, a zero built-in electric field can be experimentally realized, especially for hw/hc = 1.28, in the limit of large hc values [14]. In [14] it was found that the (In, Ga)N system naturally grows this structure under the right conditions and promises to deliver efficiencies that would be unattainable through the traditional thin-film geometry. Traditionally, LEDs are fabricated as thin films or mesas.
Simulation of the nanorod structure was carried out over a duration of 47 hours on a supercomputer system equipped with Red Hat Enterprise Linux 5 and 128 GB of RAM. A resulting mesh of 150×150×150 elements was generated. Each element carries multiple values; one is the electric field defined as $\mathbf{E} = (E_x, E_y, E_z)$, and the other is a stress tensor
$\sigma = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\ \sigma_{yx} & \sigma_{yy} & \sigma_{yz} \\ \sigma_{zx} & \sigma_{zy} & \sigma_{zz} \end{bmatrix}$,
from which the hydrostatic stress is calculated as $\sigma_h = \frac{1}{3}(\sigma_{xx} + \sigma_{yy} + \sigma_{zz})$. We have performed various visualizations of the 3-D volumetric data, such as individual components of $\sigma$ or $\mathbf{E}$. In all, the haptic feedback was generated from the magnitude of the electric field:
$E = |\mathbf{E}| = \sqrt{E_x^2 + E_y^2 + E_z^2}$. (1)
For this application, the electric field is extremely important to visualize because ideally, a zero electric field in this structure is desired in order to have the greatest chance for the electrons and holes in the pyramid to find each other and react into a photon. A non-zero electric field shifts electrons to opposite sides in the thin-film cap, making it less likely for the recombination of electrons and holes into light to take place. The electric field in a thin film is around 2 MV/cm, while in the equivalent nanorod it reaches only 0.5 MV/cm.
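These per-voxel quantities map directly onto array operations; the sketch below assumes the simulation output is available as NumPy arrays over the 150³ grid, with hypothetical shapes noted in the docstring.

```python
import numpy as np


def derived_fields(sigma, E):
    """Hydrostatic stress and electric-field magnitude per voxel.

    sigma : (150, 150, 150, 3, 3) stress tensor per element
    E     : (150, 150, 150, 3) electric field (Ex, Ey, Ez) per element
    """
    # sigma_h = 1/3 (sigma_xx + sigma_yy + sigma_zz), i.e. one third of the trace
    sigma_h = np.einsum('...ii->...', sigma) / 3.0
    # |E| = sqrt(Ex^2 + Ey^2 + Ez^2); this magnitude drives the haptic feedback
    e_mag = np.linalg.norm(E, axis=-1)
    return sigma_h, e_mag
```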
3.2 Volume Rendering
We have used volumetric ray casting to visualize the 3-D volumetric array from the previous section. Ray casting for the nanorod was carried out on a massively parallel processor (NVIDIA GeForce GTX 480) using CUDA to achieve interactive performance. The step size used for sampling along the ray cast from each screen pixel was one voxel, and we used trilinear interpolation for sampling. The transfer function used for converting electric field magnitudes to color was a gradient that made a transition from blue to red; the function was provided by researchers in physics. The volume rendering is constrained to the interior of the nanorod's geometry, and several examples are shown in Figure 2.
3.3 Haptic Feedback
Haptic feedback is generated by processing the magnitude of the nanorod’s electric field (1) at the user’s desired location with a force transfer function f = F (E)
(2)
Fig. 2. (a) Volumetric rendering of the nanorod’s field with the haptic cursor displayed as a sphere, (b) top view of the nanorod, and (c) shows a cut through the nanorod showing the intensity of the electric field mapped to a color variation
which converts E = |E| into a force f that is rendered through the haptic device. A 3-DOF translation device was chosen for the implementation of the nanorod's electric field haptic feedback. A visual proxy object represents the location of the haptic device inside the virtual world (Figure 2a). The value of the nanorod's electric field at the proxy's position is sampled using nearest-neighbor interpolation. The continuous haptic loop performs the aforementioned steps at a minimum of 1 kHz, providing the perception of different types of forces.
3.4 Forces
Three types of haptic feedback were used to try to convey the nanorod's electric field information: vibration, stiffness, and stiffness with vibration. Note that the equations below omit the constants necessary to accommodate the range of values of the haptic device to the range of values of the nanorod simulation. Force type 1: presents the user with a vibration. The term F from (2) is evaluated as a function of a small random vector r and a sine wave whose amplitude and frequency are modulated by E, giving f 1 = rE sin(Eω).
(3)
We have used a fixed frequency ω = 50Hz in our implementation. The function (3) yields less vibration amplitude for the low magnitudes of E and more for the high magnitudes. Force type 2: gives the user a better understanding of the nanorod’s shape. The force vector is damping, and it acts in the opposite direction of the user motion v with a strength proportional to the intensity of the electric field. f 2 = −vE.
(4)
The force is proportional to the force with which the user is moving the device’s tool. Therefore this force feedback creates the effect of a surface being touched
on the boundaries with a high gradient of E. These are well articulated on the edges of the nanorod and in the proximity of the quantum well. Force type 3: The last type of force is a combination of the vibration and the damping. The objective of this feedback is to try to convey shape and high- and low-magnitude locations in the electric field at the same time.
f 3 = f 1 + f 2.
(5)
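A schematic Python version of the force computation inside the haptic loop is sketched below for the three feedback types; the gain constants (omitted in the equations above) are placeholders, the field is sampled with a nearest-neighbour lookup as described in Sect. 3.3, and a time argument is added to the vibration term so that it actually oscillates in the loop (the text writes the modulation simply as sin(Eω)).

```python
import numpy as np

OMEGA = 2.0 * np.pi * 50.0        # fixed 50 Hz vibration frequency
K_VIB, K_DAMP = 1.0, 1.0          # placeholder gains


def sample_field(e_mag, proxy_pos):
    """Nearest-neighbour sample of |E| at the proxy position (voxel coords)."""
    idx = np.clip(np.round(proxy_pos).astype(int), 0,
                  np.array(e_mag.shape) - 1)
    return e_mag[tuple(idx)]


def feedback_force(E, v, t, force_type):
    """Force types 1-3: vibration (Eq. 3), damping (Eq. 4), and their sum (Eq. 5).

    E : field magnitude at the proxy, v : user (proxy) velocity, t : time.
    """
    rng = np.random.default_rng()
    f1 = K_VIB * rng.uniform(-1.0, 1.0, 3) * E * np.sin(E * OMEGA * t)
    f2 = -K_DAMP * v * E
    return {1: f1, 2: f2, 3: f1 + f2}[force_type]
```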
4 Assessment and Experiment
Traditional methods of perceptualization follow the top path of Fig. 3, but our implementation includes haptic feedback in parallel with this, adding another layer of interaction between the user's senses and the simulation. Our hypothesis is that by adding haptic feedback to a volumetric heteronanostructure visualization, the user will more easily perceive the high magnitudes and the structure's shape in the simulation environment. By adding another level of complexity, the result could be that the user is overwhelmed with information; thus the haptic device would essentially become useless. To assess the effectiveness of our implementation, a series of tests were formulated and carried out. A set of questions in the form of a questionnaire (see Appendix) was given to a random sample of subjects with no prior exposure to haptic devices and no prior knowledge of the heteronanostructure. The experiment was carried out for one week, and a total of 31 participants were tested (16 males and 15 females, average age 22). All subjects were students of different (technical and nontechnical) areas from a large university. They were first introduced to the haptic device by interacting with a Chai 3-D demo application [15], which demonstrated the various methods of force feedback that the Falcon device can exert. They were then asked to visually assess the heteronanostructure using only the computer's mouse. Once they were familiar with the structure, they were asked to assess it with only the haptic device. The questionnaire asked them to do a simple task: identify the key structural areas with both the mouse and the haptic device. They were asked to respond as indicated in the following scale: Strongly Disagree-1, Disagree-2, Neutral-3, Agree-4, Strongly Agree-5.
Fig. 3. Perceptualization model
5 Implementation and Results
5.1 Implementation
Our system is implemented in C++ and uses OpenGL for visualization, Chai 3D for haptic rendering, and CUDA for volumetric ray casting. The haptic device
used in our testing was a Novint Falcon (see Figure 1). We have performed all testing on a computer equipped with an Intel Xeon X7500 series processor, 12 GB of RAM, an NVIDIA GeForce 480 GTX graphics card with 1.5 GB of memory, and 64-bit Windows 7. The volume and haptic rendering processes are performed in separate CPU threads; however, the haptic rendering process uses data from the volume rendering. With this implementation, we achieved the following results. Haptic rendering was performed stably at approximately 1 kHz. Volume rendering yielded an average of 20 frames per second with a window size of 400 × 400 pixels. The most time-demanding operation of the implementation was the trilinear interpolation, and the frame rate could be affected by the step size of the ray casting.
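As a reference for the cost involved, trilinear interpolation of a scalar field on a regular grid (the operation identified above as the most time-demanding) can be sketched as follows; the grid layout and names are ours and not taken from the authors' CUDA code, and bounds checking is omitted:

```cpp
#include <cstddef>

// Scalar volume on a regular nx x ny x nz grid, stored with x varying fastest.
struct Volume {
    std::size_t nx, ny, nz;
    const float* data;
    float at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[(k * ny + j) * nx + i];
    }
};

// Trilinear interpolation at a position given in grid coordinates.
// The caller must ensure x, y, z lie inside the grid (at least one cell from the border).
float trilinear(const Volume& v, float x, float y, float z) {
    std::size_t i = (std::size_t)x, j = (std::size_t)y, k = (std::size_t)z;
    float fx = x - i, fy = y - j, fz = z - k;     // fractional offsets within the cell

    // Interpolate along x on the four cell edges, then along y, then along z.
    float c00 = v.at(i, j,     k    ) * (1 - fx) + v.at(i + 1, j,     k    ) * fx;
    float c10 = v.at(i, j + 1, k    ) * (1 - fx) + v.at(i + 1, j + 1, k    ) * fx;
    float c01 = v.at(i, j,     k + 1) * (1 - fx) + v.at(i + 1, j,     k + 1) * fx;
    float c11 = v.at(i, j + 1, k + 1) * (1 - fx) + v.at(i + 1, j + 1, k + 1) * fx;
    float c0 = c00 * (1 - fy) + c10 * fy;
    float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}
```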
5.2 User Study
To analyze the results, we performed two-tailed paired t-tests between specific questions to compare the average difference of the users' rankings for purely visual feedback and for visuohaptic feedback (see Table 3). The questionnaire used for our testing can be found in Appendix A.

Table 1. Statistical comparison and analysis of the visual and haptic feedback

              Ground Truth (bottom)               Ground Truth (middle)
Method     Similar?  Q No.  Conf. Interval     Similar?  Q No.  Conf. Interval
Visual     Y         Q1     [2.94, 3.83]       Y         Q2     [1.89, 2.76]
Force 1    N         Q4     [2.24, 3.17]       Y         Q5     [2.41, 3.27]
Force 2    Y         Q6     [2.80, 3.83]       N         Q7     [2.84, 3.87]
Force 3    Y         Q8     [3.07, 4.09]       N         Q9     [3.41, 4.32]
Table 1 shows a comparison of the conclusions obtained from the results of the user tests and the ground truth (real values from the data) of the magnitude measured at the bottom and middle parts of the nanorod. For example, the visual feedback and the haptic feedback with Forces 2 and 3 were perceived as similar to the ground truth for the lower part of the nanorod. Another example from the same table shows that Force 2 and the ground truth for the middle of the structure are different. The reason is that the middle part of the structure does not present a high magnitude of the electric field. The table also includes the number of the corresponding question (see Appendix) and the confidence interval for the mean value of the responses to each specific question. From visual examination, users generally were able to correctly conclude that there was a high-magnitude field at the bottom of the cone and a low-magnitude field at the middle. Using Forces 1, 2 and 3, users in general correctly answered only one of the two questions. For Force 1, the average conclusion was that there was a low field regardless of the position being tested. This might be because the vibrations felt weaker than the damping force, and even though there were high and low vibration amplitudes, users perceived them both as lower than the damping
Table 2. Visual versus haptic results

           Ground Truth (top)
Method   Similar?  Q No.  Conf. Interval
Visual   N         Q3     [3.74, 4.58]
Haptic   Y         Q10    [2.61, 3.52]
Table 3. Questionnaire analysis

Summary     Avg. Difference  t-test      p-value  Significant Difference?
Q1 vs Q4    0.67             t(30)=2.46  0.020    Y
Q1 vs Q6    0.06             t(30)=0.22  0.825    N
Q1 vs Q8    -0.19            t(30)=0.64  0.526    N
Q2 vs Q5    -0.51            t(30)=1.55  0.132    N
Q2 vs Q7    -1.03            t(30)=3.16  0.003    Y
Q2 vs Q9    -1.54            t(30)=5.36  0.000    Y
Q3 vs Q10   1.09             t(30)=3.51  0.001    Y
and ranked any magnitude as a low magnitude. On the other hand, when testing Forces 2 and 3, users generally concluded that a high-magnitude field was at both the base and middle locations. Users' written comments strongly suggested that the problem lay in the difficulty of visually finding the middle of the cone, because the depth of the tool proxy could not be perceived well inside the volumetric data. Table 2 shows a comparison similar to the previous table regarding the shape of the top of the nanorod. Results from visuohaptic feedback inclined toward the correct answer (that the top of the cone is curved and not pointed), but those from only a visual inspection were wrong. From question 11, the conclusion can be made that users are not sure whether the strictly visual approach would be preferred for analyzing volumetric data, with μ = 2.97, σ = 1.28 and a confidence interval of [2.52, 3.41]. Most importantly, users agreed that the haptic device helped them to understand the high- and low-field areas of the data, with μ = 3.61, σ = 0.92 and a confidence interval of [2.29, 3.93]. Table 3 shows the results of t-tests performed to determine whether there is a significant difference between pairs of questions concerning visual feedback versus haptic feedback. We can see that there is a significant difference between four pairs of questions and no significant difference between three pairs. No significant difference exists between Forces 2 and 3 and visual feedback when evaluating the base of the cone. No significant difference exists when looking at the middle of the cone and using Force 1. There is a significant difference between visual feedback and Force 1 for the base of the cone. Also, a significant difference can be seen between visual feedback and Forces 2 and 3 when evaluating the middle of the cone. Finally, there is a significant difference in the responses concerning the top of the cone, where haptic feedback aided in concluding that the top of the structure is actually curved.
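The paired comparisons reported in Table 3 follow the standard two-tailed paired t-test; a minimal sketch of the statistic (our own helper, not the authors' analysis code) is:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Paired t statistic for two response vectors of equal length n (here n = 31).
// Returns t with n-1 degrees of freedom; the two-tailed p-value is then read
// from a t-distribution table or a statistics library.
double pairedT(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    double mean = 0.0;
    for (std::size_t i = 0; i < n; ++i) mean += a[i] - b[i];
    mean /= n;                                   // mean of the paired differences
    double ss = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double d = (a[i] - b[i]) - mean;
        ss += d * d;
    }
    double sd = std::sqrt(ss / (n - 1));         // sample std. dev. of the differences
    return mean / (sd / std::sqrt((double)n));   // t = mean_diff / standard error
}
```

For n = 31 respondents (30 degrees of freedom), a two-tailed result is significant at the 0.05 level when |t| exceeds roughly 2.04, which is consistent with the significant rows of Table 3.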
6 Conclusions and Future Work
We have presented an implementation of a system for visuohaptic rendering of force feedback via a haptic device, which complements the visual rendering of a heteronanostructure's electric field. Our approach translates the magnitude at the approximate position of the proxy within the volumetric dataset into a force. This force is then rendered via a haptic feedback device. A user study was conducted to evaluate the effectiveness of our implementation. Test subjects' responses were recorded and statistically analyzed to determine the significance of the difference between the average rankings given to visual feedback and haptic feedback for three specific zones of the volumetric data. This analysis showed that our implementation of haptic feedback was beneficial in assisting users to more accurately identify key areas within the heteronanostructure and to better understand the shape of the nanorod. There are several possible avenues for future work. The position of the proxy was difficult for some users to discern because of the represented depth; different views of the structure might help to alleviate this issue. More rigorous tests of future implementations will be performed to obtain greater statistical significance. Finally, datasets from the original simulation other than the electric field will be used, permitting the user, for example, to see the electric field while touching the geometry, to interpret temperature as a force, or to be guided along electric field streamlines.
References
1. Avila, R.S., Sobierajski, L.M.: A haptic interaction method for volume visualization. In: Proceedings of the 7th Conference on Visualization 1996, pp. 197–204 (1996)
2. Salisbury, K., Conti, F., Barbagli, F.: Haptic rendering: Introductory concepts (2004)
3. Colgate, J.E., Brown, J.M.: Factors affecting the z-width of a haptic display. In: IEEE Conference on Robotics and Automation, pp. 3205–3210 (1994)
4. Gregory, A., Mascarenhas, A., Ehmann, S., Lin, M., Manocha, D.: Six degree-of-freedom haptic display of polygonal models. In: Proceedings of the Conference on Visualization, VIS 2000, pp. 139–146. IEEE Computer Society Press, Los Alamitos (2000)
5. Ogawa, Y., Fujishiro, I., Suzuki, Y., Takeshima, Y.: Designing 6dof haptic transfer functions for effective exploration of 3d diffusion tensor fields. In: Proceedings of the World Haptics 2009 - Third Joint EuroHaptics Conference and Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems, pp. 470–475. IEEE Computer Society, Washington, DC (2009)
6. Mcneely, W.A., Puterbaugh, K.D., Troy, J.J.: Six degree-of-freedom haptic rendering using voxel sampling. In: Proc. of ACM SIGGRAPH, pp. 401–408 (1999)
7. Pacoret, C., Bowman, R., Gibson, G., Haliyo, S., Carberry, D., Bergander, A., Régnier, S., Padgett, M.: Touching the microworld with force-feedback optical tweezers. Opt. Express 17, 10259–10264 (2009)
8. Marliere, S., Urma, D., Florens, J.L., Marchi, F.: Multi-sensorial interaction with a nano-scale phenomenon: the force curve. Computing Research Repository (2010)
9. Sitti, M., Aruk, B., Shintani, H., Hashimoto, H.: Scaled teleoperation system for nano-scale interaction and manipulation. Advanced Robotics 17, 275–291 (2003)
10. Sitti, M., Hashimoto, H.: Teleoperated touch feedback from the surfaces at the nanoscale: modeling and experiments. IEEE-ASME Transactions on Mechatronics 8, 287–298 (2003)
11. Kim, L., Kyrikou, A., Desbrun, M., Sukhatme, G.: An implicit-based haptic rendering technique. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots (2002)
12. Durbeck, L.J.K., Macias, N.J., Weinstein, D.M., Johnson, C.R., Hollerbach, J.M.: SCIRun haptic display for scientific visualization. In: Proc. Third Phantom User's Group Workshop, MIT RLE Report TR624, Massachusetts Institute of Technology, MIT (1998)
13. Ikits, M., Brederson, J.D., Hansen, C.D., Johnson, C.R.: A constraint-based technique for haptic volume exploration. In: Proceedings of IEEE Visualization 2003, pp. 263–269 (2003)
14. Liang, Z.: Simulation and Design of (In,Ga)N-Based Light Emitting Diodes. PhD thesis, Purdue University (2011)
15. Chai3D: http://www.chai3d.org/ (last accessed May 2011)
Appendix
An Analysis of Impostor Based Level of Detail Approximations for LIDAR Data
Chad Mourning, Scott Nykl, and David Chelberg
School of Electrical Engineering and Computer Science, Ohio University, Stocker Center, Athens, Ohio, USA, 45701
Abstract. LIDAR (Light Detection and Ranging) [1] is a remote sensing technology that is growing in popularity in varied and diverse disciplines. Modern LIDAR systems can produce substantial amounts of data in very brief amounts of time, so one of the greatest challenges facing researchers is processing and visualizing all of this information, particularly in real time. Ideally, a scientific visualization of a set of LIDAR data should provide an accurate view of all the available information; however, sometimes it is beneficial to exchange a small portion of that accuracy for the increased usability and flexibility of a real-time interactive display. The goal of this research is to characterize under what conditions the level-of-detail rendering technique known as impostors [2–4] can effectively optimize the inherent trade-offs between accuracy and interactivity in large-scale point cloud datasets.
1 Motivation
LIDAR (Light Detection and Ranging) [1] is a laser ranging technology used for remote sensing in a variety of fields. In recent times, LIDAR has seen increased use as a flexible tool in a diverse set of applications. As an example, LIDAR has been used in: Ecology, to measure river corridor topography [5]; Renewable Energy, to measure wind speeds for optimal wind turbine placement [6]; Athletics, to measure yacht racing conditions at the 2008 Olympic Games [7]; Agriculture, to calculate the sun-facing tilt of land [8]; Robotics, for environmental perception and object recognition [9]; and, most directly related to this research, Avionics [1]. This increasing use of LIDAR introduces substantial research challenges, in particular, processing the vast amount of data produced by LIDAR and effectively visualizing such large amounts of data. These tasks require increasingly sophisticated and expensive hardware, or advanced software techniques that can effectively make use of existing hardware. This paper will focus on the latter.
2 Previous Work
Impostors [2–4] have been used for many years as a low-cost replacement for high-fidelity models. Impostors are textured quadrilaterals that provide the user with a visual effect similar to polygonal models, but at a much cheaper rendering cost. While
impostors have been in common usage for over a decade, active research is still being performed, particularly on aspects of parallelization [10] and augmenting impostors with displacement maps [12, 13]. Because this paper deals primarily with ways to quickly approximate a scene while attempting to maintain a high-fidelity result, metrics are needed that can provide a way to quantify the similarity between the approximations created using the techniques presented in this paper (see chapter 3) and the non-approximated scene. To quantify our results we will use two image-based metrics, a luminosity-based metric [14] and a metric based on the Haar wavelet [15]. The Haar wavelet has previously been used for many image compression and processing techniques [16]. More in-depth descriptions of these metrics are found in chapter 4. Another common LOD technique for point cloud data is to sub-sample the points, removing a percentage of them. Some application-specific operations make this a poor choice of level-of-detail technique. We present impostor-based approaches as an alternative that may produce a lower overall error with regard to human perception.
3 Visualization of LIDAR Using Impostors
LIDAR data, in its raw form, is a set of 3D points; the size of this set can grow to massive proportions in a relatively short period of time. To visualize tens of millions of LIDAR points at highly interactive frame rates (~20-50 fps), our algorithm utilizes and combines several well-known approaches in a new way, namely, frustum culling using octree acceleration, point set subdivision into rectilinear grids, dynamic impostor generation, and an update policy controlling the fidelity/frequency of impostor generation. Prior to rendering, we split the LIDAR point cloud into an arbitrary number of rectilinear meshes of roughly equal size. This way, an impostor can be generated for each mesh instead of trying to create one large impostor for the entire point cloud. Furthermore, each mesh is inserted into an octree which is used to accelerate frustum culling on a per-frame basis. At each frame, the set of currently visible meshes is generated and given as input to the impostor update policy. The impostor update policy is responsible for minimizing all potential visual errors given the time constraint necessary to yield a highly interactive frame rate. Evaluated impostor update policies are discussed in section 4.2.
3.1 Transformation Equations
This section describes the foundational mathematics used to transform a vertex from its original pose (position/orientation) in world space to its final location in screen space given a specific camera and screen resolution. These equations and corresponding matrices are used throughout the sections discussing impostor generation and also describe the fundamental linear transformations performed by a canonical graphics pipeline such as OpenGL. For a more detailed description of these equations and the canonical graphics pipeline see a resource such as The OpenGL Programming Guide [17]. Equation 1 defines the traditional computations performed on a world space vertex vworld to transform it to screen space; V represents the view matrix defining the camera's position and orientation within the world, P represents the perspective projection
matrix defining the viewing frustum (this projection matrix may also be orthographic; however, we assume a perspective projection for this paper). NDC represents the normalized device coordinate transformation mapping vndc (normalized device space) to vs (screen space); the NDC transformation details are defined in Eq. 2, x and y define the position of the viewport's lower left corner, and n and f map the depth values from NDC space to screen space; n specifies the near plane while f specifies the far plane, and these values are commonly 0 and 1, respectively. vs.x and vs.y correspond to the current pixel location of vworld, and vs.z maps to the current depth into the screen from the near plane to vworld. In summary, a vertex vworld (world space) is converted to vs (screen space) via the following equations:

\[
\begin{bmatrix} v_s.x \\ v_s.y \\ v_s.z \end{bmatrix}
= NDC \times
\begin{bmatrix} \tfrac{1}{w} & 0 & 0 & 0 \\ 0 & \tfrac{1}{w} & 0 & 0 \\ 0 & 0 & \tfrac{1}{w} & 0 \end{bmatrix}
\times P \times V \times
\begin{bmatrix} v_{world}.x \\ v_{world}.y \\ v_{world}.z \\ 1 \end{bmatrix}
\tag{1}
\]

\[
\begin{bmatrix} v_s.x \\ v_s.y \\ v_s.z \end{bmatrix}
=
\begin{bmatrix} \tfrac{w}{2}\,(v_{ndc}.x + 1) + x \\ \tfrac{h}{2}\,(v_{ndc}.y + 1) + y \\ \tfrac{f-n}{2}\, v_{ndc}.z + \tfrac{f+n}{2} \end{bmatrix}
\tag{2}
\]

3.2 Impostor Generation
Impostor generation involves rendering a distant complex object or objects, obj, into a texture, tex, based on the relative position and orientation between the camera and obj at time t0. A quadrilateral, q, is then positioned at obj's location and texture tex is mapped to q. At this point, obj need not be rendered; in its place, the four vertices of q are rendered displaying tex to the viewer. Assuming q is positioned and sized properly and tex is properly sized and filtered, a viewer cannot distinguish between obj and q. Determining the quadrilateral's optimal position and size as well as the optimal texture size that provides a pixel-accurate re-creation is now described using the equations and matrices described in Sec. 3. To find the optimal texture size, such that one screen space pixel maps to exactly one texel of tex, we take the 8 world space vertices comprising obj's current oriented bounding box, objBndBoxws, and transform these 8 vertices into screen space using Eq. 1. Once in screen space, a tight axis-aligned 2D bounding box, objBndBoxss, is generated using the maximal and minimal x and y extrema of these 8 vertices; the z components, representing depth into screen space, are ignored; thus objBndBoxss contains 4 vertices. The width and height of this axis-aligned 2D bounding box correspond to the number of pixels (+/−0.5 pixels) that obj will occupy in the X and Y directions, respectively. In other words, each texel in tex will map to exactly one screen space pixel when rendering obj into tex, assuming tex is rendered parallel to the camera's view plane; this guaranteed one-to-one pixel-to-texel mapping ensures pixel-accurate reconstruction and is why we consider this the optimal texture size. Since the dimensions of tex have been determined, we can now render obj into tex, thus generating the desired texture for q. Now that tex is of the ideal size and has the proper contents, all that remains is to find the proper size, orientation, and position for q. We know the orientation of q needs
to be parallel to the view plane to ensure each texel of tex maps to one pixel in screen space; therefore, q needs to be oriented opposite to that of the camera’s pose. This is achieved by taking the camera’s current normalized direction cosine matrix (DCM) and negating the X and Y vector components. This preserves the handedness and ensures texture mapping tex from [0..1] is not flipped or upside down when applied to q in both the u and v directions; this technique also avoids recomputing an orthonormal basis set for q’s DCM. As for finding the proper size and world space position of q such that one texel will map to exactly one pixel (assuming q is parallel to the view plane), we take the 8 vertices in objBndBoxws as well as the oriented bounding box’s world space center point, bboxCenterws and convert these 9 vertices into screen space using the original viewport and projection matrix (prior to generating the contents of tex). Once again we update objBndBoxss ; however, this time, we overwrite the z coordinate (depth from the near plane in screen space) of each of the 4 vertices to bboxCenterss .z, and we also record the perspective divide value, w, of bboxCenterclip . Overwriting the z values enforces that the screen space axis aligned bounding box is parallel to the viewing plane and it also defines the depth of q in screen space; this depth in screen space is used to compute the distance from the camera that q will be positioned in world space. Using these 4 newly computed vertices we compute objBndBoxss ’s center point. At this point we have 5 screen space vertices, 4 defining the edges and 1 defining the center of the screen space bounding box; all 5 have a z value of bboxCenterss .z. Prior to further transformation, one may observe it is only necessary to represent the bounding box using two corner vertices, in our case we arbitrarily chose the upper left and lower right vertices (since we already have the center point, we really only needed to use one corner vertex). We can now transform these 3 vertices (upper left, lower right, and center) from screen space to world space by performing the inverse of Eqs.1,2; we use the perspective scalar w, from bboxCenterclip mentioned above for the inverse of the perspective divide (perspective multiply). Note that the inverse of Eq.2 rewrites the equation solving for vndc instead of vs : The inverse of Eqs.1,2 requires inverting the P ×V matrix; fortunately, the upper left 3x3 submatrix of the view matrix V is a DCM and the first 3 rows of the 4th column represent translation. Therefore, transposing V ’s DCM and negating V ’s translation values (x, y, z) is equivalent to inverting V. Furthermore, the perspective projection matrix P also has an inexpensive analytical solution to compute P −1 [17]. The result of transforming the 3 screen space vertices by the inverse of Eqs.1,2 yields 3 world space vertices. The world space vertex originating from the screen space bounding box’s center is the world space position of q’s center; note that this center point is different than the world space center of the oriented bounding box and achieves more precise results. 
This increased precision results from mapping the screen space center back into world space using the depth (z, w) values obtained from projecting bboxCenterws into screen space; essentially, we slightly offset q’s world space position from obj’s world space position to compensate for the fact that projecting the screen space axis aligned bounding box back into world space will occupy an area greater than or equal to the area consumed by the oriented bounding box projected onto a plane parallel to the near plane at a distance of objBndBoxeye .z. In other words, the impostor
will have an area greater than or equal to obj; in the case where the area is greater, q must be slightly offset in order to achieve the illusion that it is positioned exactly where obj was positioned. This is because q is parallel to the view plane, and obj is not necessarily parallel to the view plane as it is composed of 3D geometry and not a single quad. The remaining 2 world space vertices are subtracted to form a vector lxlyWS = lowerRight − upperLeft. This vector stores the optimal dimensions of q; however, it is expressed in world space concatenated with the camera's DCM as per the inverse of Eqs. 1, 2. This implies that lxlyWS does not necessarily lie in the current view plane, which means the x, y, z components may all have non-zero values. We wish to counter-rotate lxlyWS such that it lies in the current view plane, causing lxlyWS.x to drop to zero; in our right-handed model, Z is up, X is forward, and Y is left, so the YZ plane is the current view plane. This can be achieved by transforming lxlyWS through the inverse of the camera's current DCM; since the DCM is an orthonormal basis set of rank 3, its transpose is equivalent to its inverse and the 4th row and column may be set to identity. The transformed lxlyWS now has an x component of 0, a y component whose absolute value specifies q's width in world space, and a z component whose absolute value specifies q's height in world space. We have successfully generated an impostor; we computed the optimal texture size of tex, the contents inside of tex, q's optimal position, and q's optimal width and height without having to perform any costly matrix inversions.
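To make the texture-size step concrete, the following sketch projects the eight world-space corners of obj's bounding box with Eqs. (1)-(2) and measures the tight screen-space extent; the matrix and vector helpers are our own minimal stand-ins (column-major layout, as in OpenGL), not the authors' code:

```cpp
#include <algorithm>
#include <cmath>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[16]; };                        // column-major, as in OpenGL

Vec4 mul(const Mat4& M, const Vec4& v) {             // computes M * v
    return { M.m[0]*v.x + M.m[4]*v.y + M.m[8]*v.z  + M.m[12]*v.w,
             M.m[1]*v.x + M.m[5]*v.y + M.m[9]*v.z  + M.m[13]*v.w,
             M.m[2]*v.x + M.m[6]*v.y + M.m[10]*v.z + M.m[14]*v.w,
             M.m[3]*v.x + M.m[7]*v.y + M.m[11]*v.z + M.m[15]*v.w };
}

// Optimal impostor texture size: project the 8 corners of the object's world
// space bounding box to screen space (Eqs. 1-2) and take the extent of the
// tight 2D axis-aligned box, so one texel maps to one screen pixel.
void impostorTextureSize(const Vec4 corners[8], const Mat4& P, const Mat4& V,
                         float viewportW, float viewportH,
                         float viewportX, float viewportY,
                         int& texW, int& texH) {
    float minX = 1e30f, minY = 1e30f, maxX = -1e30f, maxY = -1e30f;
    for (int i = 0; i < 8; ++i) {
        Vec4 clip = mul(P, mul(V, corners[i]));
        float ndcX = clip.x / clip.w, ndcY = clip.y / clip.w;   // perspective divide
        float sx = 0.5f * viewportW * (ndcX + 1.0f) + viewportX; // Eq. (2)
        float sy = 0.5f * viewportH * (ndcY + 1.0f) + viewportY;
        minX = std::min(minX, sx); maxX = std::max(maxX, sx);
        minY = std::min(minY, sy); maxY = std::max(maxY, sy);
    }
    texW = (int)std::ceil(maxX - minX);
    texH = (int)std::ceil(maxY - minY);
}
```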
4 Results
Using the techniques presented in chapter 3, an approximate scene can quickly be generated. Our algorithm accomplishes this by mixing high fidelity portions of the scene with approximate portions stored from previous frames. These approximate portions are accurate if the camera has not moved since that portion was generated. The primary cause of differences between the approximate scene and the fully rendered scene is camera movement, causing a phenomenon known as cracking. On the left of Fig. 1 you can see an example of an approximate scene from the original camera position; on the right, the camera has translated, causing cracking. Additional minor errors can accumulate from the render-to-texture process, as well as round-off (partial pixel) errors from the equations described in chapter 3.
Fig. 1. Left) An approximate scene with no camera motion; Right) a scene with cracking caused by camera motion
4.1 Experimental Design
The following experiments were performed using an NVIDIA GeForce 280M graphics card with 1GB of RAM. Typically, framerates greater than 10 frames per second are considered interactive, so an example LIDAR data set of 10,000,000 data points was sufficient to cause the testing machine to generate only four fully rendered frames per second. For reference, we have access to data sets with hundreds of millions of data points, so the data set used in this example could be considered small. This data set was split into 64 subsets of roughly equal volume called meshes, as mentioned in chapter 3. The more the data set is subdivided, the lower the total per-frame error, increasing accuracy; however, additional computational overhead is introduced for every mesh, decreasing interactivity. Using the impostor generation technique presented in chapter 3, combined with suitable error metrics and update policies, a visualization can strike an optimal balance between accuracy and interactivity. To quantify the accuracy of the approximate scene, a preplanned flight path containing 827 frames was created. Each frame has the same pixel height and width, h and w. For these experiments, h = 600 and w = 800. Each fully rendered frame was stored in a baseline image set, B. Additional image sets were captured using the same flightpath but with a different combination of update policy, error metric, and number of updates per frame. We created image sets for the least recently used (LRU) update policy [18] and our weighted update policy using U = 1, 2, 4, and 64 (the total number of meshes) updates per frame. The LRU policy updates the U meshes that had been updated least recently every frame, whereas the weighted update policy scores each mesh based on weighted criteria that give the time since the last update twice the contribution of on-screen size, and updates the U meshes with the highest scores. This 2:1 weighting ratio was chosen through minor experimentation. A 1:1 weighting ratio did not produce a good metric, but through additional experimentation an even better weighting ratio may be established. During tests with the 1:1 weighting ratio, some regions were never scored high enough to get updated; this is a real possibility with any metric that is not fair. Two image-based metrics, well suited for quantifying human perception, were used to compare these two image sets to generate a quantified error: a simple luminosity difference [14] and a more complicated Haar wavelet based metric [15, 16] that takes into account the similarity of different regions rather than the image as a whole. A luminosity-based metric was devised based on the ITU-R Recommendation BT.601 [14] luma coefficients. For an RGB pixel, the corresponding luminosity value Y can be determined using Eq. (3). The examples used in this experiment were monochromatic in nature, so the luminosity metric was not strictly necessary, but it helped the algorithm generalize to other scenarios. Y (R, G, B) = 0.299R + 0.587G + 0.114B
(3)
A Haar Wavelet [15] based image metric, H, operating on two image sets, B, the baseline, and A, the approximation, was implemented as described in Eq. (6), where
An Analysis of Impostor Based Level of Detail Approximations
643
λ = log₂ min(w, h) and Xᵢ(m, n) was the RGB triplet representing the pixel in the mth row and nth column in the ith image of image set X:

\[
p(B, A, i, m, n) = \bigl|\, Y(B_i(m, n)) - Y(A_i(m, n)) \,\bigr| \tag{4}
\]

\[
P(B, A, i, j, k, l) = \sum_{m = w \cdot k / 2^j}^{w \cdot (k+1) / 2^j} \;\; \sum_{n = h \cdot l / 2^j}^{h \cdot (l+1) / 2^j} p(B, A, i, m, n) \cdot 0.5^{\,j+1} \tag{5}
\]

\[
H(B, A) = \sum_{i = 1}^{\text{number of frames}} \; \sum_{j = 0}^{\lambda} \left( \left( \sum_{k = 0}^{2^j} \sum_{l = 0}^{2^j} P(B, A, i, j, k, l) \right) \Big/ (w \cdot h) \right) \tag{6}
\]
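A sketch of how Eqs. (3)-(6) could be evaluated for a single frame pair is given below; this is our own illustrative transcription, not the authors' implementation, and the full metric additionally sums over all frames of the flight path:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct RGB { unsigned char r, g, b; };

// Eq. (3): ITU-R BT.601 luma.
double luma(const RGB& c) { return 0.299 * c.r + 0.587 * c.g + 0.114 * c.b; }

// Eq. (4): per-pixel absolute luminosity difference.
double pixelDiff(const RGB& b, const RGB& a) { return std::fabs(luma(b) - luma(a)); }

// Eqs. (5)-(6) for one baseline/approximation frame pair; images are row-major
// arrays of size w*h. Lower levels j use larger kernels and larger weights.
double haarMetric(const std::vector<RGB>& B, const std::vector<RGB>& A,
                  int w, int h) {
    int lambda = (int)std::log2((double)std::min(w, h));
    double H = 0.0;
    for (int j = 0; j <= lambda; ++j) {
        int kernels = 1 << j;                       // 2^j kernels per axis
        double levelSum = 0.0;
        for (int k = 0; k < kernels; ++k) {
            for (int l = 0; l < kernels; ++l) {
                double P = 0.0;                     // Eq. (5), one kernel (half-open ranges)
                for (int m = w * k / kernels; m < w * (k + 1) / kernels; ++m)
                    for (int n = h * l / kernels; n < h * (l + 1) / kernels; ++n)
                        P += pixelDiff(B[n * w + m], A[n * w + m]) * std::pow(0.5, j + 1);
                levelSum += P;
            }
        }
        H += levelSum / (w * h);                    // Eq. (6), per-level contribution
    }
    return H;
}
```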
Eq. (4) represents the per-pixel contribution to the overall metric. There are λ inspection levels, and at each level the kernel size decreases (increasing the number of kernels). The larger the kernel size, the greater the weight that is given to that level of inspection, such that the average of the entire image is given 50% of the weight, the averages of the quadrants are given 25% of the weight, the averages of the next level are given 12.5% of the weight, etc. The per-kernel contribution to the overall metric is described in (5).
4.2 Analysis
With 64 meshes, the values for the experiment updating 64 meshes per frame are the best possible result our algorithm produces in its current state. As expected, experiments with fewer updates per frame had higher mean error and standard deviation for both error metrics, as seen in Table 1. Fig. 2 shows the error for the Haar and Luminosity error metrics using the LRU update policy. Fig. 3 shows both error metrics again, but this time for the weighted update policy. Based on the values seen in Table 1, the Haar wavelet based metric exhibits qualities that make it a better error metric overall than the luminosity-based error metric. For the luminosity metric the range of the means was δL = 5.65687 − 5.31108 = .34579, whereas the range of the means of the Haar-based metric was δH = 4.04175 − 3.61884 = .42291. This indicates that changes in similarity are more easily perceived by the Haar-based metric. Normalizing these values over their associated maximums makes this separation clearer: δ̂L = δL/5.65687 = 6.213 × 10⁻² and δ̂H = δH/4.04175 = 1.046 × 10⁻¹.
4.3 Interactivity
The goal of this paper was to find an optimal balance between accuracy and interactivity in generating visualizations of large point-cloud datasets. The accuracy of impostors for level-of-detail has been established in the preceding sections. The average framerate over 90 frames with all 10,000,000 points in frustum was captured for each of the experiments in section 4.1. Both update policies averaged over 104 fps with 1 update per frame, over 54 fps with 2 updates per frame, and over 27 fps with 4 updates per
Fig. 2. The Haar Wavelet based metric (top 4 lines) and Luminosity metric (bottom 4 lines) plots for each frame for 1, 2, 4, and 64 mesh updates per frame using the LRU policy
Fig. 3. The Haar Wavelet based metric (top 4 lines) and Luminosity metric (bottom 4 lines) plots for each frame for 1, 2, 4, and 64 mesh updates per frame using the Weighted policy

Table 1. Left) Means and standard deviations for LRU policy for 1, 2, 4 and 64 updates per frame. Right) Means and standard deviations for weighted policy for 1, 2, 4, and 64 updates per frame.

                    LRU policy                                Weighted policy
Updates per frame   1        2        4        64             1        2        4        64
Haar Mean           4.04175  3.80513  3.71144  3.61884        3.97095  3.77042  3.67629  3.62033
Haar σ              2.57608  2.4126   2.33173  2.2758         2.5266   2.39203  2.31943  2.27583
Luminosity Mean     5.65687  5.46759  5.36976  5.31108        5.60678  5.4415   5.36018  5.31565
Luminosity σ        3.55703  3.44508  3.37185  3.33513        3.534    3.42804  3.36565  3.33098
Table 2. Average Framerates for Each Test

                  1        2         4         64
LRU Policy        104.163  54.446    27.000    2.000
Weighted Policy   107.587  55.76087  28.15217  2.000
frame. For specific framerates, refer to Table 2. Since the error metrics were only used for evaluation after the fact, and were not part of the update policy itself (although they may be used in future policies), the framerate is the same no matter which error metric is used. As a reminder, the fully rendered scene with 10,000,000 LIDAR points achieved 4 frames per second, so this is a substantial increase in interactivity. Updating 64 meshes per frame yielded 2 frames per second, but it is intuitive that the 64-updates-per-frame tests, in a scene with 64 meshes, would be slower than the fully rendered scene, because by updating all the meshes, all points in the scene are being rendered, in addition to whatever non-zero time it takes to record and render the impostors.
5 Conclusion and Future Work
This paper has demonstrated how impostors [4] may be used to increase the interactivity, and therefore, usability of large-scale LIDAR visualizations. We have used multiple error metrics and applied multiple update policies to analyze their effectiveness. Based on the values from section 4.2, we conclude that our Haar-based image metric [15] is better suited for the analysis of correctness than a strictly luminosity based error metric [14]. We have also shown that there are measurable improvements when using the view-dependent size information combined with time since last update, instead of an update policy based strictly on time. In the future, we would like to test other error metrics as well as other view-dependent update policies, such as the change in viewing angle since last update. We would also like to examine the effectiveness of using quadrilateral billboarding [19] in addition to impostors. We also suspect that view-dependent texture counter-rotation and perspective warping fields could be taken into account to increase the useful lifetime of impostors. In the future we hope to perform a qualitative analysis by letting human subjects evaluate which of the techniques create an output that appears most similar to them. While human involvement is wonderful for evaluating correctness when human perception is being measured, it cannot be used as part of the input to our update policies, so effective quantitative metrics are still vital.
References
1. Campbell, J., de Haag, M., van Graas, F., Young, S.: Light detection and ranging-based terrain navigation-a concept exploration. In: Proceedings from the, Citeseer (2003)
2. Schaufler, G.: Dynamically generated impostors. In: Proc. GI Workshop on Modeling, Virtual Worlds, and Distributed Graphics, pp. 129–136 (1995)
3. Shade, J., Lischinski, D., Salesin, D.H., DeRose, T., Snyder, J.: Hierarchical image caching for accelerated walkthroughs of complex environments. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1996, pp. 75–82. ACM, New York (1996)
4. Maciel, P., Shirley, P.: Visual navigation of large environments using textured clusters. In: Proceedings of the 1995 Symposium on Interactive 3D Graphics, pp. 95–102. ACM, New York (1995)
5. Bowen, Z., Waltermire, R.: Evaluation of Light Detection and Ranging (LIDAR) for Measuring River Corridor Topography. JAWRA Journal of the American Water Resources Association 38, 33–41 (2002)
6. Hasager, C., Peña, A., Mikkelsen, T., Courtney, M., Antoniou, I., Gryning, S., Hansen, P., Sørensen, P.: 12MW Horns Rev experiment. Technical report, Risø National Laboratory (2007)
7. Hewett, J.: Optics.org (2008), http://optics.org/article/34878
8. McKinion, J., Willers, J., Jenkins, J.: Spatial analyses to evaluate multi-crop yield stability for a field. Computers and Electronics in Agriculture 70, 187–198 (2010)
9. Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., Fong, P., Gale, J., Halpenny, M., Hoffmann, G., et al.: Stanley: The robot that won the DARPA Grand Challenge. The 2005 DARPA Grand Challenge, 1–43 (2007)
10. Lawlor, O.: Impostors for Parallel Interactive Computer Graphics. PhD thesis, University of Illinois (2005)
11. Risser, E.: True imposters. In: ACM SIGGRAPH 2006 Research Posters, SIGGRAPH 2006. ACM, New York (2006)
12. Kaneko, T., Takahei, T., Inami, M., Kawakami, N., Yanagida, Y., Maeda, T., Tachi, S.: Detailed shape representation with parallax mapping. In: Proceedings of ICAT, Citeseer, pp. 205–208 (2001)
13. Tatarchuk, N.: Dynamic parallax occlusion mapping with approximate soft shadows. In: Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, pp. 63–69. ACM, New York (2006)
14. Recommendation ITU-R BT.601-7. Technical report, International Telecommunication Union (2011)
15. Haar, A.: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69, 331–371 (1910)
16. Lai, Y., Kuo, C.: A Haar wavelet approach to compressed image quality measurement. Journal of Visual Communication and Image Representation 11, 17–40 (2000)
17. Shreiner, D., Woo, M., Neider, J., Davis, T.: OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 2.1, 6th edn. Addison-Wesley Publishing Co., Inc., Boston (2008)
18. Funkhouser, T.: Adaptive display algorithm for interactive frame rates during visualization of complex virtual environments. In: Computer Graphics (Proc. SIGGRAPH 1993), pp. 247–254 (1993)
19. McReynolds, T., Blythe, D.: Advanced Graphics Programming Using OpenGL, 1st edn., pp. 257–261. Morgan Kaufmann Publishers, San Francisco (2005)
UI Generation for Data Visualisation in Heterogenous Environment
Miroslav Macik, Martin Klima, and Pavel Slavik
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Graphics and Interaction
[email protected]
Abstract. The process of data visualisation can be very complex, and an urgent need for interactive control of this process is imminent. The solution to this problem lies in the field of user interfaces, by means of which the user can efficiently control all aspects of the visualisation. One of the fields where many challenges for the development of new user interfaces exist is data visualisation in a heterogeneous environment. In such a case, data are visualised on various devices that have different capabilities. This fact influences not only the results of the visualisation process but also the design and implementation of the user interface for each particular device. In a situation where an application is capable of running on various devices, it is rather problematic to manually create individual user interfaces, one for each device. In this paper, a concept and results of automatic generation of user interfaces in a heterogeneous environment are described.
1 Introduction and Motivation
Data visualisation is traditionally performed either on large (e.g. mainframe) computers or on PCs. With emerging technologies like mobile computing, visualisation is performed on platforms that were formerly not in the focus of researchers. The new devices allow, on one hand, visualisation of data in formerly unimaginable environments – a good example is visualisation of medical data on mobile devices. On the other hand, these new devices represent a sort of challenge, as they offer non-traditional interaction methods in comparison with traditional approaches to visualisation. We assume that data visualisation is a heavily interactive process that must take into account not only the explicit actions performed by the user by means of keyboard, mouse or touch screen interaction, but also implicit conditions like the environment, the user's goals, etc. User interface (UI) design for a single device is relatively simple – we know the capabilities of the device. The UI design and implementation is (usually) done manually, tailored to the given device. The situation is different when we assume multiple devices that differ in their capabilities. In such situations it becomes extremely inefficient to manually tailor the UI design and visualisation process for each of them.
In the case of UI design the situation is more complicated. More aspects than just device properties must be considered, at least the properties of the user himself and the context in which the interaction takes place. Properties of a user can be of various kinds (cognitive and perceptual capabilities, various disabilities, preferences, etc.). The context is influenced by the environment where the device is located (direct sun, noise, etc.). It is obvious that a relatively large number of UIs could be created in order to satisfy various combinations of requirements (device × user × context). In general, this large number of UIs cannot be created manually. The solution is automatic generation of UIs.
1.1 Use Case
We will demonstrate the automatic UI generation on a simple use case focused on visualisation of routes in a hospital. Way-finding is a problem that is being solved in many hospitals around the world. One study of a major tertiary care hospital calculated the annual cost of way-finding at $220,000, mainly due to the time spent direction-giving (more than 4,500 staff hours) by people other than information staff, according to a report from the Robert Wood Johnson Foundation [1]. The goal of our research is to create a flexible navigation platform that provides a patient with the necessary information plus some information that can provide him/her with some kind of comfort – e.g. how many people are ahead of him at the department (whether the patient should hurry or not, etc.). The platform should support a large number of various devices including mobile devices (smart phones), a general-purpose PC-based navigation kiosk and a special-purpose terminal in hospitals. While the general-purpose kiosk can display complete information including a 2D map, the special-purpose terminal can only show an arrow directing to the corridor the user should go to, one line of information text and an emergency button (call for help). As hospital visitors are of various ages (and often people with special needs), it is necessary to take into account their individual properties like poor sight or hearing, motor impairments, etc. The result of the visualisation process is a picture that should be displayed with the highest possible quality on a given device. Interaction with the visualisation process should be adapted to the user, device and environment, as should the rest of the application UI.
1.2 Goals
The main contribution of this paper lies in the method of automatic generation of graphical user interfaces for applications that use visualisation. Our main concern is related to the interaction side of the visualisation process in the heterogeneous environment. We can distinguish two main forms of interaction that should be addressed by the platform:
– Interaction with the visualisation process – data filtration, manipulation of the image like zooming, rotation, etc.
– Interaction that brings the dialogue process into the next state – asking for the next information, asking for immediate help, etc.
Both these aspects should be considered when generating a proper user interface for a particular device, user and environment context.
2 Previous Work
The most successful approaches for automatic user interface generation are based on some kind of formal abstract model [2]. We focus on the most promising and sophisticated approaches using abstract models [2] for specification of the application (or its user interface) as an input, like Supple [3], Uniform [4] or iCrafter [5]; more details can be found in [6]. ICrafter [5] is a framework for services and their respective UIs in a class of ubiquitous computing [7] environments. Unlike other systems, ICrafter uses specific UI generators that are usually manually specified. Within ICrafter, UI generators are routines that generate a UI for a particular service or set of services. There are specific UI generators for particular services and user interface languages (target platforms). ICrafter brings the idea of so-called service patterns, which are recognised in the available services. ICrafter then generates user interfaces for these generic patterns. On one hand, this leads to better consistency and easier service aggregation. On the other hand, unique functionality is not available in the aggregated service. Another contribution is the involvement of a template system in the process of UI generation. Uniform [4] is a user interface generator that strives to take into account the consistency of generated user interfaces. The algorithm used identifies similarities between currently generated user interfaces and formerly generated ones. Although this is a good idea, explicit mapping between input and output is usually used instead of more sophisticated combinatoric optimisation (see [3]). This solution is limited to the Personal Universal Controller project platform, which uses a proprietary appliance-oriented language. Supple [3] is based on a functional description of a user interface and takes into account both device and user properties. The functional representation of a UI says which functionality should be exposed to the user, rather than how (this structure can be called an abstract user interface, see further in the text). The UI generation process is an optimisation problem where the algorithm is trying to minimise the estimated overall user effort. User behaviour is traced in order to adapt the generation of new user interfaces to the user's properties and needs (recognised from prior use or using special motor and cognitive tests). Supple is probably the most sophisticated approach to automatic user interface generation yet; there are other promising projects (more details in [6]), but none of them is widely used for commercial purposes yet. In the state of the art of existing solutions we can see the following problems: Firstly, the input models are usually very complex; in combination with the lack of corresponding
design tools, this decreases their practical usability. Many solutions are also limited to a specific environment or output platform (e.g. Uniform). The approaches discussed also do not provide generic support for internationalisation and localisation. None of the approaches discussed considers the importance of individual UI elements that can be defined in the structure of an abstract user interface. Last, but not least, there is no system that combines generic data visualisation and context-sensitive generation of the corresponding UI.
3 Our Solution
A general solution for context-aware generation of user interfaces for data visualisation should address specific properties of different applications, users and user interface platforms. The immediate input to the process of UI generation is usually an abstract user interface (AUI). Basically, an AUI describes what should be presented to the user (e.g. five pieces of text information and two action triggers) instead of how (text information can be mapped to a label or pronounced by a text reader). Systems based on other models usually use an AUI as an explicit or implicit internal representation. Furthermore, generic approaches always use some kind of user model. User models are used for describing properties, habits and needs of particular users. Another model used is a device model, which describes properties of the target user interface device used for the interaction. Together with other models, like a model of the environment, these create the overall context model for the application. In this chapter we describe the integration of a modified generator of data visualisation into a generic platform for client-server applications. This platform supports automatic generation of user interfaces. Firstly, the main features of the platform will be described. Secondly, the most important models, namely the user and device models and their relation, will be described. These models are used for both automatic user interface generation and data visualisation. It is not the aim of this paper to describe the visualisation process in detail; however, the input and output of the visualisation pipeline are defined (raw data → picture), and the main focus is on the platform itself and the added value it provides for a generic visualisation generator. Figure 1 shows the propagation of information through the platform. The application domain is defined using a specific workflow notation – Torsion Workflow Notation [8]. The runtime system is called Torsion Workflow System. This system considers the context, which in our case consists of models of users, target devices and the environment. Using this system, we can produce AUIs that are input to the next component of the platform – UiGE (User interface GEnerator). An AUI (details in [6]) is a hierarchical structure that represents information and action triggers that should be presented to the user. The specific representations of the AUI for a particular device (concrete user interfaces – CUIs) are later generated by the UiGE, which uses the same contextual information about the target user, device and environment. CUIs are an input to the UIP server, which in turn provides various UIP clients with CUIs and the corresponding data.
Fig. 1. Information propagation through the platform
The UIP clients are thin clients by design – the application runtime logic is provided by the UIP server. However, complex actions can be taken by components higher in the pipeline (e.g. the Torsion Workflow System). Each thin client can be assigned to a specific device where the UI should be used. The Visualisation Engine is connected directly to the UIP Server. It provides a visualisation reflecting the current context and restrictions (e.g. dimensions of the area supposed to be used for the visualisation output). Additionally, the visualisation can be interactive (e.g. a map – scale, translation, level of detail, 2D, 3D). The type of visualisation determines the user interface elements that are used for interactive manipulation of the visualised data. Therefore, in our case, the Visualisation Engine provides the platform, apart from the image data, with a corresponding AUI definition for interaction with the visualisation process.
3.1 UIP
UIP [9] is a platform for client-server applications including heterogeneous client platforms (e.g. desktop, smartphone and TV), supporting ubiquitous computing [7]. UIP supports both communication and platform-independent user interface description. It has an XML syntax, but it also has a binary version. By applying MVC [10], a clear separation of presentation, model and application logic (which resides on the server by design) is ensured. Although the UIP clients are thin clients by design, UIP supports animations, precise definition of elements, etc. It is out of the scope of this paper to describe UIP in detail. One of its most important features is its descriptive power. It can describe the structure of both AUIs and CUIs, data models and events generated by particular components of the platform. As a result, UIP is used for communication among all components of the platform.
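As an illustration only (the actual UIP schema is not reproduced here), an AUI can be pictured as a small tree of platform-independent elements:

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form of an abstract user interface (AUI): each node
// says *what* must be presented (information, action trigger, container),
// not *how* it will be rendered on a concrete device.
struct AuiElement {
    enum class Kind { Container, Information, ActionTrigger } kind;
    std::string id;                    // stable identifier used by mappings
    int importance;                    // 0 = may be omitted, higher = more important
    std::vector<AuiElement> children;  // hierarchical structure
};
```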
3.2 Modelling Users and Devices
Complex systems reflecting properties of different display devices and needs or abilities of various users require a well-designed user and device model.
Fig. 2. Example of font-size value computation upon user and device model
In our case, the user and device models are separated as shown in Figure 2. The device model contains absolute values measured for a typical user (in the following example the optimal font size is 24 pixels; this value is platform-dependent). Properties in the device model have concrete absolute values that fit the needs of an average healthy user of a particular device platform. In contrast, the user model contains relative values (in the following example a font-size ratio of 1.5). Properties in the user model have relative values and express the user's deviation from the average in a particular aspect. The resulting value is computed as a product of the corresponding values in the device and user models (in the following example the result is 36 pixels for the particular user and device). In this way we can model the user and device independently.
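The combination rule in Figure 2 is a plain product of the device's absolute value and the user's relative coefficient; a minimal sketch (the names are ours):

```cpp
#include <map>
#include <string>

// Device model: absolute values tuned for an average user of the platform
// (e.g. a font size of 24 px). User model: relative coefficients expressing
// the deviation from the average (e.g. 1.5 for a user with poor sight).
double resolveProperty(const std::map<std::string, double>& deviceModel,
                       const std::map<std::string, double>& userModel,
                       const std::string& property) {
    double absolute = deviceModel.at(property);           // e.g. 24 px
    auto it = userModel.find(property);
    double relative = (it != userModel.end()) ? it->second : 1.0;
    return absolute * relative;                           // e.g. 24 * 1.5 = 36 px
}
```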
3.3 Generating User Interfaces
The input to the process of automatic CUI generation is an AUI. The generation process consists of multiple stages that incorporate the overall context. The general aim is to minimise the estimated user effort needed when interacting with the resulting UI in a particular context. The relationship between AUI elements and the parts of a CUI to be rendered on a particular UI device is specified by so-called explicit mappings – specifications of which CUI elements can represent a particular AUI element. The user's effort estimation is done using a cost function. Each explicit mapping provides a particular value of the cost function, which is later used in the process of UI optimisation. The UiGE contains a set of all available mappings. Using the context, this set is reduced to a set of feasible mappings. The aim of the following optimisation process is to find a combination of mappings meeting the optimisation criterion of the lowest estimated user effort, i.e., the lowest value of the overall cost function. The minimum of user effort must be found for the whole generated user interface – a global minimum of the cost function must be identified. It can be proved that this is a computationally hard problem, but for most cases it can be resolved in acceptable time (more details in [3]). As depicted in Figure 3, AUIs, the main input for the UiGE, are provided either by the Torsion Workflow Server or by the Visualisation Engine. The Torsion Workflow Server provides the UiGE with AUIs for the complete application and also sends the necessary data. In contrast, the Visualisation Engine provides image data and an AUI bound to a particular data visualisation.
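A greatly simplified sketch of the mapping selection is shown below; the real UiGE optimisation is global over the whole interface and takes importance levels into account, whereas here each element greedily picks its cheapest feasible mapping, purely as an illustration of the cost-function idea (all names are ours):

```cpp
#include <limits>
#include <string>
#include <vector>

// One candidate mapping of an abstract element onto a concrete widget, with the
// estimated user effort ("cost") of that choice in the current context.
struct Mapping {
    std::string auiElementId;
    std::string cuiWidget;       // e.g. "button", "list", "slider"
    double cost;                 // estimated user effort
    bool feasible;               // false if the context rules this mapping out
};

// Pick the cheapest feasible mapping per element (greedy illustration only).
std::vector<Mapping> chooseMappings(const std::vector<std::string>& elementIds,
                                    const std::vector<Mapping>& candidates) {
    std::vector<Mapping> chosen;
    for (const auto& id : elementIds) {
        Mapping best{};
        best.cost = std::numeric_limits<double>::infinity();
        for (const auto& m : candidates)
            if (m.feasible && m.auiElementId == id && m.cost < best.cost)
                best = m;
        if (best.cost < std::numeric_limits<double>::infinity())
            chosen.push_back(best);      // element can be represented
        // otherwise the element is omitted; an omitted critical element would
        // make the whole generation a failure.
    }
    return chosen;
}
```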
Fig. 3. Architecture of the platform
In some cases it is necessary to include complex, man-made structures, usually when it is necessary to strictly specify part of the user interface (e.g. a log-in dialog), or when the UI must be mapped to particular hardware (e.g. simple terminals, see below). Our platform supports this by implementing a special case of mapping – template mapping. For this type of mapping the target structure is more complex (e.g. a log-in dialog) and corresponds to multiple elements in an AUI. For some applications it is necessary to optimise the consistency of generated user interfaces. For example, structures the user is familiar with can be prioritised. This is possible by using virtual mapping – another type of mapping implemented by our platform. Virtual mapping affects the optimisation process by affecting the value of the cost function in order to penalise unwanted structures and prioritise structures convenient in the particular situation. Moreover, metadata information about the importance of particular AUI elements is used during the optimisation process by affecting the optimisation function. Information about the importance of particular AUI elements is provided by the Torsion Workflow Server, as shown in Figure 3. The most important elements should be rendered using the best available representation – the best available CUI element – and be easily accessible by the user – element position in the CUI. Elements of the AUI that cannot be omitted have the highest value of importance, i.e., critical importance. The process of user interface generation can result in three different outcomes:
– success – all AUI elements with non-zero importance are represented in the resulting CUI,
– partial success – all AUI elements with critical importance (elements that cannot be omitted) are represented in the resulting CUI,
– failure – it is impossible to find a solution for representation of all AUI elements with critical importance.
If the UI generation is at least partially successful, the generated UI is transferred to the target user interface platform using UIP and rendered.
Fig. 4. Modified conventional visualisation pipeline [11] and the inputs at particular stages (raw data → 1. data analysis → prepared data → 2. filtering → focus data → 3. mapping → geometric data → 4. rendering → image data; the stages are driven by the application context, e.g. the current position, user input, e.g. the current zoom level, the overall context (user, device, environment), and the AUI)
3.4 Visualisation and User Interfaces
In our case the UI should support both interaction with the application on a general level and interaction with the image that results from the visualisation. Interaction with the image is provided by a specific part of the UI, generated using the standard process of deriving a CUI from an AUI. This means that the part of the UI responsible for interaction with an image is generated in cooperation with the visualisation engine. Our platform relies on a conventional visualisation pipeline, as shown in Figure 4, which is provided with the data to be visualised and with context information. The typical visualisation output is an image, but the representation of the data can vary: it may be a 2D visualisation of a topology, a map, or a view into a 3D visualisation of a computed-tomography scan. Obviously, the user interface for manipulating different types of visualisation should differ as well. In order to preserve the universality of our approach, we integrated the visualisation engine with the interaction by providing the UI for interacting with the visualisation process in a context-independent abstract form (AUI). This partial AUI is integrated back into the AUI of the complete application and later transformed into a CUI by the UiGE with respect to the context. This approach, on the one hand, enables any type of visualisation with various control user interfaces; on the other hand, it makes it possible to adapt the user interface for interacting with the visualised data to the context (device, user, environment) in the same way as the user interface of the complete application.
4 Results
In this section we present the results of our pilot implementation. A scenario of navigation in a large hospital was used as the demonstration case. The state diagram of interaction with the application, depicted in Figure 5, is described in the following text. We used two different terminals mounted at hospital corridor junctions. The first type is a complex navigation kiosk with a large multi-touch display. The second type is a very simple terminal that consists of
a simple one-line display, an emergency button and a set of arrows (made of LEDs) that can be lit up to mark the direction in which the user should go.
Fig. 5. State diagram of our demonstration case
The interaction consists of four individual steps (Figure 5: a – d); the UIs generated for the particular terminals correspond to these steps (Figures 6, 7: a – d). The first step (Figures 5 – 7: a) is the identification of the user. In the next step, the user is told in which direction he or she should proceed and is given additional textual information about the time needed to reach the destination (Figures 5 – 7: b). At any point, the user can call for help. When this "alarm" declaration is confirmed (Figures 5 – 7: c), the responsible staff are informed about the situation and about the current position of the user. A notification about the approaching help is provided to the user (Figures 5 – 7: d). In our demonstration case we generated UIs for two different users. The main differences between them are their quality of sight, motor abilities, preferred language and estimated cognitive capacity (information complexity coefficient). User 1 has lower quality of sight (about two times worse than average), lower cognitive capacity and a mild hand tremble (he has problems pointing at small areas on the touchscreen). The preferred language of User 1 is English. The sight of User 2 is slightly better than average, he has no cognitive or motor problems, and his preferred language is Czech. Figure 6 depicts screenshots of the UIs on the complex touchscreen terminal. For each step, the CUIs are automatically generated on demand at runtime from a single abstract model – the AUI. In the first step (Figure 6: a), the user is asked to identify himself. A default user model (including a universal internationalisation model) is used for the generation of this initial UI because a specific user model is not known yet. In the next step (Figure 6: b), the user is identified and a CUI for navigation is generated; the properties of both the user and the target device are reflected in the resulting CUI. The AUI of this step points to a visualisation of a floor map, therefore an external visualisation engine is used to generate the respective image data and a corresponding AUI for manipulating the created visualisation. This AUI is transformed into a CUI and integrated back into the whole CUI for the navigation step (in our case, buttons '+' and '–' to manipulate the scale of the floor map generated by the visualisation engine). Notice the differences between the screens generated for the respective users. Large fonts and large interactive elements are used for User 1. There is also less space remaining for the visualisation; regarding
this and the lower information complexity coefficient of User 1, only the necessary area of the map is shown. For User 1 the English version of the internationalisation model is used. User 2 has a higher cognitive capacity, perfect sight and no motor problems; these properties are reflected in the CUIs generated for User 2. The remaining two steps correspond to the case when the user declares that he or she needs immediate help. For both users a confirmation dialog is generated (Figure 6: c) and, if confirmed, a notification about the approaching help is displayed (Figure 6: d). Currently the available space is not fully used for User 2 (Figure 6: c – d). It is a subject of future work to extend the UiGE so that it re-scales the resulting UI to better fit the available space. Nevertheless, this simple example shows the flexibility of the developed approach. Figure 7 depicts screenshots of the UIs on an emulated simple terminal with limited capabilities – see above. As proposed, at least all critical elements of the input AUI must be mapped to available CUI elements (in our case arrows, a text label and a button). This requirement is satisfied in all steps except the help confirmation dialog (Figure 7: c). Since this step is not crucial for the application, it is skipped and the dialog (Figure 7: d) is displayed directly instead. For the simple terminal, the user model affects only the internationalisation model that is used.
Fig. 6. Screenshots of demonstration application walkthrough on touchscreen terminal. a) initial screen - identification request, b) navigation screen, c) help confirmation dialog, d) help notification.
Fig. 7. Screenshots of demonstration application walkthrough on simple terminal (emulated HW terminal). a) initial screen - identification request, b) navigation screen, c) help confirmation dialog (not rendered and not presented to the user – omitted, because interface generation fails for this step: it contains critical elements that cannot be rendered on this device), d) help notification.
5 Conclusions and Future Work
We described the design and a pilot implementation of a platform for automatic UI generation. We demonstrated its functionality on a simple application for navigation visualisation in a hospital. We generalised the approach by integrating a generic visualisation engine and described the seamless integration of an automatically generated UI with the special-purpose UI elements needed for specific interaction with the visualised data. The automatic UI generation uses combinatorial optimisation based on the estimated user effort needed to use the user interface. The user-effort-based optimisation is combined with consistency optimisation and the use of user interface design patterns. The automatic UI generation, as well as the other components introduced into the visualisation process, uses a context model describing the user, device and environment. The application, whose UI is being generated and whose data are visualised, is described using a specific notation, the Torsion Workflow Notation, which is designed to be maximally efficient from the designer's point of view. The functionality of the automatic UI generation and data visualisation has been demonstrated on a simple use case with satisfactory results. The immediate future work is a user study that verifies how the resulting presentation is perceived by users. The implementation of the UiGE should be extended by a module that re-scales the resulting CUI in order to fully use the available space. Acknowledgements. This research has been done within the project Automatically generated user interfaces in nomadic applications funded by grant no. SGS10/290/OHK3/3T/13 (FIS 10-802900).
References
1. LogicJunction: LogicJunction wayfinder (2011), www.medcitynews.com/2011/04/logicjunction-aims-to-make-hospital-navigation-easy/ (accessed 2011-05-11)
2. Goguen, J., Burstall, R.: Institutions: Abstract model theory for specification and programming. Journal of the ACM (JACM), 95–146 (1992)
3. Gajos, K., Weld, D., Wobbrock, J.: Automatically generating personalized user interfaces with Supple. Artificial Intelligence 174, 910–950 (2010)
4. Nichols, J., Myers, B., Rothrock, B.: Uniform: automatically generating consistent remote control user interfaces. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 611–620. ACM, New York (2006)
5. Ponnekanti, S., Lee, B., Fox, A., Hanrahan, P., Winograd, T.: ICrafter: A service framework for ubiquitous computing environments. In: Abowd, G.D., Brumitt, B., Shafer, S. (eds.) UbiComp 2001. LNCS, vol. 2201, pp. 56–75. Springer, Heidelberg (2001)
6. Macik, M.: User Interface Generator. Dissertation thesis proposal, Czech Technical University in Prague (2011), http://dcgi.felk.cvut.cz/en/members/macikmir/main
7. Bardram, J., Friday, A.: Ubiquitous computing systems. In: Ubiquitous Computing Fundamentals, pp. 37–94. CRC Press, Boca Raton (2010)
8. Slovacek, V.: Methods for efficient development of task-based applications. In: Human-Centred Software Engineering, pp. 206–213 (2010)
9. Slovacek, V., Macik, M., Klima, M.: Development framework for pervasive computing applications. In: ACM SIGACCESS Accessibility and Computing, pp. 17–29 (2009)
10. Krasner, G., Pope, S.: A description of the model-view-controller user interface paradigm in the Smalltalk-80 system. Journal of Object Oriented Programming 1, 26–49 (1988)
11. Chi, E.: A taxonomy of visualization techniques using the data state reference model. In: IEEE Symposium on Information Visualization, InfoVis 2000, pp. 69–75. IEEE, Los Alamitos (2000)
An Open-Source Medical Image Processing and Visualization Tool to Analyze Cardiac SPECT Images
Luis Roberto Pereira de Paula1, Carlos da Silva dos Santos2, Marco Antonio Gutierrez3, and Roberto Hirata Jr.1
1 Institute of Mathematics and Statistics, University of Sao Paulo, {luisrpp,hirata}@ime.usp.br
2 UFABC, [email protected]
3 Heart Institute, University of Sao Paulo Medical School, [email protected]
Abstract. Single Photon Emission Computed Tomography is a nuclear imaging technique based on measuring the spatial distribution of a radionuclide. One challenge here is the efficient presentation of information, since a single study can generate hundreds of image slices, whose individual examination would be too time consuming. In this paper, we present an open-source medical image processing and visualization tool to analyze cardiac images. The main features of the tool are: 1) an intuitive interface to select and to visualize any slice in different views from a series of spatial and temporal images; 2) a semi-automatic procedure to segment the left ventricle from other structures; 3) an implementation of the polar map visualization (Bull's eye diagram) that follows recommendations from the American Heart Association. The proposed tool was applied to simulated images generated by a mathematical phantom and to real images.
1 Introduction
Single Photon Emission Computed Tomography (SPECT) is a nuclear imaging technique based on measuring the spatial distribution of a radionuclide. In cardiology, SPECT imaging is widely used to assess myocardial perfusion and left ventricular function. Improved information about the time dependent motion of the myocardium is achieved when the SPECT acquisition is gated to the electrocardiogram (ECG) signal [1]. One challenge in the application of gated SPECT studies is the efficient presentation of information, since one single study can generate hundreds of image slices, whose individual examination would be too time consuming. To address this issue, Garcia et al. proposed the polar map [2], also called bull's eye display (Fig. 1a). The polar map is a representation of the 3D volume of the left ventricle (LV) as a 2D circular plate. Each point of the display, corresponding to a specific region of the myocardium, receives a
color according to normalized count values. The center of the polar map corresponds to the apical region of the ventricle (Fig. 1b). As we move from the center to the edge of the polar map, each ring in the display corresponds to a circular profile calculated from a certain number of successive short-axis slices. In the last two decades, the polar map became widely used in clinical practice. In 2002, the American Heart Association (AHA) issued a recommendation [3] in order to standardize the display of information from diverse modalities, including SPECT studies and the polar map in particular (Fig. 2). Despite the popularity of the polar map, there is a lack of freely available research tools implementing this type of visualization. In this work we present MIV (an acronym for Medical Image Visualization), a research tool to segment and to analyze SPECT images in cardiac studies. The motivations for MIV are: 1) to develop a semi-automatic image segmentation procedure to extract the LV from other structures in a series of gated SPECT images; 2) to deploy the first (to the best of our knowledge) open-source implementation of the polar map visualization to follow the AHA recommendations. We envision MIV as an integrated tool for the analysis of SPECT images, eventually incorporating other image processing and visualization techniques.
Fig. 1. (a) The polar maps (Bull’s eye). (b) Definition of planes for displaying tomographic images in cardiology. LV: left ventricle. RV: right ventricle. Adapted from [3].
Following this Introduction, in Section 2, we review some related work. In Section 3, we describe some details of the semi-automatic segmentation procedure implemented in MIV and the implementation of the polar map visualization according to the AHA standard. In Section 4, we present some architectural aspects of MIV and show the basic workflow for segmenting a volume and generating the polar map. Finally, in Section 5, we provide additional discussion and perspectives for the evolution of MIV.
2 Related Work
Fig. 2. Standard AHA recommendations for polar map display. The LV should be divided into 17 segments, whose names are shown in the figure. Adapted from [3].
The original polar map proposed by Garcia et al. [2] involved the extraction of maximal-count circumferential profiles from each short-axis image, using a
hybrid sampling scheme that considers the myocardium as cylindrical in the basal and medial regions and spherical at the apex. Later, Germano et al. [4] proposed a 3D sampling scheme, considering the cavity to be an ellipse. Experiments with this method have shown superiority in terms of diagnostic accuracy when compared with the previous method [5], a claim later questioned by Van Train and Garcia [6]. To the best of our knowledge, there are no freely available implementations of either method. Our present method is closer in spirit to the original method of Garcia et al. [2], since we relate each region (e.g. medial) to a set of short-axis slices, while in the method proposed by Germano et al. [4] the same slice might contribute to two different regions. An interesting method was proposed by Oliveira et al. [7] that uses registration with a model to define the different regions of the myocardium and then builds the polar map. Unfortunately, an implementation of this method was not available at the time of writing. Segmenting the LV is not a necessary step to create a polar map representation but, if one succeeds at this task, other useful information can be extracted. Indeed, some methods for automatic segmentation of cardiac SPECT images have been reported in the literature [8,9]. However, during our research, we have found some SPECT studies for which these methods proved to be inadequate. The problem happens mainly due to characteristics of the images that violate implicit assumptions of the reported methods: severe perfusion defects, and the presence of other anatomical structures located near the LV with intensity levels close to that of the myocardium. Such problems are not usual in SPECT studies; generally the average image intensity over the LV is considerably higher than the image intensity over the right ventricle, liver and other anatomical structures.
3 Methodology
For the kind of images we are dealing with, and also for good-quality ones, we devised a semi-automatic segmentation procedure based on the mathematical morphology segmentation paradigm [10]. This is a powerful way to find the edges of an object of interest in a map of the edges of the image (usually the morphological gradient [11]). It works by placing a marker (a small connected subset of pixels) inside the object to be segmented (in some cases, an external marker is also necessary). The marker can be provided by the user through a GUI (user-oriented segmentation), by a heuristic or supervised method based on machine learning (automatic segmentation), or by a combination of automatic and user-oriented segmentation (semi-automatic segmentation). In our case, we want to segment the image into two regions (not necessarily connected): LV and background. To use the paradigm properly we need an internal marker for the ventricle wall and two external markers, one for the cavity and the other for the rest of the background. The automatic part of our procedure consists of a method to find a set of markers that differentiate the LV from the rest of the image. This step is based on a heuristic that combines Otsu's threshold with the Hough transform to find the apex and base slice limits, and the Hough transform again to find the LV limits in the short-axis slices. The heuristic also uses the morphological skeleton to place the markers inside the LV. MIV also implements other heuristics and is built in a way that makes it easy to plug in other segmentation methods. In a subsequent step, the user may edit the provided markers in case the segmentation results are unsatisfactory. Details of this user interaction are given in Section 4; for now we concentrate on the automatic method for finding the markers. Once the markers are defined, we apply the morphological gradient to each short-axis slice and then the watershed from markers [10], as implemented by the Insight Toolkit [12] (see the sketch below), thus generating the segmentation of the LV. Throughout our approach, we assume that the heart data have already been re-oriented so the image planes are perpendicular to the long axis of the LV, as recommended in [3].
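As an illustration of how such a marker-based pipeline can be assembled from ITK components, consider the following minimal C++ sketch (ours, operating on a single 2D short-axis slice with a precomputed marker image; the actual MIV pipeline, its image types and its parameters may well differ).

#include "itkImage.h"
#include "itkImageFileReader.h"
#include "itkImageFileWriter.h"
#include "itkFlatStructuringElement.h"
#include "itkMorphologicalGradientImageFilter.h"
#include "itkMorphologicalWatershedFromMarkersImageFilter.h"

int main(int argc, char* argv[]) {
    // Usage: segment <slice.png> <markers.png> <labels.png>
    if (argc < 4) return 1;
    using SliceType = itk::Image<unsigned char, 2>;
    using LabelType = itk::Image<unsigned char, 2>;

    auto sliceReader = itk::ImageFileReader<SliceType>::New();
    sliceReader->SetFileName(argv[1]);
    auto markerReader = itk::ImageFileReader<LabelType>::New();
    markerReader->SetFileName(argv[2]);

    // Morphological gradient: the edge map on which the watershed is computed.
    using KernelType = itk::FlatStructuringElement<2>;
    KernelType::RadiusType radius;
    radius.Fill(1);
    KernelType kernel = KernelType::Ball(radius);
    using GradientFilterType =
        itk::MorphologicalGradientImageFilter<SliceType, SliceType, KernelType>;
    auto gradient = GradientFilterType::New();
    gradient->SetInput(sliceReader->GetOutput());
    gradient->SetKernel(kernel);

    // Watershed from markers: floods the gradient image starting from the
    // internal marker (LV wall) and the external markers (cavity, background).
    using WatershedFilterType =
        itk::MorphologicalWatershedFromMarkersImageFilter<SliceType, LabelType>;
    auto watershed = WatershedFilterType::New();
    watershed->SetInput(gradient->GetOutput());
    watershed->SetMarkerImage(markerReader->GetOutput());
    watershed->SetMarkWatershedLine(false);

    auto writer = itk::ImageFileWriter<LabelType>::New();
    writer->SetFileName(argv[3]);
    writer->SetInput(watershed->GetOutput());
    writer->Update();
    return 0;
}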
3.1 Polar Map Construction
The segmentation of the LV is the starting point for building the polar map in our approach. Since our implementation aims to follow the AHA recommendations, we first make a short summary of those. The recommendations target diverse image modalities (SPECT, echocardiography, magnetic resonance) and aim to make possible intra- and inter-modality comparisons, by standardization of orientation of the heart, angle selection for cardiac planes, names of cardiac planes, nomenclature and location for segments [3]. One concern of this standardization is avoiding excessive resolution in information display, since a display might imply more resolution than would be justified by the image acquisition and/or what would be required in clinical practice. Thus, the standard dictates that the polar map should be generated with 17 segments, divided into apex, apical, mid-cavity and basal regions (Fig. 2), while some clinical applications
used 20 segments. The apical cap, defined as segment 17, corresponds to the extreme of the muscle, where there is no longer a cavity. The standard mandates that the rest of the ventricle be divided into three equal thirds [3]. In MIV, we test for the existence of the cavity in the segmented image and attribute the slices where the cavity is not present to the apex. The rest of the slices are distributed equally between the apical, mid-cavity and basal regions (see the sketch below). Alternatively, the user can establish a local protocol, refusing this suggestion from the software and performing his or her own division of the slices between regions. Once the slices are attributed to each region, they are further divided, giving rise to the 17 segments mandated by the standard and illustrated in Fig. 2. Currently, MIV implements only the maximum polar map, where each segment reflects the maximum count value found in the respective region of the myocardium, but the implementation is flexible enough that other types of mappings could easily be added.
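The slice-to-region division and the maximum polar map values can be sketched as follows (our C++ illustration with hypothetical data structures, assuming the short-axis slices are ordered from apex to base; the actual MIV code may differ).

#include <algorithm>
#include <vector>

// Hypothetical per-slice summary after segmentation: whether the LV cavity is
// present in the slice, and the maximum count value in each angular sector.
struct SliceSummary {
    bool cavityPresent;
    std::vector<double> sectorMax;
};

// Boundaries of the apex, apical, mid-cavity and basal regions (half-open ranges).
struct RegionRanges { int apexEnd, apicalEnd, midEnd, basalEnd; };

// Slices without a cavity are attributed to the apex; the remaining slices are
// distributed equally between the apical, mid-cavity and basal regions.
RegionRanges divideSlices(const std::vector<SliceSummary>& slices) {
    int apexEnd = 0;
    while (apexEnd < (int)slices.size() && !slices[apexEnd].cavityPresent) ++apexEnd;
    int third = ((int)slices.size() - apexEnd) / 3;
    return {apexEnd, apexEnd + third, apexEnd + 2 * third, (int)slices.size()};
}

// Maximum polar map value of one segment: the largest count found in the given
// angular sector over all slices belonging to the region.
double segmentMax(const std::vector<SliceSummary>& slices,
                  int firstSlice, int lastSlice, int sector) {
    double value = 0.0;
    for (int s = firstSlice; s < lastSlice; ++s)
        value = std::max(value, slices[s].sectorMax[sector]);
    return value;
}

int main() {
    SliceSummary proto{true, std::vector<double>(6, 0.0)};
    std::vector<SliceSummary> slices(15, proto);
    slices[0].cavityPresent = false;                  // apical cap slice
    RegionRanges r = divideSlices(slices);
    double basalSector0 = segmentMax(slices, r.midEnd, r.basalEnd, 0);
    (void)basalSector0;
    return 0;
}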
4 Software Description
In this section, we discuss some aspects of the software and describe the basic workflow for using it. MIV is open-source software written in the C++ language; it can be downloaded from https://github.com/luisrpp/miv. The software was designed to work on different operating systems, including Linux and other Unix variants, Windows and Mac OS. It uses a set of open-source libraries, such as the Insight Toolkit [13] for image processing and input/output operations, the Visualization Toolkit [14] for image visualization and the Qt library for the design of the user interface. Figure 3 presents the workflow of the application to generate a polar map.
Fig. 3. MIV’s workflow to generate the polar map (Bull’s Eye diagram)
Fig. 4. Myocardial image in different views in the MIV graphical interface
MIV can read 4D images in the Analyze and DICOM formats and writes images in the Analyze format. After loading an image, the user can navigate through frames and slices using navigation bars. MIV uses the Axial, Coronal and Sagittal axis nomenclature, which is more general than the cardiology-specific nomenclature shown
in Fig. 1, but it should be understood that there is the following correspondence between the terms: Axial–Short Axis, Coronal–Horizontal Long Axis, Sagittal–Vertical Long Axis. Figure 4 shows a myocardial image in different views. Once the image is loaded into the system, the user can apply the segmentation method described before to isolate the LV. Before that, the locations of the apex and base regions must be found for each frame in order to select the axial slices corresponding to the LV. These parameters are generated automatically, but the user can reassign slices to either the apex or base regions through the graphical interface. At this point the user can choose between two different approaches for automatic marker generation. In case the automatically generated markers are not considered adequate, the user can edit any marker image individually with the Marker Editor (Fig. 5). In addition, the user can add an axial slice to the segmentation by drawing a new marker image for it. Conversely, an axial slice can be removed from the segmentation by erasing its marker. This allows for the correction of errors made in the automatic location of the myocardium. At any time during marker editing the user can check the result by clicking on the preview segmentation button. To reduce the amount of interaction needed in marker editing, the user can also copy the marker from an adjacent slice into the current one. After generating markers for all axial slices of interest, the watershed transform can be applied. The results are displayed on the screen beside the axial view region (Fig. 6). The user can use the navigation bar to check all segmentation results and change a marker if necessary.
Fig. 5. MIV Marker Editor interface
Fig. 6. Segmentation result shown in the MIV interface
Fig. 7. Bull’s Eye diagram
Table 1. Performance tests

ID  Sl/Fr  Load  LVLim  AdjLim  MkSeg  AdjSeg  BEye  AvgTot  TS   Interv  StdInt
1   29     0.8   5.2    1.6     12.6   149     12.7  181.9   5.7  3.3     1.4
2   29     0.7   5.3    5.6     14.6   173     14.6  213.8   6.8  5.5     0.9
3   27     0.8   5.8    5.6     15.9   91.4    12.4  131.6   4.1  3.8     1.3
4   25     0.5   4.3    2.3     8.3    23.3    10.3  49.0    1.5  1.3     0.5
5   27     0.8   5.5    8.3     11.7   222     21.0  269.3   9.4  6.9     1.5
6   25     0.9   5.0    6.1     16.7   219     12.0  259.7   9.6  6.0     1.6
7   25     0.8   3.0    1.9     7.0    49      15.0  76.7    2.7  0.8     0.6
8   23     0.6   1.9    2.1     6.1    44.6    16.5  71.8    2.8  1.3     0.8
9   32     0.5   5.3    4.1     12.5   19.5    11.8  53.7    1.2  0.5     0.5
10  32     0.5   4.8    3.6     12.4   21.5    10.6  53.4    1.2  0.9     0.6
The polar map can only be generated after the image is segmented. The polar map construction is divided into two steps:
1. Axis and radius definition. The program suggests an axis that should pass through the center of the myocardium and a radius for a circle that fits the LV in all selected axial slices. Each frame has its own axis and radius that will guide the generation of the polar map. The user can visually check the suggestion against the axial slices and accept it, or override it by assigning a new axis with the mouse and a new radius using a text box.
2. Polar map generation. Finally, the Bull's Eye diagram is generated and shown in the interface (Fig. 7). Each frame will have its own polar map.
5 User Tests
The performance of the software has been tested using a real dataset of eight exams and also a simulated dataset of two exams generated by a software phantom. Table 1 shows the results for all runs. The exams are labeled 1 to 10 for reference (ID). Exams 1 to 6, 9 and 10 have 8 frames each; exams 7 and 8 have 15 frames each. Exams 9 and 10 form the simulated dataset. Column Sl/Fr gives the number of slices per frame for each exam. Column Load gives the time (in seconds) to load each frame. Columns LVLim and AdjLim give the time (in seconds) to compute the limits of the LV automatically and the mean time (over all frames of the exam) to adjust the limits in case the heuristic fails. Columns MkSeg and AdjSeg give the time (in seconds) to compute the markers and apply the segmentation paradigm automatically and the mean time (over all frames of the exam) to adjust the markers in case they are misplaced. Columns BEye, AvgTot and TS give the time (in seconds) to compute the Bull's Eye, the average total time per frame and the average time per slice. Finally, columns Interv and StdInt give the mean number of interventions per frame and its standard deviation. As one can notice, the exams that took longest to process (numbers 5 and 6) are the ones where the user had to intervene the most (almost 7 interventions per frame).
The usual time to analyze an exam, even when some intervention is necessary, is comparable to the time needed to check whether the automatic segmentation heuristic is correct. As mentioned before, an intervention to correct a misplaced marker is very fast, because one only needs to clear the part of the marker that is spoiling the segmentation and draw it in a better place, or copy the marker from an adjacent slice.
6 Conclusion
In this paper we presented an open-source medical image processing and visualization tool (MIV) to analyze cardiac SPECT images. We described some details of the semi-automatic segmentation method used to extract the left ventricle from other structures. The solution is based on the combination of the Hough transform and a morphological approach with different markers to select the structure of interest. The user can also adjust any automatic parameter selected by MIV. Although the MIV workflow tries to minimize the amount of user intervention necessary for the segmentation process, there are still cases where segmenting a whole study might be too time consuming. We are improving MIV to reduce the amount of user interaction by taking more advantage of the inherent redundancy between markers of adjacent slices. We tried to keep the MIV interface fully flexible, allowing the user to override any automatic suggestion made by the software, recognizing that no totally automatic procedure can be accurate in all cases. It should be noted that our procedure for finding markers could be adapted to work with other segmentation methods that require the definition of regions of interest, such as region-growing methods. To represent the information obtained from the image segmentation task in a synthetic form, MIV provides an implementation of the polar map visualization diagram; to our knowledge, there is no other freely available tool to represent segmented SPECT images as a polar map. The future steps of this work include an extensive evaluation of the proposed tool with simulated and real images. Until then, MIV must be considered a research utility; it is not yet mature enough for clinical use. Acknowledgments. The last author is partially supported by CNPq.
References
1. Gutierrez, M.A., Rebelo, M.S., Furuie, S.S., Meneghetti, J.C.: Automatic quantification of three-dimensional kinetic energy in gated myocardial perfusion single-photon-emission computerized tomography improved by a multiresolution technique. Journal of Electronic Imaging 12, 118–123 (2003)
2. Garcia, E., Train, K.V., Maddahi, J., Prigent, F., Friedman, J., Areeda, J., Waxman, A., Berman, D.: Quantification of rotational thallium-201 myocardial tomography. Journal of Nuclear Medicine 26, 17–26 (1985)
3. Cerqueira, M.D., Weissman, N.J., Dilsizian, V., Jacobs, A.K., Kaul, S., Laskey, W.K., Pennell, D.J., Rumberger, J.A., Ryan, T., Verani, M.S.: Standardized myocardial segmentation and nomenclature for tomographic imaging of the heart: A statement for healthcare professionals from the Cardiac Imaging Committee of the Council on Clinical Cardiology of the American Heart Association. Circulation 105, 539–542 (2002)
4. Germano, G., Kavanagh, P.B., Waechter, P., Areeda, J., Kriekinge, S.V., Sharir, T., Lewin, H.C., Berman, D.S.: A new algorithm for the quantitation of myocardial perfusion SPECT. I: Technical principles and reproducibility. Journal of Nuclear Medicine 41, 712–719 (2000)
5. Sharir, T., Germano, G., Waechter, P.B., Kavanagh, P.B., Areeda, J.S., Gerlach, J., Kang, X., Lewin, H.C., Berman, D.S.: A new algorithm for the quantitation of myocardial perfusion SPECT. II: Validation and diagnostic yield. Journal of Nuclear Medicine 41, 720–727 (2001)
6. Van Train, K.F., Garcia, E.V., Germano, G., Areeda, J.S., Berman, D.S.: New algorithm for quantification of myocardial perfusion SPECT. J. Nucl. Med. 42, 391–392 (2001)
7. Oliveira, L.F., Zanchet, B.A., Barros, R.C., Simões, M.V.: A new approach for creating polar maps of three-dimensional cardiac perfusion images. In: Gonçalves, L., Wu, S.T. (eds.) Proceedings, Porto Alegre, Sociedade Brasileira de Computação (2007)
8. Ezekiel, A., Train, K.V., Silagan, G., Maddahi, J., Garcia, E.V.: Automatic determination of quantification parameters from Tc-sestamibi myocardial tomograms. In: Computers in Cardiology, pp. 237–240 (1991)
9. Moro, C., Moura, L., Robilotta, C.C.: Improving the reliability of bull's eye method. In: Computers in Cardiology, pp. 485–487 (1994)
10. Beucher, S., Meyer, F.: The morphological approach to segmentation: The watershed transformation. In: Mathematical Morphology in Image Processing, pp. 433–481. Marcel Dekker, New York (1992)
11. Soille, P.: Morphological Image Analysis: Principles and Applications, 2nd edn. Springer, Heidelberg (2004)
12. Beare, R., Lehmann, G.: The watershed transform in ITK - discussion and new developments. The Insight Journal (2006), http://hdl.handle.net/1926/202
13. Yoo, T.S., Metaxas, D.N.: Open science – combining open data and open source software: Medical image analysis with the Insight Toolkit. Medical Image Analysis 9 (2005)
14. Schroeder, W.J., Avila, L.S., Hoffman, W.: Visualization with VTK: a tutorial. IEEE Computer Graphics and Applications 20, 20–27 (2000)
CollisionExplorer: A Tool for Visualizing Droplet Collisions in a Turbulent Flow
Rohith MV1, Hossein Parishani2, Orlando Ayala2, Lian-Ping Wang2, and Chandra Kambhamettu1
1 Video/Image Modeling and Synthesis (VIMS) Lab, Department of Computer and Information Sciences, http://vims.cis.udel.edu
2 Department of Mechanical Engineering, University of Delaware, Newark, DE, USA
Abstract. Direct numerical simulations (DNS) produce large quantities of result data. Though visualization systems are capable of parallelization and compression to handle this, rendering techniques that automatically illustrate a specific phenomenon hidden within larger simulation results are still nascent. In a turbulent flow system, flow properties are volumetric in nature and cannot be displayed in their entirety. Identifying sections of the field data that contain typical and atypical interactions offers a convenient way to analyze such data. In this paper, we propose methods to explore collision events in DNS studies of droplet collisions in a turbulent flow. Though a variety of geometric models of collisions exist to explain the collision rate, there are few tools available to explore the collisions that actually occur in a simulated system. To effectively understand the underlying processes that facilitate collisions, we observe that a global view of all the collisions with respect to certain chosen flow parameters is required, together with a detailed 3D rendering of the trajectory of a particular collision event. We use GPU-based rendering of isosurfaces and droplet trajectories to create such visualizations. The final tool is an interactive visualizer that lets the user rapidly peruse the various collision events in a given simulation and explore the variety of flow characteristics associated with them.
1 Introduction
Direct numerical simulations (DNS) are being used as a quantitative tool for exploring a variety of physical phenomena [6] with the advent of high-performance computing resources. By encoding elemental rules of interaction of various physical entities, complex and emergent phenomena can be explored. With increasing
This work was supported by the National Science Foundation through grants OCI-0904534 and ATM-0730766 and by the National Center for Atmospheric Research (NCAR). NCAR is sponsored by the National Science Foundation. Computing resources are provided by the National Center for Atmospheric Research through CISL-35751010, CISL-35751014 and CISL-35751015.
complexity and size of simulations, visualization forms an integral part of the process of scientific discovery [10]. Once simulation results are obtained for a particular set of parameters/conditions, scientists use statistics and visualizations of the output to form new hypotheses based on what they observe. The simulations may then be repeated with parameters/conditions that better illustrate the case for or against the hypotheses. Such an approach to discovery is being pursued with new vigor in the areas of fluid dynamics, particle physics and human behavioral studies. These studies have resulted not only in a series of exploratory and analytic visualization tools [5,11], but also in frameworks that use parallel computing to support visualization of large-scale datasets [13,8,9]. Visualization frameworks often offer the user a host of options to control the rendering of data, including filtering of data based on multiple attributes, transparency and color controls, and viewpoint and field-of-view controls, to effectively study a chosen phenomenon. These serve to demarcate the region of interest and make the required aspect of the data more perceivable to the human observer. In fact, there has been significant research into the design of non-photorealistic rendering techniques for this purpose. However, if the phenomenon to be observed exists in a small spatial and temporal interval, rendering techniques alone may not be able to highlight it effectively. Several query-driven approaches [7,5] have been proposed to extract relevant data as defined by the human observer, but these require manual specification of regions of interest. In this paper, we are concerned with observing characteristics of the flow that surround collision events between freely suspended droplets in a turbulent flow system. Given data from a dynamic spatio-temporal simulation with droplets of multiple radii, we attempt to answer the following questions:
* What is the distribution of collision events with respect to flow parameters?
* What are the flow structures that surround a given collision event?
There are several works that deal with the effect of turbulence on collisions through computer simulations. It has been shown that air turbulence can increase the geometric collision kernel by up to 47%, relative to geometric collision by differential sedimentation [1]. The collisions are studied to determine the effect of various flow and droplet parameters such as the flow Reynolds number, droplet radius, inertia and settling velocity. These studies help to quantify the collision rates when the flow parameters are changed. However, most of the estimated quantities, such as the collision kernel, are statistical in nature and do not allow for an intuitive mechanistic understanding of individual collision events. In order to better understand the underlying physical processes that lead to collisions, we observe that a global view of all the collisions with respect to certain chosen flow parameters is required, together with a detailed 3D rendering of the trajectories of particular colliding droplet pairs. In our tool – CollisionExplorer – we propose to display a global picture of all the collisions with respect to chosen flow attributes such as the local vorticity or the dissipation rate. In this picture the user can see the distribution of collision events and pick the event that he is interested in studying. Once a particular collision event is selected, we use a combination of 2D and 3D rendering techniques to display the features surrounding that
collision. We show that selective rendering of local structures around an event helps to provide a richer understanding of the collision mechanism. We discuss selected related work in Section 2, describe our approach and results in Section 3, and conclude in Section 4.
2 Previous Work
We group our discussion of previous work into two subsections: we begin with a review of visualization systems for large-scale data, and then briefly comment on some solutions specific to fluid dynamics visualizations.
2.1 Large-Scale Visualization Frameworks
ParaView [8], based on the Visualization Toolkit (VTK), offers a scalable solution to large-scale data visualization with support for distributed computing resources. Users can create application-specific visualizations by using predefined filters or custom scripts. Though it offers several predefined filters and a flexible framework for rendering, some programming is required to create visualizations that illustrate a specific phenomenon. VisIt [13] is a scalable visualization framework that offers analytic tools such as line-out and query. Line-out allows the user to plot the profile of volumetric data along a specified curve, whereas query is a text-driven tool to select regions of data using value interval specifications. These tools are excellent for studying data that varies on a structured surface or volume (pressure on a wing, temperature inside an engine), but they are of little help when studying dense volumetric data (such as vorticity structures in a turbulent flow). VAPOR [9] is a tool created specifically for studying atmospheric and solar data. Instead of text-based data selection, it offers graphical selection based on appearance histograms in direct volume visualization. Users can vary color and transparency maps to highlight required sections of the data, and planar probes can be introduced to better illustrate a localized phenomenon. However, the appearance parameters are spatially and temporally uniform, limiting the extent to which the required region can be highlighted for better perception. The selection of parameters can be done based on the histogram of data values or manually – there is no support for saliency-driven parameter selection. Also, since it deals almost exclusively with volumetric data, the system has limited support for particle trajectories and related computations.
2.2 Fluid Dynamics Visualizations
Isosurfaces of vorticity, streamlines of fluid velocity and direct volume rendering of velocity/vorticity maps are some of the methods commonly used to study turbulent flow structures. These techniques are used to study turbulence and other flow-related phenomena [3]. There are very few visualization works that deal with particles suspended in the fluid. To visualize the interaction of particles with the fluid, the particle positions are superposed on the vorticity/velocity structures. However, most of the visualization plots which capture
the relation between particle cluster positions and flow structures are one-dimensional plots. They show the probability of cluster formation with respect to the mean vorticity value in a region, or particle settling velocities with respect to flow parameters. Though they quantitatively capture the effect of the interaction, they fail to depict the interaction itself. Since most of these plots are based on statistics derived from the simulation output, it is difficult to back-track and identify the individual portions of data that illustrate these phenomena. A direct volume rendering of dense data such as vorticity/velocity values is not always convenient for identifying specific instances of desired behavior, as it is often buried under layers of data that have little relevance. A common method for handling such a scenario is to define a transfer function that maps the data values to colors and transparencies. By controlling the opacities associated with different ranges of data values, one can see through the irrelevant layers. Though useful for understanding flow structures that arise in various conditions, this method has three drawbacks: (i) the transfer function needs to be manually specified, so it may take several passes to get the parameters right for a particular phenomenon; (ii) the transfer function is almost always spatially and temporally uniform, limiting the ability to highlight specific regions; (iii) the mapping is between data values and colors/transparency, which makes it hard to select a transfer function based on the interaction between different attributes (e.g., the interaction of particle velocity with flow velocity). Hence, we use an adaptive isosurface rendering scheme to render isosurfaces that are relevant to the given interaction. There has been some recent work on visualizing particle/flow structure interactions in bronchial tubes [12]. The authors provide a visualization showing cross-flow velocities with respect to fluid velocity magnitude inside the tubes. Since their motivation is to study the distribution of particle depositions, the methods developed cannot be easily extended to cases of freely suspended particles.
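For concreteness, a one-dimensional piecewise-linear transfer function of the kind discussed here can be sketched as follows (our C++ illustration; the control points, colours and opacities are arbitrary examples and the points are assumed to be sorted by data value).

#include <array>
#include <vector>

// One control point of a 1D transfer function: a data value mapped to RGBA.
struct ControlPoint {
    double value;
    std::array<double, 4> rgba;  // red, green, blue, opacity, each in [0,1]
};

// Linear interpolation between the two control points that bracket v.
std::array<double, 4> evaluate(const std::vector<ControlPoint>& tf, double v) {
    if (v <= tf.front().value) return tf.front().rgba;
    if (v >= tf.back().value) return tf.back().rgba;
    for (std::size_t i = 1; i < tf.size(); ++i) {
        if (v <= tf[i].value) {
            double t = (v - tf[i - 1].value) / (tf[i].value - tf[i - 1].value);
            std::array<double, 4> out{};
            for (int c = 0; c < 4; ++c)
                out[c] = (1.0 - t) * tf[i - 1].rgba[c] + t * tf[i].rgba[c];
            return out;
        }
    }
    return tf.back().rgba;
}

int main() {
    // Example: make low vorticity fully transparent and strong vortices opaque red.
    std::vector<ControlPoint> tf = {
        {0.0, {0.0, 0.0, 1.0, 0.0}},
        {3.0, {1.0, 1.0, 0.0, 0.3}},
        {6.0, {1.0, 0.0, 0.0, 0.9}},
    };
    std::array<double, 4> rgba = evaluate(tf, 4.5);
    (void)rgba;
    return 0;
}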
3 Our Approach
As noted by Falkovich and Pumir [4], there are two contributions to collisions in turbulent systems – the interaction of the droplets with the local flow shear, and the non-local interaction with a distant flow structure due to fluid accelerations and particle inertia. At lower Reynolds numbers, the local interaction dominates. In this work, we aim to visualize the local interactions that lead to collision. To obtain an overall picture of the various collisions we introduce a global map, which provides information about the flow statistics encountered by colliding droplet pairs along their trajectories. Such a representation helps us study the correlation between the values encountered by the colliding pair. In this overall picture, which we call the global map, the user can scroll over the various collisions to pick one whose characteristics are relevant to the effect under study. Once this choice is made, the user can see 2D and 3D maps, which we call region maps, specific to the chosen collision. We use selective isosurface rendering to highlight the structures that may have interacted with the droplets during collision. However, isosurfaces
provide information about only a single value in the volume. To provide a dense representation of the structure around a collision, we also create a 2D map of the surrounding region by choosing the plane that encompasses most of the motion at collision. When this is visualized it provides us with a dense representation of the region. Before we explain the details of our approach we provide some information about the simulation and the data generated. Results of each method are provided in the corresponding subsections.
3.1 Simulation Details
The data used for visualization were obtained by direct numerical simulation of turbulent collisions conducted on a parallel computer (Bluefire) at NCAR. The motion of water droplets of various radii suspended in turbulent air flow was simulated. The flow simulations were performed by solving the Navier-Stokes equations in the spectral domain. The droplet motion is governed by gravity, inertia and drag force (see the sketch below). In our simulations, only the droplet behavior was affected by the fluid flow; the fluid flow was not influenced by the droplets. The simulations were conducted on a grid of 256 spatial points in each direction with periodic boundary conditions. Since the dissipative scales of the flow are fully resolved, the physical dimension of each side of the cube depends on the grid resolution. The generated data contained one million droplets, which did not interact with each other. The droplets were a bidisperse mixture of 20 and 40 micron droplets. The data were obtained from 3000 time steps of simulation, sampled every 15 steps. There were a total of seventeen thousand collisions recorded, with approximately 15,500 collisions between a droplet of 40 microns and another of 20 microns, and 1,500 collisions where both colliding droplets were 40 microns. Collisions among 20 micron droplets were fewer than fifty. The turbulence had reached a statistically stationary state after six eddy turnover times, before the flow and droplet data were collected. The details of the simulation can be found in [2].
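The following stand-alone C++ sketch (ours) only illustrates the type of droplet equation of motion described – inertia, a Stokes-like drag towards the local fluid velocity, and gravity. The fluid-velocity function, the response time and the step sizes are placeholders, and the actual DNS uses a coupled pseudo-spectral flow solver rather than this toy integrator.

#include <array>

using Vec3 = std::array<double, 3>;

// Placeholder for the fluid velocity interpolated to the droplet position.
Vec3 fluidVelocity(const Vec3& /*x*/, double /*t*/) { return {0.0, 0.0, 0.0}; }

// One explicit Euler step of  dv/dt = (u(x,t) - v)/tau + g,  dx/dt = v,
// where tau is the droplet's inertial response time.
void step(Vec3& x, Vec3& v, double tau, double dt, double t) {
    const Vec3 g = {0.0, 0.0, -9.81};
    Vec3 u = fluidVelocity(x, t);
    for (int i = 0; i < 3; ++i) {
        double a = (u[i] - v[i]) / tau + g[i];   // drag towards u plus gravity
        v[i] += dt * a;
        x[i] += dt * v[i];
    }
}

int main() {
    Vec3 x = {0.0, 0.0, 0.0}, v = {0.0, 0.0, 0.0};
    for (int n = 0; n < 3000; ++n) step(x, v, 0.01, 1e-4, n * 1e-4);
    return 0;
}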
3.2 Global Map
As noted earlier, the function of the global map is to visualize the overall distribution of a statistic up to the point of collision and also to show the correlation between the values encountered by a particular colliding pair. The data about a collision event contain a number of elements that may be abstracted away for the sake of statistical analysis. Since the simulation volume is considered homogeneous, the exact location of a collision is not important. Also, since the characteristics of the flow are nearly time invariant, the exact time step of the collision can be omitted. Instead we concentrate on the characteristics of the flow at the point of collision, for example the vorticity and dissipation rate values at the point of collision. Hence we need to plot these flow attributes along the droplet trajectory. Since the absolute location of the droplet cannot be used, we use the distance between the colliding pair as the parameter. If we plot the flow attributes versus the relative distance between the colliding pair, we thus ensure a space-time invariant representation. It must be noted that since the boundary
conditions are periodic (a droplet exiting the volume on one face will reappear on the opposite face), we must ensure that the distance between the pair of droplets is measured within a single volume (a sketch of such a computation is given below). For this, volume exits and entries are noted for every droplet and part of the trajectory is shifted to ensure that the trajectory is continuous. In the global map, if the horizontal axis represents the distance between the pair and the vertical axis represents a flow attribute – say the vorticity at the position of each droplet in a colliding pair – the pair of droplets would start out away from the vertical axis and then move left. At collision, the distance between them is zero, hence this point lies on the vertical axis. The vertical offset of this point indicates the flow attribute at the point of collision. Examples are shown in Figure 1 and Figure 2: vorticity and dissipation rate global maps for collisions between 40 and 20 micron droplets. It can be seen that an overall distribution of the flow attribute now emerges. For example, from Figure 1 we can infer that collisions are less likely in regions containing high vorticity values. The user can then select a pair whose characteristics need to be studied, and the region maps for that pair will be generated as described in the next section.
Fig. 1. Global map of collisions based on vorticity values. Horizontal axis is the relative distance between the colliding pair. Vertical axis is the vorticity value at droplet location.
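One simple way to keep the pair separation within a single periodic box is the minimum-image convention sketched below (our C++ illustration; the tool itself shifts trajectory segments at volume exits and entries, which achieves the same goal for whole trajectories).

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Separation of two droplets in a cubic domain of side L with periodic
// boundaries: each component is wrapped into [-L/2, L/2] so that the pair is
// measured within a single box (minimum-image convention).
Vec3 periodicSeparation(const Vec3& p1, const Vec3& p2, double L) {
    Vec3 d;
    for (int i = 0; i < 3; ++i) {
        d[i] = p1[i] - p2[i];
        d[i] -= L * std::round(d[i] / L);
    }
    return d;
}

double separationDistance(const Vec3& p1, const Vec3& p2, double L) {
    Vec3 d = periodicSeparation(p1, p2, L);
    return std::sqrt(d[0] * d[0] + d[1] * d[1] + d[2] * d[2]);
}

int main() {
    Vec3 a = {0.01, 0.5, 0.5}, b = {0.99, 0.5, 0.5};
    double r = separationDistance(a, b, 1.0);   // 0.02 across the boundary, not 0.98
    (void)r;
    return 0;
}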
3.3 Region Maps
We generate two kinds of region maps. The first is a 3D map showing the trajectory of the droplets in the simulation volume together with isosurfaces. This enables us to visualize the structures that may have interacted with the droplet at the time of
collision. However, since we cannot obtain an idea of the dense field of the flow attribute in this region from the isosurface alone, we project the trajectory onto a plane containing most of the trajectory. The values of the required attributes are obtained on this surface using interpolation. When this is visualized it provides us with a dense representation of the region.
Fig. 2. Global map of collisions based on dissipation rate values. Horizontal axis is the relative distance between the colliding pair. Vertical axis is the dissipation rate at droplet location.
Dense 2D region map. To obtain the dense 2D map, we have to choose a plane whose visualization contains the most information regarding the collision event. Consider the pair of droplets as in Figure 3: the droplets are located at positions P1 and P2 with instantaneous velocities V1 and V2. Since the plane must contain the local trajectory of the droplets, it should preferably contain both V1 and V2. Since the relative position of the droplets must not be distorted, it must also contain P1-P2. However, all these constraints cannot be satisfied by a single plane, hence we use a plane that approximately satisfies them. Its formulation may be obtained in closed form as a plane passing through the midpoint between the droplet positions and containing the vectors (V1+V2) and (P1-P2). In cases where this leads to a degenerate solution, the plane from the previous timestep is used. Figure 4 shows the calculated plane for a pair of trajectories. Note that it lies along the trajectories and provides the least distortion for their projections. Figure 5 shows the slice of vorticity data sampled along that plane. The particle positions are also indicated.
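The closed-form plane can be computed directly from the two positions and velocities, with the cross product of (V1+V2) and (P1-P2) as the plane normal; a near-zero normal signals the degenerate case in which the plane of the previous timestep should be reused. The following C++ sketch is our illustration of that construction, not the tool's actual code.

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 add(const Vec3& a, const Vec3& b) { return {a[0] + b[0], a[1] + b[1], a[2] + b[2]}; }
Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0] - b[0], a[1] - b[1], a[2] - b[2]}; }
Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]};
}
double norm(const Vec3& a) { return std::sqrt(a[0] * a[0] + a[1] * a[1] + a[2] * a[2]); }

struct Plane { Vec3 point; Vec3 normal; bool degenerate; };

// Plane through the midpoint of the droplet pair containing (V1+V2) and (P1-P2).
Plane projectionPlane(const Vec3& p1, const Vec3& v1, const Vec3& p2, const Vec3& v2) {
    Vec3 mid = {(p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0, (p1[2] + p2[2]) / 2.0};
    Vec3 n = cross(add(v1, v2), sub(p1, p2));
    double len = norm(n);
    if (len < 1e-12) return {mid, {0.0, 0.0, 1.0}, true};   // caller falls back to previous plane
    return {mid, {n[0] / len, n[1] / len, n[2] / len}, false};
}

// Orthogonal projection of a point onto the plane (used to draw droplet
// positions on the sampled 2D slice).
Vec3 projectOntoPlane(const Plane& pl, const Vec3& q) {
    Vec3 d = sub(q, pl.point);
    double dist = d[0] * pl.normal[0] + d[1] * pl.normal[1] + d[2] * pl.normal[2];
    return {q[0] - dist * pl.normal[0], q[1] - dist * pl.normal[1], q[2] - dist * pl.normal[2]};
}

int main() {
    Plane pl = projectionPlane({0.0, 0.0, 0.0}, {1.0, 0.0, 0.0},
                               {0.0, 1.0, 0.0}, {1.0, 0.1, 0.0});
    Vec3 q = projectOntoPlane(pl, {0.5, 0.5, 0.3});
    (void)q;
    return 0;
}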
Fig. 3. Two droplets at positions P1 and P2 with instantaneous velocities V1 and V2
Fig. 4. Droplet trajectories and the plane estimated for projection
Fig. 5. Droplet positions plotted on the vorticity data sampled from the estimated plane. Darker colors indicate lower vorticity. The red and blue markers correspond to the droplets seen in Fig. 4.
Fig. 6. Isosurface of vorticity (at a value of three times the average vorticity) with droplet trajectories
Fig. 7. Selected isosurfaces of vorticity at point of collision (at a value of three times the average vorticity) with droplet trajectories
Fig. 8. Time summary of droplet flow structure interaction. The flow structures around three points highlighted in white are shown using isosurfaces from the corresponding timesteps.
Selective isosurface rendering. As seen in Figure 6, rendering isosurfaces together with the droplet trajectory does not provide a clear way of visualizing the interaction between the droplet and the flow. Hence we propose a method called selective isosurface rendering, in which only those surfaces near a given point of interest are rendered. Since the surfaces must not be truncated or distorted, we cannot simply filter the vertices of the isosurface mesh based on distance. Instead, we perform connected component analysis on the isosurface mesh to obtain segments of faces that are connected together. Once this is done, we choose as seed points those vertices that lie close to the point of interest; only the segments that contain the seed points are rendered (a simplified sketch of this selection is given below). This framework was implemented using the Nvidia CUDA library. It was tested on an Intel Pentium Core 2 Duo PC with 3GB RAM and an Nvidia GeForce 570 video card with 1.25GB memory. The vorticity data and the particle trajectories were obtained from the simulation. The data consisted of vorticity values on a 256 × 256 × 256 grid. The isosurface estimation was carried out using the marching cubes implementation provided in the library. It was suitably modified to include the connected component analysis and segment selection. The algorithm was further optimized by selecting the faces directly using face-walking (traversing adjacent connected faces). The graphics were rendered at a frame rate of 20 fps at a resolution of 1024x1024. Figure 7 shows the result on the same data with selected isosurface rendering. We can extend this method to combine selected isosurfaces at different timesteps to create a time summary of the flow interaction. This is shown in Figure 8.
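The segment selection can be sketched on the CPU as follows (our C++ illustration using a union-find over shared vertices; the actual implementation runs on the GPU and walks adjacent faces directly, as described above).

#include <cmath>
#include <vector>

struct Vertex { double x, y, z; };
struct Face { int v[3]; };   // triangle given by three vertex indices

// Union-find used to group faces that share vertices into connected segments.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { for (int i = 0; i < n; ++i) parent[i] = i; }
    int find(int a) { return parent[a] == a ? a : parent[a] = find(parent[a]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Keep only those isosurface segments that come close to the point of interest
// (e.g. the collision location); all other segments are discarded so that they
// do not clutter the rendering.
std::vector<Face> selectSegments(const std::vector<Vertex>& verts,
                                 const std::vector<Face>& faces,
                                 const Vertex& poi, double radius) {
    UnionFind uf((int)verts.size());
    for (const Face& f : faces) {          // vertices of one face belong together
        uf.unite(f.v[0], f.v[1]);
        uf.unite(f.v[1], f.v[2]);
    }
    std::vector<char> keepRoot(verts.size(), 0);
    for (int i = 0; i < (int)verts.size(); ++i) {   // seed vertices near the POI
        double dx = verts[i].x - poi.x, dy = verts[i].y - poi.y, dz = verts[i].z - poi.z;
        if (std::sqrt(dx * dx + dy * dy + dz * dz) <= radius) keepRoot[uf.find(i)] = 1;
    }
    std::vector<Face> selected;
    for (const Face& f : faces)            // keep faces of the seeded segments only
        if (keepRoot[uf.find(f.v[0])]) selected.push_back(f);
    return selected;
}

int main() {
    std::vector<Vertex> verts = {{0, 0, 0}, {1, 0, 0}, {0, 1, 0}};
    std::vector<Face> faces = {{{0, 1, 2}}};
    Vertex poi = {0.1, 0.1, 0.0};
    std::vector<Face> kept = selectSegments(verts, faces, poi, 0.5);
    (void)kept;
    return 0;
}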
4 Conclusion
Visualizing the results of large-scale simulation studies grows more challenging with the increasing complexity and scale range of experiments. Most visualization systems require extensive user interaction, in-depth domain knowledge and manual scripting to generate illustrations of a desired phenomenon from the simulation results. Localizing the region of interest in time and space requires manually sifting through the entire dataset, which can be laborious and time consuming. The study of turbulent flow systems is an area that has been actively explored using numerical simulations. In this paper, we report a tool called CollisionExplorer which allows users to interactively explore the various collision events that occur in a simulation. We provide the user with a global map which contains a summary of all the collisions, and the user can then choose a specific collision based on his interest. Once a particular collision is chosen, we provide 2D and 3D region maps which display sparse and dense representations of the flow near the region of collision. Using plane projection and selective isosurface rendering we are able to provide efficient representations of dense volumetric data. In the future, the global map may be made more discriminative with respect to collision mechanisms, and more flow attributes may be included in both the global and region maps.
References 1. Ayala, O., Rosa, B., Wang, L.P., Grabowski, W.W.: Effects of turbulence on the geometric collision rate of sedimenting droplets. Part 1. Results from direct numerical simulation. New Journal of Physics 10(7) (July 2008) 2. Ayala, O., Grabowski, W.W., Wang, L.P.: A hybrid approach for simulating turbulent collisions of hydrodynamically-interacting particles. J. Comput. Phys. 225, 51–73 (2007) 3. Clyne, J., Mininni, P., Norton, A., Rast, M.: Interactive desktop analysis of high resolution simulations: application to turbulent plume dynamics and current sheet formation. New J. Phys. 9, 301 (2007) 4. Falkovich, G., Pumir, A.: Sling effect in collisions of water droplets in turbulent clouds. Journal of the Atmospheric Sciences 64(12), 4497–4505 (2007) 5. Gosink, L.J., Anderson, J.C., Bethel, W., Joy, K.I.: Variable interactions in querydriven visualization. IEEE Trans. Vis. Comput. Graph. 13(6), 1400–1407 (2007) 6. Johnson, C.R., Huang, J.: Distribution-driven visualization of volume data. IEEE Trans. Vis. Comput. Graph. 15(5), 734–746 (2009)
7. Kendall, W., Glatter, M., Huang, J., Peterka, T., Latham, R., Ross, R.: Expressive feature characterization for ultrascale data visualization. Journal of Physics (2010)
8. Law, C.C., Henderson, A., Ahrens, J.: An application architecture for large data visualization: a case study. In: Proceedings of the IEEE 2001 Symposium on Parallel and Large-data Visualization and Graphics, PVG 2001, pp. 125–128. IEEE Press, Piscataway (2001)
9. Lum, E., Ma, K.L., Clyne, J.: A hardware-assisted scalable solution for interactive volume rendering of time-varying data. IEEE Transactions on Visualization and Computer Graphics 8(3), 286–301 (2002)
10. McCormick, P., Anderson, E., Martin, S., Brownlee, C., Inman, J., Maltrud, M., Kim, M., Ahrens, J., Nau, L.: Quantitatively driven visualization and analysis on emerging architectures. Journal of Physics: Conference Series 125(1), 012095 (2008)
11. Roberts, J.C.: Exploratory visualization using bracketing. In: Costabile, M.F. (ed.) AVI, pp. 188–192. ACM Press, New York (2004)
12. Soni, B., Thompson, D., Machiraju, R.: Visualizing particle/flow structure interactions in the small bronchial tubes. IEEE Transactions on Visualization and Computer Graphics 14, 1412–1427 (2008)
13. Weber, G.H., Ahern, S., Bethel, E.W., Borovikov, S., Childs, H.R., Deines, E., Garth, C., Hagen, H., Hamann, B., Joy, K.I., Martin, D., Meredith, J., Prabhat, Pugmire, D., Rübel, O., Van Straalen, B., Wu, K.: Recent advances in VisIt: AMR streamlines and query-driven visualization. In: Numerical Modeling of Space Plasma Flows: Astronum-2009 (Astronomical Society of the Pacific Conference Series), vol. 429, pp. 329–334 (2010), LBNL-3185E
A Multi Level Time Model for Interactive Multiple Dataset Visualization: The Dataset Sequencer
Thomas Beer1, Gerrit Garbereder1, Tobias Meisen2, Rudolf Reinhard2, and Torsten Kuhlen1
1 Virtual Reality Group, Institute for Scientific Computing
2 Information Management in Mechanical Engineering
RWTH Aachen University
Abstract. Integrative simulation methods are used in engineering sciences today for the modeling of complex phenomena that cannot be simulated or modeled using a single tool. For the analysis of the result data, appropriate multi dataset visualization tools are needed. The inherently strong relations between the single datasets, which typically describe different aspects of a simulated process (e.g. phenomena taking place at different scales), call for special interaction metaphors allowing for an intuitive exploration of the simulated process. This work focuses on the temporal aspects of data exploration. A multi level time model and an appropriate interaction metaphor (the Dataset Sequencer) for the interactive arrangement of datasets in the time domain of the analysis space are described. They are usable for heterogeneous display systems ranging from standard desktop systems to immersive multi-display VR devices.
1 Introduction
Simulations have become a common tool in many fields of engineering today. An integral part of computational simulations is a subsequent analysis process, carried out to verify the underlying simulation model or to make decisions based on the simulation results. Since complex phenomena in engineering sciences can seldom be described using a single model, different aspects are examined separately. In the past this was done with rather low coherence between the single simulation models, sacrificing potential accuracy because of weak or neglected linkage between the models. With the advent of integrative simulation approaches that connect simulation models at the data level to create contiguous representations of the simulated processes, the need for appropriate analysis tools arises. A visualization of multiple datasets is often used to compare datasets to one another. This approach becomes even more important for the exploration of complex phenomena that are not identifiable until multiple datasets are analyzed in a coherent view. For a multiple dataset visualization this demands additional
functionality in terms of interactivity compared to traditional single dataset scenarios. The multiplicity of a multiple dataset visualization can be categorized into a horizontal and a vertical aspect [1], whereby the horizontal aspect addresses a temporal sequence of datasets and the vertical aspect addresses different datasets describing the same time interval from different perspectives (e.g. scales of scope). This work focuses on the interactive manipulation of the position of datasets on the horizontal axis as a means to explore the contiguous simulation results. The remainder of this paper is organized as follows. Section 2 briefly explains the simulation platform developed to model and execute integrative simulations. Aspects of multi dataset visualization, in contrast to single dataset visualizations, are discussed in Section 3. Section 4 outlines the idea of a multi level time model and refers to related work. Section 5 introduces the Dataset Sequencer interaction metaphor built on top of the aforementioned model. Finally, Section 6 concludes the work.
2 Integrative Simulation
In the domain of Materials Engineering, the idea of Integrated Computational Materials Engineering (ICME) has evolved. It addresses the integration and interconnection of different material models [2,3]. Different aspects of material evolution throughout a production process are simulated at different scales. A simulation platform has been developed [4,5,6] that enables collaborative simulation, utilizing interconnected simulation models and hardware resources, based on a grid-like infrastructure [7]. A key aspect of the coupling of simulation models is the correct translation of data between the single simulation models. For this, a data integration component has been developed that uses domain ontologies to assure semantically correct data integration [8]. Access to the simulation platform is given through a web interface that allows for collaborative visual design of simulation workflows [4]. Beyond the integrated simulation itself, the analysis of integrated simulation results is an important aspect for gaining knowledge and insight from those simulation efforts. The project concept of the simulation platform thus explicitly includes the aspect of visual data analysis. For the analysis of the integrated processes, the visualization application has to deal with the multiplicity of datasets, and especially with the additional degrees of freedom in the interaction handling induced by it. In a pre-processing step each of the datasets is manually prepared and visualization primitives are extracted, i.e. the object's surface geometry or, e.g., boundary edges of microstructural grains. In many cases domain-specific knowledge is required in order to extract meaningful visualization primitives, and thus this step cannot be automated completely yet. However, with the further development of the ontology-based knowledge representation of the data integrator, some extraction cases could be automated in the future.
3 Multiple Dataset Visualization
The main characteristic that distinguishes a multiple-dataset analysis from a single-dataset analysis is the common visualization context that needs to be established. At the technical level this imposes requirements on the data management and interaction components that address this multiplicity. Single datasets often already exceed the available resources of a visualization system; the integration of multiple datasets into a common visualization context does so even more. Thus, the incorporation of decimation techniques is mandatory. We have used different remeshing [9] and coarsening [10] approaches to prepare multiple resolution levels of the visualization primitives. Different detail levels can be selected at the beginning of a visualization session, e.g. to adapt to the current system's resources. Future work will focus on dynamic selection strategies that automatically adapt temporal and geometrical detail levels to the analysis context (cf. Section 6) at runtime. A fundamental component for dealing with multiple datasets is a data structure that provides structural, topological and logical meta information at different levels of abstraction. On the lowest level, the relations between files on disk and how they build up the single datasets (spatially and temporally) have to be modeled. The interaction components in the system especially have to account for the multiplicity of entities at runtime, and thus more information is needed about the relations (spatial, temporal and logical) between the multiple datasets in the visualization context. Some modules of the runtime system need even more specific meta information, e.g. the connection between different color transfer functions (lookup tables) and the datasets referring to them. Based on the concept of role objects [11], the presented application provides a central data model to all of the different software modules the application consists of, but at the same time maintains basic software engineering principles like high cohesion and loose coupling [12]. Each module is able to augment its view on the data structure with specific data without interfering with other modules or creating inter-dependencies, as would be the case with a naïve shared data structure implementation. Most existing multi dataset visualizations use a multi-view approach, consisting of side-by-side contexts arranged in a grid on the screen, with each cell containing a single dataset, e.g. [1,13,14]. Generic visualization tools like ParaView [15] or EnSight [16] contain some functionality to work with multiple datasets in multiple views or even to merge a small number of datasets into one. Those are useful for a side-by-side comparison of two datasets, e.g. real data and simulation data (EnSight's "Cases" feature), or for the analysis of partitioned datasets. But those solutions do not contain features that assist in an interactive analysis of multiple contiguous but heterogeneous datasets that are related in the time domain. In contrast to these "multi-context" approaches, the presented solution uses a single context, which in turn can be extended to span multiple windows or displays. This makes it usable in multi-screen and, most prominently, in immersive display systems, where the usage of multiple side-by-side contexts would disrupt the effect of immersion (cf. 5.1).
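The role-object idea can be sketched roughly as follows (class and method names are illustrative, not those of the actual system): each module attaches its own role object to a shared dataset entry, so module-specific data such as a lookup-table binding lives in the central model without coupling the modules to each other.

```python
class DatasetEntry:
    """A node of the central data model; modules attach role objects under their own key."""
    def __init__(self, name):
        self.name = name
        self._roles = {}

    def role(self, key, factory):
        """Return the role registered under key, creating it on first access."""
        if key not in self._roles:
            self._roles[key] = factory()
        return self._roles[key]

class RenderRole:
    """Example role: the rendering module stores its colour transfer function binding here."""
    def __init__(self):
        self.lookup_table = "default"

entry = DatasetEntry("macro_scale_heating")
entry.role("render", RenderRole).lookup_table = "temperature_lut"
# Other modules can attach their own roles without interfering with the render module.
```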
4 Multi Level Time Model
Time-variant data, which is the dominant kind of simulation result today, inherently contains a notion of time. When dealing with a single dataset in a visualization environment, the handling of spatial and temporal frames is conceptually not that critical. In practice both the spatial and the temporal dimensions are scaled to give a decent insight into the data, which could be interpreted as a normalization of the data presentation into the spatial and temporal attention span of the user and the display system they are using. For a visualization of multiple datasets that have strong inter-relations, the spatial and temporal placement and their interactive manipulation have to be considered with much more attention. Although the spatial component in visualization and interaction is important too, this work is focused on the temporal aspects. In the authors' opinion, this aspect needs to be taken into account more for the complex visualization scenarios that are becoming more important today and probably will be even more so in the future. An example for this is the aforementioned domain of ICME, which inherently needs to handle simulation data from different sources with heterogeneous temporal and spatial resolutions, modeling different aspects of material behavior during a production process [3]. Other domains of simulation sciences are most likely facing the same problems with the growing number of integrative simulation approaches. Handling the temporal relations between datasets in an interactive environment introduces an additional degree of freedom to the analysis space: while all time instants of a single time-varying dataset can be reached with a linear navigation on the time axis, this is not enough for the multiple dataset case. To provide a flexible analysis environment, the temporal relations between the datasets inside the analysis environment have to be handled dynamically. This means that beyond the time navigation itself, the manipulation of the single datasets' placement on the timeline has to be considered as well, to allow for the interactive comparative analysis of temporally related datasets.
4.1 Related Work
The idea of a temporal hierarchy is known and used in the fields of 3D animation and motion graphics. Work in the field of real-time graphics, like the X3D standard [17], also includes definitions of node types and so-called event routes that could be utilized to process time information and thus to integrate a time hierarchy. But in the field of scientific visualization the notion of time is mostly limited to the mapping of discrete timesteps to a single linear time axis. Navigation metaphors stick to VCR-like controls, allowing only for linear navigation [18] and not for further interaction with the time domain. For single-dataset visualization this apparently is sufficient, though. For exploratory analysis of multiple time-varying datasets the notion of a temporal hierarchy becomes important. Aoyama et al. [1] have proposed an application named TimeLine which includes – as the name suggests – a timeline view containing thumbnails for the multiple data sources the visualization is currently using.
However, no interaction happens with the timeline visualization, nor with the temporal relations between the single data sources in general. The work of Khanduja [13] concentrates on multi dataset visualization, focusing on interactive comparison of volume data, and addresses several aspects arising from this in comparison to single-dataset visualization. Temporal data coherence between timesteps in different datasets is exploited to reduce the runtime data load, but neither the temporal relation between the datasets nor any interaction in the time domain is addressed in his work. Wolter et al. [19] have proposed an abstract time model that describes the mapping between discrete time steps of simulation data and continuous time values, representing different aspects of time: discrete time steps represent a single time instant of the simulation. The simulation time refers to the time that a timestep represents in the original simulation's time frame. The visualization time normalizes the duration of the dataset into the interval [0..1]. The user time relates to the real-world time a user is acting in while viewing and navigating through an interactive visualization. While this time model fits very well for the handling of arbitrarily sampled datasets, it is too granular from the viewpoint of handling multiple datasets in a dynamic fashion. The relation between multiple datasets is described in the continuous time domain and does not directly refer to the discretization of timesteps in each single dataset. A simple example clarifies the problem: if a dataset is moved on the time axis beyond the current boundaries of the overall visualization time, the normalized mappings of time values to all of the discrete timesteps of the overall simulation would have to be recalculated. The relative position of all discrete timesteps in relation to the overall timespan of the analysis context changes in that moment. For the static arrangement of multiple datasets this model is suitable; for dynamic behaviour at runtime, however, additional work is needed.
4.2 Multi Level Approach and Runtime Time Model
The proposed solution for enabling a dynamic handling of temporal relations consists in building a hierarchy of multiple local time models as proposed by Wolter et al. [19]. Each single dataset's internal discretization issues are encapsulated by this very well, allowing each one to be controlled by the normalized visualization time. The temporal relation between the datasets (i.e. their local time models) and the global timeline in which they are embedded can then be modeled as a graph, describing how the incoming global time needs to be transformed in order to make each dataset aware of its current normalized visualization time. The transformations that are needed to place each dataset in the temporal hierarchy can be reduced to offset and scale. The meta information model includes a hierarchical representation of the data. From this, a directed acyclic graph of time-transformation nodes is built (cf. Figure 1). During the update traversal of the application's event loop, this graph is propagated with the application timestamp, which is then transformed through three levels (sequence, track, dataset, cf. Section 5) before the local time models are reached. At this point the time value has been transformed into the local normalized relative visualization time. The mapping of normalized time intervals to the discrete timesteps of the datasets is handled by each dataset's local time model [19]. An index structure is pre-computed that maps visualization time intervals to the appropriate timesteps. Finding the current timestep, i.e. the interval that contains the incoming time value, is thus realized by a lower-bound query to that index structure. Sequential access is further accelerated by caching the last query result as the starting point for the subsequent query. Hence the additional runtime overhead caused by the multi level approach is negligible compared to a fixed arrangement of multiple datasets in the global temporal context. The graph needed to achieve the dynamic arrangement of datasets at runtime is very small as it only involves the datasets as a whole, not each single timestep.
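A minimal sketch of the transformation chain and timestep lookup described above (class names, and the reduction of each level to a single offset/scale pair per node, are assumptions for illustration, not the actual implementation): the application timestamp is pushed through the sequence, track and dataset nodes, and the resulting normalized visualization time is mapped to a discrete timestep with a cached lower-bound (bisect) query.

```python
import bisect

class TimeTransform:
    """One node of the time hierarchy: local time = (incoming time - offset) / scale."""
    def __init__(self, offset=0.0, scale=1.0):
        self.offset, self.scale = offset, scale

    def apply(self, t):
        return (t - self.offset) / self.scale

class LocalTimeModel:
    """Maps normalized visualization time in [0, 1] to discrete timestep indices."""
    def __init__(self, interval_starts):
        self.starts = interval_starts   # pre-computed, sorted interval start times
        self._last = 0                  # cache of the last query result

    def timestep(self, t):
        i = self._last
        # reuse the cached interval for sequential access, otherwise do a lower-bound query
        if not (i + 1 < len(self.starts) and self.starts[i] <= t < self.starts[i + 1]):
            i = max(bisect.bisect_right(self.starts, t) - 1, 0)
        self._last = i
        return i

def resolve(app_time, sequence, track, dataset, local_model):
    """Propagate the application timestamp through sequence -> track -> dataset."""
    t = dataset.apply(track.apply(sequence.apply(app_time)))
    return local_model.timestep(min(max(t, 0.0), 1.0))
```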
Fig. 1. Multi-Level Time Model: The application timestamp is dynamically transformed into the local normalized visualization time which then is mapped to the appropriate timestep using a pre-computed index structure
5 Interaction - The Dataset Sequencer
The basic idea of an integrated process visualization is to provide the user with an environment for the analysis of the relations between the different process steps. Compared to the large body of work that can be found on interaction in the spatial domain, fewer approaches are known for the definition of temporal relations and appropriate interaction methods, and no methodology for this has been widely adopted yet in the field of data visualization. While interaction metaphors for the spatial domain can be derived from real-world behavior, e.g. drag-and-drop interaction, an appropriately intuitive real-world paradigm is not available for the manipulation of temporal relations. Thus, for interaction in the time domain, more abstract interaction metaphors have to be developed. The interaction tasks for the integrative multi dataset visualization can be separated into navigation and manipulation, just like spatially characterized interaction tasks [20]. For navigation, VCR-like controls are widely used (e.g. ParaView [15]) that allow for a linear navigation in the temporal dimension [18]. Interactive
control over the inter-relations of the single datasets, i.e. the sub-aspects of the simulated higher-level process, in the time domain can be categorized as a manipulation task: it changes the relation of the dataset to the global time axis and to other datasets in the visualization context. This allows the user to arrange side-by-side comparisons of similar sub-processes (horizontal axis), e.g. a material's behavior in a heating process, before and after a machining process that may change the material's behavior in the subsequent heating process. As multiple datasets at different scales may be involved, special care has to be taken regarding the temporal integrity of the displayed data. Datasets representing the same time intervals have to be consistently aligned in the time domain (interlock on the vertical axis). If this constraint is violated, an inconsistent and simply wrong visualization state occurs that could induce an incorrect mental map [21] of the visualized data in the user, invalidating the whole analysis process in the worst case. In the field of audio and video production, clip-based editing and organization of audio/video events on multiple parallel timelines is a common approach; e.g. with digital audio workstations like Cubase 1 these concepts have matured since the late 1980s. Our approach utilizes similar interaction and 2D visualization methods for the interactive placement of datasets on multiple tracks embedded into a global timeline. The idea of using such an interaction metaphor as a means for real-time interaction with a multi-dataset visualization is - to the best of our knowledge - a new approach.
Fig. 2. (a) 3D Visualization (b) 2D User Interface of the Dataset Sequencer
Figure 2 shows an integrated visualization scene and a first prototype of the presented Dataset Sequencer 2D interface. On a global horizontal timeline the datasets are represented as rectangles spanning the time interval they are active in. Multiple tracks are stacked vertically to place concurrent datasets, e.g. representing micro- and macro-structural simulations. The temporal position of a dataset in the overall context can thus be intuitively depicted and manipulated by drag-and-drop interaction. Additional hints, like the original alignment of datasets, help the user to keep track of the original context.
1 Steinberg Media Technologies GmbH.
Additional meta information about the datasets can be queried and manipulated inside the 2D interface (lower part). This allows for the exact setting of, e.g., the scale and offset values or the direct entry of begin/duration values for a dataset. The initial configuration, the logical relations and the grouping information are stored in the aforementioned meta data structure. Explicit begin time values as well as scale factors and offset values (temporal as well as spatial) can be used for the description of the initial setting. Another option provides an automatic alignment for the initial setup that simply queues each new dataset to the end of the track it has been selected for.
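A small sketch of this automatic alignment option (the names and the clip representation are illustrative assumptions): each dataset is placed on a track with a begin time and a temporal scale, and a newly added dataset is simply queued after the last clip of the selected track.

```python
class Clip:
    """A dataset placed on a track: active from begin for duration * scale time units."""
    def __init__(self, name, duration, begin=0.0, scale=1.0):
        self.name, self.duration = name, duration
        self.begin, self.scale = begin, scale

    @property
    def end(self):
        return self.begin + self.duration * self.scale

def append_to_track(track, clip):
    """Automatic alignment: queue the new dataset at the end of the selected track."""
    clip.begin = max((c.end for c in track), default=0.0)
    track.append(clip)
```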
5.1 Hybrid 2D/3D Interaction
The presented visualization application is targeted to drive heterogeneous display types, e.g. using head-tracked stereo rendering and immersive input devices as well as standard desktop workstations or notebooks. For the navigation tasks, VCR-like time-slider controls are embedded into the 3D visualization context (cf. Figure 2). Each dataset has a time slider attached that scrolls through its timesteps. The thereby selected timestep is displayed as long as the slider is grabbed or when the global time control is paused. An additional free-floating time slider controls the global time. This approach works well for navigation tasks in the time domain. Approaches to use the sliders for the manipulation of the temporal relations turned out to be of little use, as no valuable feedback could be provided to instantly represent the influence of the interactive manipulation on the global relations. Additional graphical objects in the scene could be used for this, but as the multiple datasets already occupy the visualization space and additional objects would just clutter up the scene, this approach was not pursued further. Thus this task was split off into the more abstract 2D Dataset Sequencer GUI. The target display systems for the multiple dataset visualization application are heterogeneous, thus the 2D interface and the core application communicate over a network interface, utilizing the aforementioned (cf. Section 3) meta data structure. Depending on the target display system, the Dataset Sequencer GUI can be used in a windowed application side-by-side with the 3D context on a desktop machine, on a separate "control panel"-like machine, e.g. in front of a power wall, or on a tablet device literally inside a fully immersive CAVE-like environment.
6 Conclusion and Future Work
We have presented interaction methods for the handling of multi dataset visualizations. Focusing on the temporal aspect of an integrative visualization context, a multi level time model has been introduced. A hybrid interaction metaphor for navigation and manipulation in the temporal dimension, suitable for heterogeneous display system architectures, has been developed that utilizes 2D and 3D interaction metaphors. The 2D interaction metaphors are based on non-linear
editing concepts found in the media production industry; utilizing this concept for interaction with a real-time visualization application, however, represents a new approach in the field of data visualization. The aim of the Dataset Sequencer UI is not to resemble all the highly sophisticated editing capabilities of audio or video production systems, but to develop interaction methods, inspired by these editing techniques, that ease and assist the analysis process in a multiple dataset visualization environment. The main focus of future work will be the data handling problem inherent in the handling of multiple datasets. Considering the temporal and geometrical resolution of the involved datasets as well as data-driven importance metrics and available system resources, heuristic detail selection methods will be researched. Those will provide context-sensitive behavior of a dynamic detail selection framework. Other aspects include the refinement of the 2D user interface; for example, the automatic alignment and vertical interlocking of datasets in the timeline will be improved. The incorporation of data plots into the sequencer view will provide the analyst with more guiding information, allowing for easier orientation in the integrated environment of the datasets. In the other direction, user-drawn graphs, sketched within the 2D interface, could be used as an importance metric for the detail selection methods. Acknowledgments. The depicted research has been funded by the German Research Foundation DFG as part of the Cluster of Excellence Integrative Production Technology for High-Wage Countries.
References
1. Aoyama, D.A., Hsiao, J.T.T., Cárdenas, A.F., Pon, R.K.: TimeLine and visualization of multiple-data sets and the visualization querying challenge. J. Vis. Lang. Comput. 18, 1–21 (2007)
2. Allison, J., Backman, D., Christodoulou, L.: Integrated computational materials engineering: A new paradigm for the global materials profession. JOM Journal of the Minerals, Metals and Materials Society 58, 25–27 (2006), doi:10.1007/s11837-006-0223-5
3. Rajan, K.: Informatics and Integrated Computational Materials Engineering: Part II. JOM 61, 47 (2009)
4. Beer, T., Meisen, T., Reinhard, R., Konovalov, S., Schilberg, D., Jeschke, S., Kuhlen, T., Bischof, C.: The Virtual Production Simulation Platform: from Collaborative Distributed Simulation to Integrated Visual Analysis. Production Engineering, 1–9 (2011), doi:10.1007/s11740-011-0326-x
5. Cerfontaine, P., Beer, T., Kuhlen, T., Bischof, C.H.: Towards a flexible and distributed simulation platform. In: Gervasi, O., Murgante, B., Laganà, A., Taniar, D., Mun, Y., Gavrilova, M.L. (eds.) ICCSA 2008, Part I. LNCS, vol. 5072, pp. 867–882. Springer, Heidelberg (2008)
6. Schmitz, G.J., Prahl, U.: Toward a Virtual Platform for Materials Processing. JOM 61, 19–23 (2009)
7. Foster, I., Kesselman, C.: The Grid 2 – Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (2004) ISBN 978-1-55860-933-4
8. Meisen, T., Meisen, P., Schilberg, D., Jeschke, S.: Application Integration of Simulation Tools Considering Domain Specific Knowledge. In: Proceedings of the International Conference on Enterprise Information Systems (2011)
9. Valette, S., Chassery, J.M.: Approximated Centroidal Voronoi Diagrams for Uniform Polygonal Mesh Coarsening. Computer Graphics Forum (Eurographics 2004 Proceedings) 23(3), 381–389 (2004)
10. Schroeder, W.J., Zarge, J.A., Lorensen, W.E.: Decimation of triangle meshes. SIGGRAPH Comput. Graph. 26, 65–70 (1992)
11. Fowler, M.: Dealing with Role Objects. PLoP (2007)
12. Stevens, W.P., Myers, G.J., Constantine, L.L.: Structured Design. IBM Systems Journal 13, 115–139 (1974)
13. Khanduja, G.: Multiple Dataset Visualization (MDV) Framework for Scalar Volume Data. PhD thesis, Louisiana State University and Agricultural and Mechanical College (2009)
14. Bavoil, L., Callahan, S.P., Crossno, P.J., Freire, J., Vo, H.T.: VisTrails: Enabling interactive multiple-view visualizations. In: IEEE Visualization 2005, pp. 135–142 (2005)
15. Squillacote, A.: The ParaView Guide, 3rd edn. Kitware Inc. (2008)
16. Computational Engineering International Inc.: EnSight. http://www.ensight.com (last visited 2011-05-25)
17. ISO/IEC 19775-1:2008: Information technology – Computer graphics and image processing – Extensible 3D (X3D) – Part 1: Architecture and base components. ISO, Geneva, Switzerland (2008)
18. Bryson, S., Johan, S.: Time management, simultaneity and time-critical computation in interactive unsteady visualization environments. In: Proceedings of the 7th Conference on Visualization, VIS 1996, pp. 255–261. IEEE Computer Society Press, Los Alamitos (1996)
19. Wolter, M., Assenmacher, I., Hentschel, B., Schirski, M., Kuhlen, T.: A Time Model for Time-Varying Visualization. Computer Graphics Forum, 1561–1571 (2009)
20. Bowman, D.A., Kruijff, E., LaViola, J.J., Poupyrev, I.: 3D User Interfaces: Theory and Practice. Addison Wesley Longman Publishing Co., Inc., Redwood City (2004)
21. Misue, K., Eades, P., Lai, W., Sugiyama, K.: Layout Adjustment and the Mental Map. Journal of Visual Languages & Computing 6, 183–210 (1995)
Automatic Generation of Aesthetic Patterns with the Use of Dynamical Systems
Krzysztof Gdawiec, Wieslaw Kotarski, and Agnieszka Lisowska
Institute of Computer Science, University of Silesia, Poland
{kgdawiec,kotarski,alisow}@ux2.math.us.edu.pl
Abstract. The aim of this paper is to present some modifications of the orbit generation algorithm for dynamical systems. The well-known Picard iteration is replaced by a more general one, the Krasnoselskij iteration. Instead of one dynamical system, a set of them may be used. The orbits produced during the iteration process can be modified with the help of a probabilistic factor. By generating aesthetic orbits of dynamical systems, one can obtain unrepeatable collections of nicely looking patterns. Their geometry can be enriched by the use of three colouring methods. The results of the paper can inspire graphic designers who may be interested in subtle aesthetic patterns created automatically.
1 Introduction
There are many domains in which aesthetic value plays an important role, e.g. architecture, jewellery design, fashion design, etc. Judging aesthetics is a highly subjective task; different people have different beauty appreciation principles. In formal considerations the following features of pattern geometry aesthetics are taken into account [12]: golden ratio, symmetries (rotational, logarithmic, spiral, mirror), complexity, compactness, connectivity and fractal dimension. Colours give an additional contribution to the beauty of patterns. In this paper we do not formally evaluate the aesthetics of the generated patterns. We base our judgement on the opinions of our colleagues and students who have seen our patterns. The majority of them said that the patterns looked nice and might stimulate the creativity of designers and reduce the number of physical prototypes of patterns. Usually most of the work during a design stage is carried out manually by a designer, especially in cases in which a graphic design should contain some unique, unrepeatable artistic features. Therefore, it is highly useful to develop an automatic method for aesthetic pattern generation. In recent years several approaches to create nicely looking patterns have been described in the literature. For example, in [8], [12] methods based on Iterated Function Systems (IFS) and Genetic Algorithms (GA) for jewellery design were proposed. The approach presented in [12] is limited to 2D patterns. Iterated Function Systems and the Gumowski-Mira transform were used for fashion design in [7]. An interesting method based on root-finding of polynomials, called polynomiography, was presented in [4]. Polynomiography, patented in the USA in 2005, produces nicely looking, highly predictable art patterns.
In this paper we present algorithms for aesthetic pattern generation using one dynamical system or a set of them. The dynamical systems used in our research generate nicely looking, aesthetic, complex geometric shapes, different from those created with the help of IFS, GA or polynomiography-based methods. Additionally, we propose three colouring algorithms to enrich the aesthetic value of the generated shapes. In Section 2 basic information about dynamical systems, with examples producing nicely looking orbits, is presented. Section 3 describes two algorithms for pattern generation and three further ones for colouring the obtained patterns. Sample results are presented in Section 4. Finally, in Section 5 some concluding remarks and plans for future work are given.
2 Dynamical System
Let us start with the definition of a dynamical system [1].
Definition 1. A dynamical system is a transformation f : X → X on a metric space (X, d).
Next, we define the orbit of a dynamical system [1].
Definition 2. Let f : X → X be a dynamical system on a metric space (X, d). The orbit of a point x ∈ X is the sequence {x_n}_{n=0}^∞, where
  x_n = (f ∘ ... ∘ f)(x) = f^{∘n}(x)   (n-fold composition).   (1)
For n > 0, equation (1) can be written in the following form:
  x_n = f^{∘n}(x) = f(f^{∘(n−1)}(x)) = f(x_{n−1}).   (2)
Iteration (2) is called the Picard iteration and it is usually used to generate the orbit of a given point for any dynamical system. In the rest of the paper the space ℝ² with the Euclidean metric is used as the metric space (X, d). Many examples of dynamical systems are known [6], but we are mainly interested in those which produce geometric patterns that can be recognized as aesthetic ones. Now, we present examples of such dynamical systems:
– Gumowski-Mira transformation (CERN, 1980) [3]:
    x_n = y_{n−1} + α(1 − 0.05 y_{n−1}²) y_{n−1} + g(x_{n−1}),
    y_n = −x_{n−1} + g(x_n),   (3)
  where α, μ ∈ ℝ and g : ℝ → ℝ is defined as follows:
    g(x) = μx + 2(1 − μ)x² / (1 + x²),   (4)
– Martin or Hopalong transformation [5]:
    x_n = y_{n−1} − sgn(x_{n−1}) √|b x_{n−1} − c|,
    y_n = a − x_{n−1},   (5)
  where a, b, c ∈ ℝ and sgn : ℝ → ℝ is defined as follows:
    sgn(x) = −1 if x < 0,  0 if x = 0,  1 if x > 0,   (6)
– Zaslavsky transformation [6]:
    x_n = (x_{n−1} + K sin y_{n−1}) cos α + y_{n−1} sin α,
    y_n = −(x_{n−1} + K sin y_{n−1}) sin α + y_{n−1} cos α,   (7)
  where K ∈ ℝ, α = 2π/q, q ∈ ℕ, q ≥ 3,
– Chip transformation, created by Peters for the HOP program [9]:
    x_n = y_{n−1} − sgn(x_{n−1}) cos(ln|b x_{n−1} − c|)² · arctan(ln|c x_{n−1} − b|)²,
    y_n = a − x_{n−1},   (8)
  where a, b, c ∈ ℝ,
– Quadrup Two transformation, also created by Peters for the HOP program [9]:
    x_n = y_{n−1} − sgn(x_{n−1}) sin(ln|b x_{n−1} − c|) arctan(c x_{n−1} − b)²,
    y_n = a − x_{n−1},   (9)
  where a, b, c ∈ ℝ.
The other examples of dynamical systems producing interesting orbits are: Three Ply [9], Quadrup Two [9] and Cockatoo [5]. In Fig. 1 the examples of orbits for transformations (3), (5), (7), (8), (9) are presented.
3 Automatic Generation of Patterns
In this section we present the algorithms for pattern generation using one dynamical system or a set of them. The obtained geometry of the shapes is further coloured with the help of three colouring algorithms. In both algorithms a more general iteration process than the Picard iteration is used.
Definition 3 ([2]). Let T : E → E be a selfmap on a real normed space (E, ‖·‖), x0 ∈ E and λ ∈ [0, 1]. The sequence {x_n}_{n=0}^∞ given by
  x_{n+1} = (1 − λ)x_n + λT(x_n)   (10)
is called the Krasnoselskij iteration procedure or simply the Krasnoselskij iteration.
Fig. 1. The examples of orbits (n = 70000), the top row (from the left): Gumowski-Mira, Hopalong, Zaslavsky, the bottom row (from the left): Chip, Quadrup Two.
It is easy to see that the Krasnoselskij iteration for λ = 1 reduces to the Picard iteration. In [11] the Krasnoselskij iteration has been used to obtain a new class of superfractals and in [10] to obtain a new class of Julia sets. In the first algorithm we use only one dynamical system, a starting point [x0, y0]^T, the number of iterations n and a value of the parameter λ ∈ [0, 1]. The orbit of the dynamical system is generated by applying, at every iteration step, a randomly chosen Picard or Krasnoselskij iteration. The algorithm is presented in Algorithm 1.
Algorithm 1. Pattern generation with the use of one dynamical system
Input: [x0, y0]^T – starting point, n ∈ ℕ – number of iterations, f : ℝ² → ℝ² – dynamical system, λ ∈ [0, 1]
Output: sequence of points [x0, y0]^T, ..., [xn, yn]^T forming the pattern
for i = 1 to n do
  draw a number r ∈ [0, 1];
  if r < 0.5 then
    [xi, yi]^T = f([x_{i−1}, y_{i−1}]^T);
  else
    [xi, yi]^T = λ · f([x_{i−1}, y_{i−1}]^T) + (1 − λ)[x_{i−1}, y_{i−1}]^T;
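A direct transcription of Algorithm 1 into Python might look as follows (a sketch; the dynamical system f is any map ℝ² → ℝ², e.g. one of the transformations from Section 2):

```python
import random

def generate_pattern(f, x0, y0, n, lam):
    """Algorithm 1: at each step apply either a Picard or a Krasnoselskij iteration of f."""
    points = [(x0, y0)]
    x, y = x0, y0
    for _ in range(n):
        fx, fy = f(x, y)
        if random.random() < 0.5:                 # Picard step
            x, y = fx, fy
        else:                                     # Krasnoselskij step, Eq. (10)
            x, y = lam * fx + (1 - lam) * x, lam * fy + (1 - lam) * y
        points.append((x, y))
    return points
```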
In the second algorithm we use: a set of dynamical systems {f1, ..., fk}, probabilities p1, ..., pk such that Σ_{i=1}^{k} p_i = 1, p_i > 0, a starting point [x0, y0]^T, the number of iterations n and a value of parameter λ ∈ [0, 1]. In each iteration step one dynamical system is drawn (e.g. fj) from the set of them according to the given probability distribution. The point from the previous iteration is then transformed by fj with the use of the Krasnoselskij iteration. The procedure described above is presented in Algorithm 2.

Algorithm 2. Pattern generation with the use of a set of dynamical systems
Input: [x0, y0]^T – starting point, n ∈ ℕ – number of iterations, {f1, ..., fk} – dynamical systems, f_i : ℝ² → ℝ², p1, ..., pk – probabilities, Σ_{i=1}^{k} p_i = 1, λ ∈ [0, 1]
Output: sequence of points [x0, y0]^T, ..., [xn, yn]^T forming the pattern
for i = 1 to n do
  draw a number j ∈ {1, ..., k} according to the probability distribution {p1, ..., pk};
  [xi, yi]^T = λ · fj([x_{i−1}, y_{i−1}]^T) + (1 − λ)[x_{i−1}, y_{i−1}]^T;
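Algorithm 2 can be sketched in the same spirit; here random.choices draws the index of the dynamical system according to the given probability distribution (a sketch, not the authors' code):

```python
import random

def generate_pattern_set(systems, probs, x0, y0, n, lam):
    """Algorithm 2: draw a system f_j with probability p_j and apply a Krasnoselskij step."""
    points = [(x0, y0)]
    x, y = x0, y0
    for _ in range(n):
        f = random.choices(systems, weights=probs, k=1)[0]
        fx, fy = f(x, y)
        x, y = lam * fx + (1 - lam) * x, lam * fy + (1 - lam) * y
        points.append((x, y))
    return points
```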
By using Algorithm 1 or 2 we obtain a sequence of points [x0, y0]^T, ..., [xn, yn]^T forming a pattern. That pattern can be easily modified by joining all consecutive points with lines. We can also skip some points (with a predefined step) in the joining process. These modifications often create nicely looking geometrical shapes. The bare geometry of the shapes can be enriched by using colours, because colour plays an important role in pattern perception. If we use a wrong palette of colours, or the distribution of colours over the pattern is wrong, then the pattern might be considered unattractive or not aesthetic. So, we use three different algorithms to colour the points of the patterns: distance colouring (Algorithm 3), iteration step colouring (Algorithm 4) and mixed colouring (Algorithm 5). The colouring according to distance is presented in Algorithm 3. In this algorithm we use the points for which we want to compute the colour, a colour map (a table of K colours) and an arbitrary metric on ℝ². First, the bounding box of the points is determined together with its centre and the half length of its diagonal D. Next, for each point the distance between this point and the centre of the bounding box is computed. This distance is divided by D, giving a number in [0, 1]. Finally, the index of the colour in the given colour map is computed by transforming the number from [0, 1] to a number belonging to the set {0, 1, ..., K − 1}. The colouring according to the iteration step is presented in Algorithm 4. In this algorithm we use only the points for which the colours are computed and a colour map (a table of K colours). For each point the quotient of its number and the total number of points is computed, giving a number in [0, 1]. Next, by transforming this number to a number belonging to the set {0, 1, ..., K − 1}, an index of the colour from the given colour map is obtained. The last colouring algorithm (mixed colouring) is presented in Algorithm 5. In this algorithm, similarly to the two algorithms presented earlier, we use the points, a colour map and a distance on ℝ². The colour index is determined as the mean value of the indices obtained in Algorithms 3 and 4.
Algorithm 3. Distance colouring
Input: [x0, y0]^T, ..., [xn, yn]^T, rgb[0..K−1] – colour map, K – number of colours, d : ℝ² × ℝ² → [0, +∞) – metric
Output: colours c0, ..., cn
Find the bounding box of the given points. Let [xc, yc]^T be the centre of the bounding box and D be the half length of its diagonal;
for i = 0 to n do
  j = ⌊(K − 1) · d([xc, yc]^T, [xi, yi]^T) / D⌋;
  ci = rgb[j];

Algorithm 4. Iteration step colouring
Input: [x0, y0]^T, ..., [xn, yn]^T, rgb[0..K−1] – colour map, K – number of colours
Output: colours c0, ..., cn
for i = 0 to n do
  j = ⌊(K − 1) · i/n⌋;
  ci = rgb[j];

Algorithm 5. Mixed colouring
Input: [x0, y0]^T, ..., [xn, yn]^T, rgb[0..K−1] – colour map, K – number of colours, d : ℝ² × ℝ² → [0, +∞) – metric
Output: colours c0, ..., cn
Find the bounding box of the given points. Let [xc, yc]^T be the centre of the bounding box and D be the half length of its diagonal;
for i = 0 to n do
  j = ⌊(1/2)(K − 1) · (d([xc, yc]^T, [xi, yi]^T)/D + i/n)⌋;
  ci = rgb[j];
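The three colouring schemes translate directly into code; the sketch below uses the Euclidean distance as the metric d and assumes at least two distinct points (it is an illustration, not the authors' implementation):

```python
import math

def bbox_centre_and_halfdiag(points):
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    centre = ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)
    D = math.hypot(max(xs) - min(xs), max(ys) - min(ys)) / 2.0   # half diagonal
    return centre, D

def distance_colouring(points, rgb):                 # Algorithm 3
    (xc, yc), D = bbox_centre_and_halfdiag(points)
    K = len(rgb)
    return [rgb[int((K - 1) * math.hypot(x - xc, y - yc) / D)] for x, y in points]

def iteration_colouring(points, rgb):                # Algorithm 4
    K, n = len(rgb), len(points) - 1
    return [rgb[int((K - 1) * i / n)] for i in range(len(points))]

def mixed_colouring(points, rgb):                    # Algorithm 5
    (xc, yc), D = bbox_centre_and_halfdiag(points)
    K, n = len(rgb), len(points) - 1
    return [rgb[int(0.5 * (K - 1) * (math.hypot(x - xc, y - yc) / D + i / n))]
            for i, (x, y) in enumerate(points)]
```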
4 Examples
Examples of results produced by the proposed algorithms are presented in Figs. 2-5. In Fig. 2 we can see an example of pattern colouring. In the upper part there is the colour map used to colour the pattern, and in the bottom part there is the same pattern coloured by the use of the three proposed colouring algorithms, from the left: the distance colouring, the iteration step colouring and the mixed colouring. An example of a pattern obtained from Algorithm 1 is presented in Fig. 3. In the upper part we can see the original orbit obtained by the use of the Picard iteration
Fig. 2. The example of pattern colouring. The colour map (top), the coloured pattern (bottom from the left: the distance, the iteration step, the mixed colouring).
Fig. 3. The sample patterns obtained from Algorithm 1. The original orbit (top), the obtained patterns (bottom).
and in the bottom part we can see three examples of patterns obtained with the first algorithm and coloured with the use of the proposed methods. As one can see, the obtained patterns have very diverse forms and differ from the original one. Another example is presented in Fig. 4. In this case Algorithm 2 was used to generate the patterns. The set of dynamical systems consists of the two systems presented in the upper part of the figure. In the lower part of the figure the obtained patterns are presented. We can see that some of the patterns have inherited parts of the original patterns' geometry. We can also note that we have obtained quite new geometrical forms. In the last example, presented in Fig. 5, patterns obtained from Algorithm 2 are shown. In the upper part the patterns that form the set of
Fig. 4. The sample patterns obtained from Algorithm 2. The original orbits (top), the obtained patterns (bottom).
Fig. 5. The sample patterns obtained from Algorithm 2. The original orbits (top), the obtained patterns with points joined with lines (bottom).
dynamical systems are presented. The lower part presents the obtained patterns, but this time instead of drawing points we have joined them by lines, omitting points with different step sizes. The colour of each line was determined according to the colour of the line's starting point, which was computed using one of the proposed colouring algorithms. The obtained examples, as one can see, form very interesting patterns.
5 Conclusions
In this paper two new algorithms for pattern generation with the use of dynamical systems are presented. The first algorithm applies only one dynamical system and, at each step, randomly draws the iteration procedure to be used, either the Picard or the Krasnoselskij iteration. In the second one a set of dynamical systems is applied, and the dynamical system which transforms the point by the Krasnoselskij iteration is chosen randomly. Using these two algorithms one can obtain new and interesting patterns that are very sensitive with respect to the λ parameter from the Krasnoselskij iteration. The best results can be obtained for λ ∈ [0.99, 1], which means λ should be close to 1. Our algorithms generate new and unrepeatable patterns which differ from the original ones, whereas the method from [7], which is also based on dynamical systems, generates patterns which are determined by the form of the used dynamical system and cannot create new patterns. Colour plays an important role in obtaining aesthetic patterns, so we proposed three colouring algorithms. The first one is based on the distance from the centre of the points' bounding box, the second one is based on the iteration number, and the third one is the average of the two previous methods. The obtained patterns have an aesthetic value, so they can be used, e.g., as textile patterns, ceramics patterns, etc., or in jewellery and decoration design. In our further research we will try to replace the Krasnoselskij iteration by other types of iterations, e.g. Mann (the Krasnoselskij iteration is a special case of this iteration type), Ishikawa and ergodic iterations [2]. Additionally, we will try to use texture synthesis techniques to create not only coloured but also textured patterns [13]. We would also like to concentrate on finding an automatic evaluation procedure which tells the user whether the obtained pattern satisfies some formal predefined criteria of pattern aesthetics evaluation. Such automatic evaluation procedures for fractal patterns can be found, e.g., in [8], [12].
References
1. Barnsley, M.: Fractals Everywhere. Academic Press, Boston (1988)
2. Berinde, V.: Iterative Approximation of Fixed Points, 2nd edn. Springer, Heidelberg (2007)
3. Gumowski, I., Mira, C.: Recurrences and Discrete Dynamic Systems. Springer, New York (1980)
4. Kalantari, B.: Polynomial Root-Finding and Polynomiography. World Scientific, Singapore (2009)
5. Martin, B.: Graphic Potential of Recursive Functions. In: Lansdown, J., Earnshaw, R.A. (eds.) Computers in Art, Design and Animation, pp. 109–129. Springer, Heidelberg (1989)
6. Morozov, A.D., Dragunov, T.N., Boykova, S.A., Malysheva, O.V.: Invariant Sets for Windows. World Scientific, Singapore (1999)
7. Naud, M., Richard, P., Chapeau-Blondeau, F., Ferrier, J.L.: Automatic Generation of Aesthetic Images for Computer-assisted Virtual Fashion Design. In: Proceedings 10th Generative Art Conference, Milan, Italy (2007)
8. Pang, W., Hui, K.C.: Interactive Evolutionary 3D Fractal Modeling. Visual Computer 26(12), 1467–1483 (2010)
9. Peters, M.: HOP – Fractals in Motion, http://www.mpeters.de/mpeweb/hop/
10. Rani, M., Agarwal, R.: Effect of Stochastic Noise on Superior Julia Sets. Journal of Mathematical Imaging and Vision 36(1), 63–68 (2010)
11. Singh, S.L., Jain, S., Mishra, S.N.: A New Approach to Superfractals. Chaos, Solitons and Fractals 42(5), 3110–3120 (2009)
12. Wannarumon, S., Bohez, E.L.J.: A New Aesthetic Evolutionary Approach for Jewelry Design. Computer-Aided Design & Applications 3(1-4), 385–394 (2006)
13. Wei, B.G., Li, J.P., Pang, X.B.: Using Texture Synthesis in Fractal Pattern Design. Journal of Zhejiang University Science A 7(3), 289–295 (2006)
A Comparative Evaluation of Feature Detectors on Historic Repeat Photography
Christopher Gat, Alexandra Branzan Albu, Daniel German, and Eric Higgs
University of Victoria, BC, Canada
{cgat,aalbu,dmg,ehiggs}@uvic.ca
Abstract. This study reports on the quantitative evaluation of a set of state-of-the-art feature detectors in the context of repeat photography. Unlike most related work, the proposed study assesses the performance of feature detectors when intra-pair variations are uncontrolled and due to a variety of factors (landscape change, weather conditions, different acquisition sensors). There is no systematic way to model the factors inducing image change. The proposed evaluation is performed in the context of image matching, i.e. in conjunction with a descriptor and matching strategy. Thus, beyond just comparing the performance of these detectors, we also examine the feasibility of feature-based matching on repeat photography. Our dataset consists of a set of repeat and historic image pairs that are representative of the database created by the Mountain Legacy Project, www.mountainlegacy.ca.
Keywords. Feature Detectors, Image Matching, Repeat Photography, Rephotography, Computational Photography, Evaluation.
1 Introduction
Repeat photography (also called rephotography) is the process of repeating a photograph from the same vantage point as a reference photo. When the time passed between the repeat and the reference image is substantial, the differences in content can be dramatic. Repeat photographs have been used to raise interest in urban change through New York Changing [1] and the Then and Now [2] book series. Changes in natural landscapes have also been recorded in several repeat photography projects. The Second View and Third View books [3-4] have repeated landscape images for the purpose of exhibiting the change that has occurred in the American Midwest over the last 100 years. The Mountain Legacy Project (MLP) [5] houses the largest systematic collection of historic survey images in the world (est. over 140,000) and is planning to repeat these photographs on a regular basis. These repeat photograph pairs have been used as evidence for climate change and research in environmental studies and ecological restoration [6-9]. Typical tasks in the repeat photography process involve the determination of the geographical location and field of view of the original photo (e.g. near the top of a certain mountain) and the manual matching/alignment of the original and repeat images. Computer vision can play a major role in both tasks. Our focus here is on image matching. Repeat photographs are rarely taken with the same camera and lens
as the original; therefore, even when a historic image has been well repeated, the images need to be scaled, rotated, and translated to produce the final alignment. Feature-based matching approaches [10-12] are well suited for image alignment. Feature-based matching involves three stages: detection, description, and matching. The detection phase deals with finding salient points/regions within an image that may be associated with a scale and shape. These feature points or regions can then be described by a descriptor, a method that characterizes the local structure of the point or region as a feature vector. The matching stage defines how feature vectors from each image are tested for correspondence. There is a variety of detection [13-18], description [17], [19-21], and matching methods [10], [17], [22], and in general these methods can be interconnected to form a feature-based matching system. The study by Bae and Agarwala [12] noted that the task of finding matching pairs of points is difficult due to differences in film response, aging, weather, time-of-day, and unknown camera parameters. Specifically, they concluded via a pilot study that the SIFT algorithm [17] was not robust enough to automatically find matching points between images and that user input was necessary to remedy this problem. No quantitative data was given related to the SIFT failure in these conditions. This paper investigates a similar research question to [12]. Specifically, the proposed study explores the feasibility of feature-based matching in the context of repeat photography, with a focus on the feature detection step. Our paper reports on the quantitative evaluation of four state-of-the-art feature detectors on a representative subset of the Mountain Legacy database [5]. Our evaluation of feature detectors is performed in the context of image matching, i.e. in conjunction with a feature descriptor and matching strategy. Our dataset consists of 73 pairs of pre-aligned repeat and historic images of various mountain landscapes. We choose to work with pre-aligned images in order to focus our study on the performance of feature detectors in the context of uncontrolled physical differences between the original and repeat components of each pair. Fig. 1 shows some examples of image pairs in our dataset. One may notice the variation in landscape, illumination, and weather conditions.
Fig. 1. Examples of image pairs from the Mountain Legacy Project
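As a concrete illustration of the three matching stages (not the evaluation protocol of this paper, which pairs several detectors with the SIFT descriptor), a detection/description/matching pipeline can be assembled with OpenCV's SIFT; the paths and distance threshold below are illustrative, and OpenCV 4.4+ is assumed.

```python
import cv2

def match_pair(historic_path, repeat_path, dist_thresh=250.0):
    """Detect, describe and match SIFT features between a historic and a repeat image."""
    img1 = cv2.imread(historic_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(repeat_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)    # detection + description
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)             # Euclidean distance on descriptors
    matches = matcher.match(des1, des2)              # nearest-neighbour matching
    return [m for m in matches if m.distance < dist_thresh]
```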
Due to the variety of changes between original photographs and their repeat counterparts, previous evaluations of feature detectors (see Section 2) do not adapt well to our dataset. Therefore, we contribute a novel approach to comparing feature
detectors for image matching on a database involving uncontrolled changes. The remainder of the paper is structured as follows. Section 2 overviews related work. Section 3 describes the proposed evaluation framework. Results are discussed in Section 4, and Section 5 draws conclusions.
2 Related Work
Evaluation of detectors and descriptors has been studied by many research groups [23-28]. The general focus of these evaluations is the performance comparison of detectors and/or descriptors when images undergo viewpoint, scale, rotation, and illumination transformations, under controlled variations of 3D and 2D scenes. Snavely and Seitz [29] propose an application (Photo tourism) based on feature matching on a complex image dataset; the application uses SIFT [17] to match points and retrieve structure from motion for reconstructing 3D representations of a scene from a large variety of photos. The dataset consists of large collections of photos over a large scene (e.g. a city plaza in Prague) taken by tourists at different perspectives, lighting conditions, times, and weather. The complexity of the matching problem is solved by the transitivity of the dataset, where image matching is performed with a many-to-many correspondence strategy. This technique has great potential for periodically repeated and spatially dense photography; however, it does not apply to our dataset, where photos have been repeated just once, with no spatial overlap between different pairs of images. Schindler et al. [30] add a temporal dimension to the photo tourism experience by sorting historic urban photographs based on changes in scene structure. Correspondences between historic images were established manually in [30] rather than automatically generated. Numerous evaluations have been done for detectors using scenes that relate geometrically via homographies. Schmid et al. [24] evaluated several interest point detectors on planar scenes with rotation and scale transformations with respect to the repeatability criterion. Repeatability is defined as the ability of a detector to identify a point in the scene after the scene was transformed in a known and controlled manner. Mikolajczyk and Schmid [15] later adapted the repeatability criterion to consider scale and affine invariant regions. Mikolajczyk et al. [31] compared a number of affine invariant detectors on two types of scenes (textured and structured) for image pairs containing different transformations (viewpoint, scale, rotation, illumination, blur, and JPEG compression). Moreels and Perona [27] expanded the evaluation of detectors and descriptors to non-planar scenes. They introduced a novel and automatic approach to create ground truths for 3D scenes and improved upon previous evaluations by increasing the size of the dataset; their tests included 100 object types viewed from 144 calibrated viewpoints under three different lighting conditions. Detectors and descriptors were evaluated as one single algorithmic unit. Fraundorfer and Bischof [25] also evaluated detectors on non-planar scenes, and found that the performance of detectors and descriptors is significantly reduced when applied to 3D scenes. Some evaluation studies are primarily focused on descriptors. Mikolajczyk et al. [28] and Carneiro [32] evaluated feature descriptors in the context of image retrieval. Mikolajczyk and Schmid [23] evaluated descriptors in the context of feature matching.
Application-specific evaluations have focused on the joint performance of detectors and descriptors. Mikolajczyk and Leibe [33] examined five detectors and descriptors for the purpose of object recognition. Gil et al. [26] evaluated the feature matching ability of detectors and descriptors in the context of vision-based simultaneous localization and mapping. Valgren and Lilienthal [34] examined the SIFT and SURF [35] methods (combined detection and description) for finding matching points on outdoor images that differ by seasonal changes, for the purpose of topological localization. All images in their dataset were taken with the same camera over a period of 9 months.

In contrast to these evaluations, our dataset is not defined by controlled sequences of transformations, and it includes scenes that incur change beyond the transformations previously studied. Due to these differences, we propose a novel evaluation framework and novel evaluation criteria with respect to previous related work.
3
Proposed Evaluation Methodology
The proposed evaluation methodology is structured as a three-step approach involving the selection of feature detectors to be evaluated, the definition of feature correspondence measures, and the specification of criteria for performance evaluation. All three steps are detailed below.

Selection of feature detectors. The detectors selected for this study are the Harris-Laplace [15], Hessian-Laplace [31], Difference-of-Gaussians [17], and Maximally Stable Extremal Regions (MSER) [36]. These methods represent the state-of-the-art in point- and region-based detectors [37]. Parameters for all detectors have been set to their generally adopted default values, in a manner consistent with previous evaluations [25], [31], [38].

Harris-Laplace: This method uses the Harris corner detector [16] to find interest points at multiple scales, then leverages the Laplacian-of-Gaussian operator for scale-space selection [39]. The output is a set of feature locations and scales.

Hessian-Laplace: This method finds interest points at multiple scales using the determinant of the Hessian, then performs automatic scale selection with the Laplacian-of-Gaussian. The output is a set of blob-like regions.

Difference-of-Gaussian (DoG): This method detects regions by searching for 3D local maxima in a scale space of difference-of-Gaussian images.

Maximally Stable Extremal Regions (MSER): The MSER features are connected components of a thresholded image. Only connected components that have a stable size over a range of thresholds are retained as features. As in previous evaluations [31], [33], we use the elliptical region approximation rather than the raw border of the connected component.

Feature Description. Once local features (points and regions) are detected, the next step is to describe the area around each feature. The SIFT descriptor [17] has been chosen, since it has been a de facto standard in commercial applications for over ten years. Furthermore, this descriptor has shown strong performance in several past evaluations [23], [27], [28]. The SIFT descriptor is computed as follows. In a
given region, the magnitude and orientation (quantized into 8 orientations) of the gradients are calculated. The region is then divided into a 4x4 grid, and for each grid cell an 8-bin histogram is calculated based on the orientations. The 16 (4x4) histograms are then concatenated to form a 128-element vector that is used for description. For illumination invariance, a SIFT-specific technique is used. The 128-element vector is normalized so that its vector length is 1. This normalization makes the descriptor invariant to affine illumination, because multiplicative and additive changes to all intensity values of a region do not affect the direction of the vector. SIFT also accounts for non-linear illumination change by clamping values of the normalized vector that are larger than 0.2. The vector is then renormalized to unit length. The motivation behind this method is that non-linear illumination changes are more likely to affect gradient magnitude than overall gradient orientation [17].

Fig. 2 shows examples of local features and their corresponding regions for the four detectors used in our evaluation. All detectors are applied to the same image pair.

Feature Correspondence. Once the two sets of feature vectors are computed for a pair of images, image matching may be attempted. The first step in the matching process is defining a distance between the feature vectors of each image. Two common distance measures are the Euclidean and Mahalanobis distances [21], [23]. Moreels and Perona [27] found that the ranking of the detectors and descriptors under comparison stayed the same regardless of which of these measures was used. Therefore, the Euclidean distance measure was chosen due to its ease of use. A match is determined by finding the feature vector from one image with the smallest distance to a feature vector from the second image (i.e. nearest-neighbor matching). This distance must also be below a distance threshold for a successful match.

Our dataset is treated in a similar manner to the Oxford dataset in [23], [31], since the geometric relationship between the image pairs can be described by a homography. It is assumed that the repeat photograph has been taken from the same viewpoint as the historic image, and therefore the homography that describes the geometric relationship between the two images is simply a 3x3 identity matrix. As explained in the introduction, this simplified case enables us to focus on the effect of physical changes (i.e. weather, landscape, illumination) on the image matching process.

As with previous evaluations [15], feature correspondence is defined by specifying an error tolerance for both location and overlap. Let a local feature be represented by its location x and its region μ (where location x is the center of mass of the region μ). Two features, (xa, μa) and (xb, μb), correspond in ground truth if: a) the location error εl is less than 6 pixels, where εl is the Euclidean distance between xa and xb; and b) the overlap error εo is less than 50%, where the overlap error is defined as the ratio of the area of the symmetric difference of the two regions to the area of their union. The same criteria were used by Mikolajczyk and Schmid [15], with different tolerance values (εl less than 1.5 pixels, εo less than 40%). The tolerance values used in this study are less restrictive, due to the frequent viewpoint shifts and lens distortions that occur in repeat and historic images.
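To make the matching and ground-truth correspondence criteria concrete, the following Python/NumPy sketch is given. It is not part of the original study; the function names and the boolean-mask representation of regions are our own assumptions. It illustrates nearest-neighbour matching under a distance threshold, and the location/overlap test above under the identity homography, so that both features are defined on the same image grid.

```python
import numpy as np

def match_features(desc_a, desc_b, dist_threshold):
    """Nearest-neighbour matching of 128-D SIFT descriptors (Euclidean distance).

    A feature in image A is matched to its nearest neighbour in image B only if
    that distance is below dist_threshold.  Returns a list of (i, j) index pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances to all of B
        j = int(np.argmin(dists))
        if dists[j] < dist_threshold:
            matches.append((i, j))
    return matches

def corresponds(xa, mask_a, xb, mask_b, loc_tol=6.0, overlap_tol=0.5):
    """Ground-truth correspondence test for two features.

    xa, xb: (row, col) centres of mass; mask_a, mask_b: boolean region masks on a
    common image grid (identity homography assumed).
    """
    loc_err = np.linalg.norm(np.asarray(xa, float) - np.asarray(xb, float))
    union = np.logical_or(mask_a, mask_b).sum()
    sym_diff = np.logical_xor(mask_a, mask_b).sum()
    overlap_err = sym_diff / union if union > 0 else 1.0
    return loc_err < loc_tol and overlap_err < overlap_tol
```

With elliptical region approximations, the masks would be obtained by rasterizing each ellipse onto the image grid before applying the test.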
Fig. 2. Examples of local features computed by the evaluated detectors (panels: MSER, Hessian-Laplace, Harris-Laplace, and Difference-of-Gaussian; only a portion of the corresponding features is shown). All detectors are applied to the same image pair.
Performance Criterion - The Pass Rate. Related work uses repeatability [25], [26], [31], the number of correct matches [31], [34], precision [23], [34], and recall [23] as criteria for evaluating the performance of feature detectors/descriptors for image matching. Some of these criteria are not applicable in the context of the proposed work, as detailed below.

Repeatability is not measurable in scenes subjected to physical changes. Accidental correspondences may be found in areas of the image where changes occur (see Fig. 3). Such accidental correspondences may artificially inflate the repeatability rate. The opposite is also true; a physical change in the scene may cause a feature not to be repeated, which has a negative influence on the repeatability rate.

Recall cannot be a basis of comparison when evaluating feature detectors, since each detector outputs a different set of features, and thus a different set of ground-truth matches. Previous work [23] uses recall when evaluating the performance of feature descriptors, rather than detectors.

Both precision and the number of correct matches are relevant to repeat photography; they have been previously used in a study on a dataset containing seasonal changes [34]. A correct match occurs when features correspond and meet predetermined matching criteria. Based on precision and the number of correct matches, we introduce a new performance criterion, namely the pass rate. This criterion is relevant for evaluating the performance of the descriptors according to the contextual requirements imposed by various applications. The number of correct matches needed depends on the application. For example, only two pairs of matching points are needed to resolve scale, rotation, and translation between a pair of images, while eight pairs of matching points are needed to recover the fundamental matrix between views of a 3D scene taken from different viewpoints [40].

The pass rate is defined as the percentage of image pairs out of the entire dataset that fall within the tolerances imposed on the precision (ptol) and number of correct
matches (ctol). The pass rate is computed with the process shown in Fig. 4. The process unfolds as follows. A detector-descriptor is applied to each image pair in the dataset, resulting in a set of feature vectors that are processed by the matching module. Matching is performed using the nearest-neighbor approach with a distance threshold. The distance threshold applies to the feature space and specifies the tolerance for matching; that is, if the nearest neighbor falls outside the distance threshold, then no match is found. A set of N different distance thresholds {t1, t2, …, tN} is applied. For each image pair, the precision and number of correct matches are computed, then compared to the tolerance values ptol and ctol; the image pair is passed or failed. The pass/fail decisions are aggregated into the pass rate over the entire dataset.
Fig. 3. Example of an "accidental correspondence" detected with the Hessian-Laplace. The feature on the left is oriented on forest clear-cut, while the feature on the right is oriented on forest cover.
Fig. 4. Overview of the computation of the pass rate
Three distance thresholds are considered: ti = 150, 250, and 350. These values were chosen in accordance with Ke and Sukthankar [21]; they found that SIFT operated optimally (in the context of image retrieval) with a distance threshold of 141, but suggested good performance may be obtained with a variety of thresholds.
As previously mentioned, different numbers of correct matches per image pair are necessary for different applications. Therefore, the pass rate is computed for 6 different tolerances on the number of correct matches: ctol = 2, 10, 25, 50, 100, and 200. Different precision values are also required for different applications. Therefore, the pass rate is computed at 6 precision tolerances: ptol = 50%, 40%, 30%, 20%, 10%, and strictly greater than 0%. The quantitative evaluation data are presented as a function of the tolerances for the distance threshold, precision, and number of correct matches. Section 4 discusses the results of the proposed evaluation process.
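As an illustration of how the pass rate could be computed in practice, the following sketch aggregates per-pair precision and correct-match counts over the dataset for a given pair of tolerances. It is not code from the study; the function names are ours, and gt_corresponds stands for the ground-truth test sketched earlier.

```python
def precision_and_correct(matches, gt_corresponds):
    """Precision and number of correct matches for one image pair.

    matches: list of (i, j) index pairs returned by the matcher at one threshold.
    gt_corresponds: callable (i, j) -> bool, the location/overlap ground-truth test.
    """
    correct = sum(1 for (i, j) in matches if gt_corresponds(i, j))
    precision = correct / len(matches) if matches else 0.0
    return precision, correct

def pass_rate(per_pair_results, p_tol, c_tol):
    """Fraction of image pairs meeting both the precision and correct-match tolerances.

    per_pair_results: list of (precision, n_correct) tuples, one per image pair,
    all computed at the same distance threshold.
    """
    passed = sum(1 for (p, c) in per_pair_results if p >= p_tol and c >= c_tol)
    return passed / len(per_pair_results)

# Example sweep over the tolerances used in the paper:
# thresholds = [150, 250, 350]
# c_tols = [2, 10, 25, 50, 100, 200]
# p_tols = [0.5, 0.4, 0.3, 0.2, 0.1, 1e-9]   # last one stands for "greater than 0%"
```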
4
Experimental Results and Discussion
The proposed evaluation process has been performed on a representative subset of 73 manually aligned repeat and historic landscape image pairs from the Mountain Legacy Project dataset [5]. The images depict both mountainous and foothill landscapes from the Canadian Rocky Mountains. The historic images were taken on mapping surveys between 1896 and 1928 using different cameras and lenses. The historic images were scanned in high definition (1800 dpi, 16 bit) from glass plate negatives at Library and Archives Canada. The repeat photographs were acquired with a Hasselblad H3D-39 digital medium format camera between 2008 and 2010. Lens distortions were removed with proprietary Hasselblad raw converter software.

The results of the quantitative evaluation are visualized in Fig. 5. This section first discusses and compares the performance of the detectors illustrated by the graphs in Fig. 5. Next, it compares our results with findings from other evaluation studies. Finally, some application-specific considerations are presented.

Discussion of Detector Performances. Ideally, one aims at obtaining feature matching with high precision and a high number of correct matches. However, there is a trade-off between precision and the number of correct matches. At low distance thresholds, the precision is expected to be better, since there is a stricter requirement on the similarity measure in the feature space. This stricter requirement also restricts the number of possible matches that will be returned. Therefore, when the distance threshold is raised, it is expected that more correct matches will be found at the cost of lower precision.

At a distance threshold of 150, the precision is higher, but at the cost of fewer correct matches and a lower pass rate (at lower precision values), because images with more changes do not have any matches at all. At a distance threshold of 250, the pass rate is higher at lower precision, and more image pairs pass at higher correct-match requirements. At a distance threshold of 350, the expected benefit is more correct matches, meaning a higher pass rate at higher correct-match requirements. However, there is very little benefit in this case (compare the 100-minimum-correct-match requirement between 250 and 350), while overall precision decreases.

The overall best performing detector is the Hessian-Laplace, followed by the DoG. At the lowest precision requirement, the DoG does show better performance.
Fig. 5. The pass rate graphs. Each row is associated with one distance threshold and each column with one minimum-number-of-correct-matches requirement. The curves show how the pass rate changes with the precision requirement.
Therefore, our study found that DoG finds more image pairs with at least some correct matches than Hessian-Laplace, but with worse precision. These detectors are able to detect corresponding features for every image pair in the dataset, and 2 or more correct matches for a large majority (90-99%) of these image pairs. Under the most difficult requirements, where a precision of 50% and 200 correct matches are required, the pass rate is very low (around 3% of the image pairs).

The Harris-Laplace and MSER have a subset of image pairs for which the precision requirement is stable (i.e. a less significant slope in the graphs), but suffer from the fact that their features cannot be matched for a large portion of the dataset. This is partially due to the absence of any corresponding features being detected between the image pairs. In the case of the Harris-Laplace, it was found that 10% of the image pairs had no corresponding features, while for MSER 6.4% of the image pairs had no corresponding features. Both the Hessian-Laplace and DoG are blob detectors based on second derivatives. This suggests that blob-like features might be more suitable than point-like features (from the Harris-Laplace) in the context of repeat photography. One should also note that the Hessian-Laplace and DoG found, on average, significantly more features than Harris-Laplace and MSER.

The MSER performed the worst, even though [31] showed its good performance for viewpoint and affine illumination change for both textured and structured planar scenes. This detector is based on the stability of connected components created when incrementing a binary threshold on the images. When viewpoint is the only transformation on a planar scene, the same connected components remain stable because there are no occlusions. However, in the case of complex changes of a scene, the shape and stability of the connected components change drastically, resulting in a low number of feature correspondences.

Comparison with other evaluation studies. The direct comparison of our results to other evaluations is not straightforward, since our evaluation is performed on pairs of images where an uncontrolled amount of physical change has occurred. Mikolajczyk and Schmid [23] evaluate feature descriptors and provide quantitative measures in terms of precision and correct matches. The image transformations are controlled; the most difficult test case was a viewpoint change of 50° on a planar scene. A feature matching test with the SIFT descriptor obtains a 44% precision with 177 correct matches. In comparison, at best only 47% of our dataset passed with a 40% precision requirement. The results in [23] are, in fact, comparable to some of our best performing image pairs. Fig. 6 shows an example of such an image pair, which achieved a 47% precision with 274 correct matches.

The study that is closest in scope to our work is Valgren and Lilienthal [34], who evaluate feature matching using SIFT and SURF on outdoor images with seasonal changes. They report results that are significantly better than ours; they found that SIFT was able to achieve an average precision of 89% with 248 correct matches. This suggests that the specific complexities of historic repeat photography (differences in cameras, unknown calibrations, photo degradation, landscape changes) make feature matching in this context a much more difficult problem.

Application-specific considerations. Our database contains images with a large variability in their appearance. Fig.
6 shows an example of an image pair with numerous
Fig. 6. Image pair that performed relatively well, attaining over 200 matches with a precision of 47% (Hessian-Laplace). The green dots represent correct matches.
Fig. 7. Image pair where only 2 correct matches were found (Hessian-Laplace). Note that the majority of the landscape has changed, leaving very few stable features to be detected and matched. The circles represent the scale of the local features.
correct matches, where most feature detectors performed well. On the other hand, some image pairs contain very few stable features, and thus the ability to find any correct matches at all is an achievement in itself. An example of such a landscape is shown in Fig. 7; only 2 correct matches were found, near the horizon.
5
Conclusion
This study reports on a quantitative evaluation of a set of state-of-the-art feature detectors in the context of repeat photography. Unlike most related work, the proposed study assesses the performance of feature detectors when intra-pair variations are uncontrolled and due to a variety of factors (landscape change, weather conditions, different acquisition sensors). There is no systematic way to model the factors inducing image change. Our study proposes a new evaluation framework and a new measure for performance evaluation (the pass rate), which enables the study of performance at various tolerance levels for three parameters: the distance threshold (in the nearest-neighbor feature matching), the precision tolerance, and the tolerance on the number of correct matches. It was found that the Hessian-Laplace achieved the best performance overall. However, the DoG detector obtained the highest pass rate, albeit only at very low precision values. While the MSER detector has shown good results for viewpoint change on planar scenes in previous studies [25], [31], it performed poorly on the complex scenes in our dataset. Future work will focus on manipulating the parameters of each studied detector in order to maximize the number of detected features. This manipulation needs to be done in a systematic way, by leveraging as much a priori information as possible.
References [1] Levere, D., Yochelson, B., Goldberger, P.: New York Changing: Revisiting Berenice Abbott’s New York. Princeton Architectural Press (2004) [2] McNutty, E.: Boston Then and Now. Thunder Bay Press (1999) [3] Klett, M., Manchester, E., Verburg, J.: Second view: the rephotographic survey project. University of New Mexico Press (Albuquerque) (1984) [4] Fox, W., Klett, M., Banjakian, K., Wolfe, B., Ueshina, T., Marshall, M.: Third View, Second Sights: a Rephotographic Survey of the American West. Musuem of New Mexico Press (2004) [5] Mountain Legacy Project, mountainlegacy.ca [6] MacLaren, I.S., Higgs, E., Zezulka-Mailloux, G.E.M.: Mapper of Mountains: MP Bridgland in the Canadian Rockies 1902-1930, p. 295. University of Alberta Press, Edmonton (2005) [7] Higgs, E.: Nature by design: people, natural process, and ecological restoration. MIT Press, Cambridge (2003) [8] Roush, W.: A substantial upward shift of the alpine treeline ecotone in the southern Canadian Rocky Mountains. MSc Thesis, pp. 1–175 (December 2009) [9] Rhemtulla, J., Hall, R., Higgs, E., Macdonald, S.: Eighty years of change: vegetation in the montane ecoregion of Jasper National Park, Alberta, Canada. Canadian Journal of Forest Research-Revue Canadienne De Recherche Forestiere 32(11), 2010–2021 (2002) [10] Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(5), 530–535 (1997) [11] Wang, W.X., Xu, H.L., Luo, D.J.: Image Auto-registration on Harris-Laplace Features. In: Third International Symposium on Intelligent Information Technology Application, IITA 2009, vol. 2, pp. 559–562 (2009)
A Comparative Evaluation of Feature Detectors on Historic Repeat Photography
713
[12] Bae, S., Agarwala, A.: Computational rephotography. ACM Transactions on Graphics 29(3) (June 2010) [13] Mikolajczyk, K., Matas, J.: Improving Descriptors for Fast Tree Matching by Optimal Linear Projection. In: IEEE Int. Conf. on Computer Vision, ICCV, pp. 1–8 (2007) [14] Kadir, T., Zisserman, A., Brady, M.: An affine invariant salient region detector. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 228–241. Springer, Heidelberg (2004) [15] Mikolajczyk, K., Schmid, C.: Scale & Affine Invariant Interest Point Detectors. International Journal on Computer Vision 60(1) (July 2004) [16] Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, vol. 15, p. 50 (1988) [17] Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) [18] Tuytelaars, T., Gool, L.V.: Matching Widely Separated Views Based on Affine Invariant Regions. International Journal of Computer Vision 59(1) (2004) [19] Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. on Pattern Analysis and Machine Intelligence 13(9) (1991) [20] Belongie, S., Malik, J.: Shape context: A new descriptor for shape matching and object recognition. In: Int. Conf. on Neural Information Processing Systems, NIPS (2000) [21] Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: Int. Conf. Computer Vision and Pattern Recognition, CVPR (2004) [22] Carneiro, G., Jepson, A.D.: Flexible spatial models for grouping local image features. In: Int. Conf. Computer Vision and Pattern Recognition, CVPR (2004) [23] Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. on Pattern analysis and Machine Intelligence 27(10) (2005) [24] Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. International Journal of Computer Vision 37(2) (2000) [25] Fraundorfer, F., Bischof, H.: A novel performance evaluation method of local detectors on non-planar scenes. In: Computer Vision Pattern Recognition Workshop, CVPRW (2005) [26] Gil, A., Mozos, O., Ballesta, M., Reinoso, O.: A comparative evaluation of interest point detectors and local descriptors for visual slam. Machine Vision and Applications 21(6) (2009) [27] Moreels, P., Perona, P.: Evaluation of features detectors and descriptors based on 3D objects. International Journal of Computer Vision 73(3) (2007) [28] Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. In: Int. Conf. Computer Vision and Pattern Recognition (2003) [29] Snavely, N., Seitz, S.: Photo tourism: exploring photo collections in 3D. In: ACM SIGGRAPH (2006) [30] Schindler, G., Dellaert, F., Kang, S.B.: Inferring Temporal Order of Images From 3D Structure. In: Int. Conf. Computer Vision and Pattern Recognition (2007) [31] Mikolajczyk, K., Tuytelaars, T., Schmid, C.: A Comparison of Affine Region Detectors. International Journal of Computer Vision 65(1-2) (2005) [32] Carneiro, G., Jepson, A.D.: Phase-based local features. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 282–296. Springer, Heidelberg (2002) [33] Mikolajczyk, K., Leibe, B.: Local features for object class recognition. In: Int. Conf. on Computer Vision, ICCV (2005)
714
C. Gat et al.
[34] Valgren, C., Lilienthal, A.J.: SIFT, SURF & seasons: Appearance-based long-term localization in outdoor environments. Robotics and Autonomous Systems 58(2) (2010) [35] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006) [36] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. Image and Vision Computing 22(10) (2004) [37] Tuytelaars, T., Mikolajczyk, K.: Local Invariant Feature Detectors: A Survey. Foundations and Trends in Computer Graphics and Vision 3(3) (2007) [38] Haja, A., Jahne, B.: Localization accuracy of region detectors. In: Int. Conf. Computer Vision and Pattern Recognition, CVPR (2008) [39] Lindeberg, T.: Feature detection with automatic scale selection. International Journal of Computer Vision 30(2) (January 1998) [40] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) ISBN: 0521540518
Controllable Simulation of Particle System
Muhammad Rusdi Syamsuddin and Jinwook Kim
Imaging Media Research Center, Korea Institute of Science and Technology
Abstract. We describe a method to control the motion of particles during physically based simulation. Specifically, we design force fields that drive particles so as to shape a human figure. We define attraction, steering, and repulsion force fields to achieve a target shape and avoid obstacles. We use key frames of the skeleton of a human figure as the target shape and distribute attraction points over the target shape as references to attract the particles. Compared to previous control techniques, our method is fast enough to run in real time and shows stable particle behavior without sacrificing plausible simulation. Since our approach is suitable for real-time applications, a user can interact with particles or obstacles in a physically plausible manner during the simulation.
1
Introduction
In recent years, physically based animation has been widely used for natural simulation. By specifying several physical parameters, such as initial positions and velocities, realistic motion can be generated automatically by solving the equations of Newtonian physics. One of the simplest approaches assumes that the objects of interest are passive. In this case, initial conditions are defined and a user executes a forward dynamics simulation to watch what happens. The collapse of a stack of blocks or pouring water into a glass are good examples of passive simulation.

However, feature films and interactive computer games require something more than simply simulating the collapse of thousands of blocks. Directors or game designers may want the blocks to behave according to their intention. Here the blocks become active objects. For example, some of the blocks should be located at specific places after the collapse, or even move in a lively manner. To make the objects of interest active, physically based animation needs to involve some control mechanism. Many researchers have tried to define such a control mechanism. Most of these mechanisms were developed for intelligent objects such as humans and animals; others were developed for unintelligent objects such as fluids (i.e. water [1] [2], smoke [3] [4]) or rigid bodies (i.e. blocks [5] [6] [7], a hat [8]). For intelligent objects, a major problem is to determine how the muscles of a human character or another creature should be actuated to achieve a desired motion [9]. In the case of unintelligent objects, the problem is how to make the controlled objects form a specific shape, character, or motion.

Unfortunately, most of these control mechanisms are not suitable for real-time applications such as interactive games and virtual reality systems. Therefore, in
this paper we focus on a real-time control mechanism, specifically for the active simulation of particle systems, which has not been studied extensively. We introduce the concept of force fields that control the motion of particles to shape a human figure while the figure is changing under external forces. The remainder of this paper is structured as follows. In the next section we discuss relevant previous work. Section 3 describes our main contributions to controlling particle motion. We present experimental results in Section 4. Section 5 concludes the work.
2
Related Work
Research on particle simulation started in 1983 with Reeves [10]. This early work introduced a technique to animate and render irregular objects such as fire with particle systems. Later, several researchers used particle systems to simulate fluids [11] and sand [12]. All of these works focus on simulating passive particles.

The first active simulation of a particle system was introduced by Reynolds in 1987 [13]. He developed a simulation of flocks of bird-like objects called boids. The concept of boids extends Reeves's particle system by replacing dot-like particles with geometric objects that have orientations. Reynolds' flocking mechanism uses three forces that depend on the distance and direction of the nearest boids: separation, alignment, and cohesion. Each of them has a priority, and the first-priority force is evaluated first. If it is non-trivial, then that force is used; otherwise the second-priority force is considered, and so on [14]. Reynolds' concept works well for crowd simulation [15] or static-shape flock simulation [16], but it is difficult to apply within a physically based simulation framework. Therefore, instead of prioritized forces, we apply all forces to all particles and control the magnitudes of the forces as a function of the distance between the particles and the target shape.

Other researchers have used particles to simulate active fluids [1] [17]. Several control techniques have been introduced for fluid objects. Limtrakul et al. [18] divide them into two categories: path-defining control and object-defining control. In path-defining control, a set of control particles is defined to lead its neighboring particles along a user-defined direction. This method is easy to apply in interactive simulation; however, it is difficult to find the positions and the number of control points needed for realistic fluid motion. The other control method is based on the shape of objects. It allows users to provide a target shape, and the fluid then flows automatically to form the user-provided shape. Several techniques, such as user-specified keyframes [19], driving forces [3], the adjoint method [20], potential fields [21], and shape force feedback, have been introduced to make sure that the fluid fills the target shape. However, most of these techniques require time-consuming processes and are therefore not suitable for real-time applications.

In this paper, we use particles to simulate a non-fluid object. We focus on controlling particles as rigid bodies, which has not been studied well yet. For controllable rigid bodies, many control methods work only for a static target shape. In the case of a dynamic target shape formed from rigid bodies, a few
feature films [22] have shown final effects similar to those of our simulation, but those approaches cannot easily be applied to real-time applications.
3
Method
Our control method has two purposes. The first is to distribute particles so that they form a target shape. The second is to make the particles dynamically stable in the target shape and to keep their behavior plausible when the target shape changes or when they are hit by external forces during the simulation. To achieve these purposes, we describe how to distribute the particles and how to design the force fields.
(a) Each group of particles has a one-to-one relationship with a group of attraction points
(b) One particle is assigned to at most three attraction points
Fig. 1. Attraction point assignment
3.1
Particles Distribution
There are several possible ways to distribute particles to form a target shape. One way is to define several points in the target shape that attract the particles [2]. Based on these attraction points, the particles are affected by an attraction force that moves them toward the target points. This force is explained in Section 3.2.

Defining the relationship between particles and attraction points is a tricky problem. We must consider the efficiency of the attraction force computation and the stability of the particles while the target is moving. If one particle is assigned to all attraction points, the computation of the attraction force becomes too expensive; not only does the simulation become slow, the particles may also fail to form the designated shape. On the other hand, if we assign one particle to exactly one attraction point, we may lose the stability of the particles when the target shape moves rapidly. Our solution is therefore to divide particles and attraction points into groups based on the parts, or bones, of a human skeleton (e.g. hips, spine, right shoulder).
One bone represents one attraction point group, which is assigned to a specific group of particles. Suppose there are s sphere particles with radius r and b bones, where the length of a bone is l. For each bone there are a attraction points, where a = l/r. Here, we assume that the minimum total number of particles needed to shape a human figure is \( s = \sum_{i=1}^{b} a_i \). All particles are distributed to the bones based on the value of a. In addition, we can increase a intentionally to emphasize the volumetric shape of a part. We also assign one particle to at most three attraction points that are adjacent to each other in the group, to increase dynamic stability. Figure 1 shows an example of the particle-attraction point assignment.
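A minimal sketch of the distribution scheme is given below. It is not the authors' implementation; the even sampling of attraction points along a bone, the optional extra points, and the helper names are our own assumptions, with bones given as pairs of 3D endpoints.

```python
import math

def assign_attraction_points(bones, radius, extra_per_bone=0):
    """Create one group of attraction points per bone.

    bones: list of (start, end) 3D points for each skeleton bone.
    radius: particle radius r; a bone of length l receives roughly l / r points.
    extra_per_bone: optional additional points to emphasise a volumetric part.
    """
    groups = []
    for start, end in bones:
        length = math.dist(start, end)
        count = max(1, int(length / radius)) + extra_per_bone
        group = []
        for k in range(count):
            t = (k + 0.5) / count          # sample evenly along the bone (assumption)
            group.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
        groups.append(group)
    return groups

def particle_targets(group, particle_index):
    """Assign one particle to at most three adjacent attraction points of its group."""
    i = particle_index % len(group)
    lo, hi = max(0, i - 1), min(len(group), i + 2)
    return group[lo:hi]
```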
3.2
Force Fields Design
We describe several force fields that control particles dynamically. There are three force fields: the attraction force (Fa), the steering force (Fs), and the repulsion force (Fr). Each force field has a different purpose. The attraction force keeps a particle moving toward an attraction point. The steering force creates dynamic motion of a particle while it moves toward an attraction point. The repulsion force steers particles away from obstacles. Let pp and pa be the positions of a particle and an attraction point. Then the resultant force F that controls the particle is
\[ F = F_a(p_p, p_a) + F_s(p_p, p_a) + F_r(p_p) \qquad (1) \]

Attraction Force. In order to keep a particle moving toward pa, the direction of the attraction force Fa is toward pa. To make particles stable when the target shape is achieved, Fa must be high enough near pa. Therefore, we define
\[
F_a(d) =
\begin{cases}
\bigl[\alpha_0 + (\alpha_1 - \alpha_0)\, e^{-(d-\alpha_2)^2 / (2\alpha_3^2)}\bigr]\hat{V}, & d > \alpha_2 \\
\alpha_1 \hat{V}, & 0 < d \le \alpha_2
\end{cases}
\qquad (2)
\]
where {α0, α1, α2, α3} are attraction force parameters, α0 < α1, α3 ≠ 0, and V̂ is a unit vector toward pa. Let T be the time the particle needs to reach pa. Then, for t ∈ [0, T], d and V̂ are given by
\[ d(t) = \lVert p_a - p_p(t) \rVert \qquad (3) \]
\[ \hat{V}(t) = \frac{p_a - p_p(t)}{\lVert p_a - p_p(t) \rVert} \qquad (4) \]
In Equation 2, the minimum value of Fa is controlled by α0 and the maximum value is controlled by α1. α2 and α3 control the range of the maximum Fa around pa and the slope of the curve, respectively. A graph of Fa is shown in Figure 2.
Fig. 2. Graph of an attraction force
Steering Force. If we apply only an attraction force to the particles, the motion would be too flat. In order to produce more stylish motion, we introduce a steering force whose direction is not toward pa and whose magnitude is maximal when Fa becomes minimal. For d ≥ 0, we define
\[ F_s(d) = \bigl[\beta_0 - e^{-\beta_1 d + \beta_2}\bigr]\hat{U} \qquad (5) \]
where {β0, β1, β2} are steering force parameters and Û is a unit vector of arbitrary direction or a rotated vector of V̂. Given {u0, u1, u2}, or a rotation matrix R and scaling parameters {γ0, γ1, γ2}, Û is defined as
\[
\hat{U} =
\begin{cases}
[u_0 \;\, u_1 \;\, u_2]^T \\
\operatorname{diag}(\gamma_0, \gamma_1, \gamma_2)\, R\, \hat{V}
\end{cases}
\qquad (6)
\]
In Equation 5, the maximum of Fs is controlled by β0. The slope of the curve and the range of the maximum Fs are controlled by β1 and β2, respectively. Figure 3 shows a graph of the steering force.

Repulsion Force. The last force field we consider is a repulsion force Fr, which is designed to make a particle avoid obstacles. Suppose that r is the distance between a particle and a repulsion point. Then we apply a push-out force to the particle if the particle is close enough to a repulsion point:
\[
F_r =
\begin{cases}
k\hat{W}, & r < D_r \\
0, & r \ge D_r
\end{cases}
\qquad (7)
\]
where k is a repulsion force parameter, Ŵ is a unit vector from the repulsion point to the particle, and Dr is the active distance of the repulsion force.
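The following Python/NumPy sketch puts Equations (1)-(7) together for a single particle. It is an illustrative implementation rather than the authors' code; in particular, the steering direction Û is passed in directly, and contributions from several repulsion points are summed, which the paper does not specify.

```python
import numpy as np

def resultant_force(p_p, p_a, alpha, beta, U_hat, repulsors, k, D_r):
    """Resultant control force F = Fa + Fs + Fr for one particle (Eqs. 1-7).

    p_p, p_a: particle and attraction-point positions (3-vectors).
    alpha = (a0, a1, a2, a3): attraction parameters, a0 < a1, a3 != 0.
    beta  = (b0, b1, b2): steering parameters; U_hat: unit steering direction.
    repulsors: list of repulsion-point positions; k, D_r: repulsion gain and range.
    """
    a0, a1, a2, a3 = alpha
    b0, b1, b2 = beta

    diff = p_a - p_p
    d = np.linalg.norm(diff)
    V_hat = diff / d if d > 0 else np.zeros(3)

    # Attraction force (Eq. 2): constant a1 near the target, Gaussian falloff to a0.
    if d > a2:
        Fa = (a0 + (a1 - a0) * np.exp(-(d - a2) ** 2 / (2 * a3 ** 2))) * V_hat
    else:
        Fa = a1 * V_hat

    # Steering force (Eq. 5): small near the target, approaching b0 far away.
    Fs = (b0 - np.exp(-b1 * d + b2)) * U_hat

    # Repulsion force (Eq. 7): push away from repulsion points closer than D_r.
    Fr = np.zeros(3)
    for q in repulsors:
        away = p_p - q
        r = np.linalg.norm(away)
        if 0 < r < D_r:
            Fr += k * (away / r)

    return Fa + Fs + Fr
```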
Fig. 3. Graph of a steering force
Fig. 4. Example frames of a target shape acquired from motion capture data
4
Results and Discussion
We have embedded our control method into the framework of a physically based rigid body simulation. We tested it on a Windows PC with an Intel Core i7 processor running at 2.67 GHz and 4 GB of RAM. Our system takes as input a target shape, which can be a static or dynamic object. In our experiment, we use 52 frames of motion capture data as a dynamic target shape. Figure 4 shows several frames from these data. Using this motion capture data, we implemented two scenarios. The first scenario is that particles move from their initial positions to the target positions while
(a) Direction of the steering force pointed 45° downward from the attraction force direction
(b) Direction of the steering force pointed 45° to the right of the attraction force direction
Fig. 5. Particles avoid an obstacle and converge at a target shape
the target shape changes following the captured motion (Figure 5(a), 5(b)). The second scenario is similar to the first one, but there are obstacles and external forces from user input disturbing the particles' motion (Figure 5). In all scenarios we use 1326 particles with an integration time step of 0.017 seconds. The simulation runs in real time at more than 30 frames per second. The overall overhead of computing the proposed force models is observed to be less than 8 percent of the total processing cost.

From our experiments, we found that if α1 in the attraction force is too high, the magnitude of the attraction force also becomes high as a particle approaches the attraction points. This results in oscillatory behavior when a particle reaches the target shape. However, if the value is too small, the particles do not show stable motion on the target shape when the shape changes quickly. One alternative way to reduce the oscillatory, unstable behavior is to make the steering force small near the attraction points. However, if the rotation angle of R becomes large, the particles will rotate around the attraction points. Overall, we can predict the particle motion by controlling the shape of the attraction force magnitude curve or the steering force magnitude curve in Figures 2 and 3. For example, if the attraction force curve is too stiff, the motion of the particles will not be smooth; it changes direction too quickly near the attraction point. To make the motion stable, we have to choose a proper α3 in addition to α1 and α0.
5
Conclusion
We described a technique to control particles to shape a human figure. By defining several force fields and distributing attraction points over the target shape, our method runs in real time with a stable behavior.
We plan to extend our work to provide an intuitive user interface for defining particle motion. Users will design their own particle trajectories, and the system will then optimize the various parameters of the force fields so that the particles move along the given path. Additional, more versatile force fields to make the animation more stylish will also be an interesting research topic.

Acknowledgements. This work was supported in part by the IT R&D program of MKE/MCST/IITA (2008-F-033-02, Development of Real-time Physics Simulation Engine for e-Entertainment) and the Sports Industry R&D program of MCST (Development of VR based Tangible Sports System).
References 1. Th¨ urey, N., Keiser, R., Pauly, M., R¨ ude, U.: Detail-preserving fluid control. In: Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2006, pp. 7–12. Eurographics Association, Aire-la-Ville (2006) 2. Zhang, G., Zhu, D., Qiu, X., Wang, Z.: Skeleton-based control of fluid animation. Vis. Comput. 27, 199–210 (2011) 3. Fattal, R., Lischinski, D.: Target-driven smoke animation. ACM Trans. Graph. 23, 441–448 (2004) 4. Shi, L., Yu, Y.: Taming liquids for rapidly changing targets. In: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2005, pp. 229–236. ACM, New York (2005) 5. Zickler, S., Veloso, M.: Tactics-based behavioural planning for goal-driven rigidbody control. Computer Graphics Forum 28, 2302–2314 (2009) 6. Twigg, C.D., James, D.L.: Many-worlds browsing for control of multibody dynamics. ACM Trans. Graph. 26 (2007) 7. Twigg, C.D., James, D.L.: Backward steps in rigid body simulation. ACM Trans. Graph. 27, 25:1–25:10 (2008) 8. Popovi´c, J., Seitz, S.M., Erdmann, M., Popovi´c, Z., Witkin, A.: Interactive manipulation of rigid body simulations. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2000, pp. 209–217. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA (2000) 9. Park, J., Fussell, D.S., Pandy, M., Browne, J.C.: Realistic animation using musculotendon skeletal dynamics and suboptimalcontrol. Technical report, University of Texas at Austin, Austin, TX, USA (1992) 10. Reeves, W.T.: Particle systems technique for modeling a class of fuzzy objects. ACM Trans. Graph. 2, 91–108 (1983) 11. M¨ uller, M., Charypar, D., Gross, M.: Particle-based fluid simulation for interactive applications. In: Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2003, pp. 154–159. Eurographics Association, Aire-la-Ville (2003) 12. Bell, N., Yu, Y., Mucha, P.J.: Particle-based simulation of granular materials. In: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2005, pp. 77–86. ACM, New York (2005) 13. Reynolds, C.W.: Flocks, herds and schools: A distributed behavioral model. SIGGRAPH Comput. Graph. 21, 25–34 (1987) 14. Reynolds, C.: Steering Behaviors for Autonomous Characters. In: Game Developers Conference (1999)
15. Pelechano, N., Allbeck, J.M., Badler, N.I.: Controlling individual agents in high-density crowd simulation. In: Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2007, pp. 99–108. Eurographics Association, Aire-la-Ville (2007) 16. Xu, J., Jin, X., Yu, Y., Shen, T., Zhou, M.: Shape-constrained flock animation. Comput. Animat. Virtual Worlds 19, 319–330 (2008) 17. Roh, B.-S., Kim, C.-H.: Controllable multi-phase smoke with lagrangian particles. In: Nishita, T., Peng, Q., Seidel, H.-P. (eds.) CGI 2006. LNCS, vol. 4035, pp. 115–123. Springer, Heidelberg (2006) 18. Limtrakul, S., Hantanong, W., Kanongchaiyos, P., Nishita, T.: Reviews on physically based controllable fluid animation. Engineering Journal 14, 41–52 (2010) 19. Treuille, A., McNamara, A., Popovi´c, Z., Stam, J.: Keyframe control of smoke simulations. ACM Trans. Graph. 22, 716–723 (2003) 20. McNamara, A., Treuille, A., Popovi´c, Z., Stam, J.: Fluid control using the adjoint method. ACM Trans. Graph. 23, 449–456 (2004) 21. Hong, J.M., Kim, C.H.: Controlling fluid animation with geometric potential: Research articles. Comput. Animat. Virtual Worlds 15, 147–157 (2004) 22. Ammann, C., Bloom, D., Cohen, J.M., Courte, J., Flores, L., Hasegawa, S., Kalaitzidis, N., Tornberg, T., Treweek, L., Winter, B., Yang, C.: The birth of sandman. In: ACM SIGGRAPH 2007 Sketches, SIGGRAPH 2007, ACM, New York (2007)
3D-City Modeling: A Semi-automatic Framework for Integrating Different Terrain Models
Mattias Roupé and Mikael Johansson
Visualization Technology, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
Abstract. In recent years, many systems have been developed to handle real-time rendering of a 3D-city model and its terrain. These terrains are often constructed using one of the two main methods for representing terrain, i.e. image-based or geometry-based terrains. Both of these methods have their advantages and disadvantages, which are presented in this paper. However, by combining these methods and their advantages, a more efficient modeling tool can be achieved. This paper presents a framework in which these two techniques are linked and integrated through the Graphics Processing Unit (GPU), which results in a more user-friendly and more efficient terrain modeling process. The main objective of our framework is to address the difficulty of integrating a 3D model of a planned building, with its surrounding ground, into a 3D-city model. Furthermore, the framework is also applicable to general 3D models that raise the same issues regarding integration into a terrain model.
1 Introduction

The use of Virtual Reality (VR) in urban planning and building design has been seen by many as holding great potential to increase communication between the different stakeholders in the process. However, integrating such a VR workflow and system into the urban planning process places high demands on tools and methods. These tools and methods have to fit into the existing planning pipeline of the urban planning and building design process. It is therefore important to have a 3D-city model into which newly planned buildings and environments can be integrated in an efficient way. Nowadays, most newly planned buildings are designed in dedicated 3D systems called Building Information Model (BIM) software. The outcome of such a BIM system is a 3D model of the building and its surrounding ground. However, the technical integration of such 3D models into the 3D-city model places high demands on how different terrains are merged together and rendered in real time.

There is, however, no single best solution for how to create, edit, and render the terrain; the specific purpose of an application determines which technical solutions are implemented. There are two major techniques for representing terrain, i.e. image-based and geometry-based models. The image-based approach is often called a height-map, Digital Elevation Map (DEM), or height field, and is the most common approximation of terrain meshes. Height-maps are raster images that represent 3D data as a regular grid over the terrain surface. The pixels in the height-map represent the displacement of the corresponding mesh coordinate. The height-map
only stores the displacement, i.e. the z-coordinate, which is the height of the terrain. The x- and y-coordinates of the mesh correspond to longitude and latitude, respectively, given by the image's regular grid of pixels. By using this technique on the GPU, the mesh can be stored in one third of its original size. This is accomplished by storing only the z-values and using the corresponding offsets for the different grid cells on the GPU.

The main drawback of the height-map is its fixed-size property, which over-represents flat areas and under-represents varied terrain. This is mainly because grids cannot adapt to the variation of the terrain, due to their uniform nature. Consequently, a flat land surface gets the same number of data points as a mountain. A height-map cannot represent terrain features such as caves and overhangs, and it has difficulties with areas with significant elevation changes, such as cliffs with sharp edges. When height-maps are used in 3D-city models, the demand for sharp edges has to be considered, or errors will occur where the terrain and buildings interact with each other. This places higher demands on the resolution of the height-maps, and consequently leads to performance issues and larger data sizes. The advantage of the height-map technique is that grid data is relatively easy to develop algorithms for, since grids have a fixed resolution and can therefore easily be stored in an indexed data structure. It is also possible to create interactive editing tools for this type of terrain [1, 2, 3]. Another advantage of height-map terrain is that it is continuous and leaves no gaps in the terrain that have to be modified. However, when different height-maps are used in a patch-based system, gaps can occur between the different patches of the terrain.

The geometry-based approach for representing a terrain surface is the triangle model, also called a Triangulated Irregular Network (TIN). A TIN is a triangle mesh based on non-uniformly spaced vertices. The advantage of a TIN is that regions with little variation in surface height can be generated with more widely spaced points, whereas in areas of more intense height variation the point density can be increased. A TIN-represented terrain gives the best approximation of a surface within a predefined triangle budget. TINs can also represent sharp edges and boundaries in the terrain better and are more flexible in this respect than height-maps. However, it is much more complicated to implement queries and algorithms on a TIN mesh. Working with TINs often places high demands on the end user, and editing such a surface is a very time-consuming process, because it is based on vertices and triangles that are not continuous in the same manner as the height-map approach.

In this paper we present a framework that facilitates the use of both the TIN and the height-map approach for editing and rendering the terrain in a 3D-city model. The main objective is to combine the two techniques and their respective benefits. The framework addresses the issue of adding and integrating new terrain features into existing terrain. A common problem in this regard arises when 3D buildings (Building Information Models, BIM) and their surrounding terrain are integrated into the 3D-city model, see Figure 1. Integrating such a model is a very time-consuming process. However, our approach provides an almost automatic way of joining these types of models.
The technique takes advantage of image-based modeling to construct a continuous terrain surface without gaps or cracks. It is also possible to use dummy geometry in the projection of the terrain, which provides behavior similar to Constructive Solid Geometry [4] and makes it possible to perform Boolean operations on the terrain in 2.5D. The terrain-editing framework is to some extent implemented on the GPU, and it therefore achieves real-time rendering and modification of the terrain.
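As background for the image-based representation discussed in the introduction, the following sketch (our own illustration, not part of the paper) shows how a height-map that stores only z-values can be expanded to world-space vertices and sampled at arbitrary positions; the grid origin and spacing are assumed parameters, and query points are assumed to lie inside the grid.

```python
import numpy as np

def heightmap_to_world(heightmap, origin, spacing):
    """Expand a height-map into world-space vertices.

    heightmap: 2D array of z-values (the only data actually stored).
    origin: (x0, y0) of the grid; spacing: distance between neighbouring samples.
    x and y are recovered from the column/row indices, mirroring the offset
    trick used on the GPU, so only one third of the vertex data is kept.
    """
    rows, cols = heightmap.shape
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    x = origin[0] + xs * spacing
    y = origin[1] + ys * spacing
    return np.dstack([x, y, heightmap])

def sample_height(heightmap, origin, spacing, x, y):
    """Bilinearly interpolate the terrain height at an arbitrary (x, y)."""
    u = (x - origin[0]) / spacing
    v = (y - origin[1]) / spacing
    i0 = int(np.clip(np.floor(v), 0, heightmap.shape[0] - 1))
    j0 = int(np.clip(np.floor(u), 0, heightmap.shape[1] - 1))
    i1 = min(i0 + 1, heightmap.shape[0] - 1)
    j1 = min(j0 + 1, heightmap.shape[1] - 1)
    fv, fu = v - i0, u - j0
    top = (1 - fu) * heightmap[i0, j0] + fu * heightmap[i0, j1]
    bot = (1 - fu) * heightmap[i1, j0] + fu * heightmap[i1, j1]
    return (1 - fv) * top + fv * bot
```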
Fig. 1. The issue of integrating new buildings (Building Information Models, BIM) into the 3D-city model: the old 3D-city terrain cuts through, or is drawn above, the new BIM ground.
The main contributions of this paper are: (1) to increase designers' productivity by addressing the time-consuming process of joining different terrain models, such as an existing 3D-city model and a 3D model of a building; and (2) to give designers more control over the modeling process by combining geometry-based modeling with image-based modeling.

The rest of the paper is organized as follows. The next sub-section presents related work on terrain rendering and modeling. Section 2 presents our approach to semi-automatically joining different terrain models, followed by our implementation, an example of applying our method in a real context in a 3D-city model, and a description of finalizing the terrain into a TIN. Section 3 includes the conclusions, limitations, and future work of this paper.

1.1 Related Work

The problems associated with large terrains are that the datasets are often complex and too large to fit into the computer's RAM or the graphics card's memory, and therefore have to be approximations of the real terrain. The difficulty associated with real-time rendering of large terrains is therefore to handle this large amount of data [5, 6, 7]. Such an application has to simplify the geometry and use a paging or streaming system for the terrain geometry. The most common technique is to simplify the terrain and to create multiple resolutions of textures and geometries. By arranging them in a hierarchical structure and incorporating them into a Level-of-Detail (LOD) system, real-time rendering can be achieved [5, 6, 7]. This LOD system is view-dependent and determines which resolution of the terrain should be shown. If the terrain is too large to fit into memory, a paging or streaming system has to be combined with the LOD system [8, 9]. The paging or streaming system handles the allocation and freeing of
memory for the different LODs, depending on which LODs are currently shown. The most common approach for such a system is to use a height-map for the terrain, which gives a more stable frame rate because of the uniform data structure. Another advantage of height-mapped terrain is that it is possible to edit the terrain interactively by painting in the height-map with a 3D brush [1, 2, 3]. This type of interactive interface provides designers with a faster and more intuitive way of sketching a 3D model. The drawback of this type of method is that it offers little control over the outcome. Another drawback is that its workflow falls short of that of manual modeling systems.

In recent years, much research has focused on visualizing large landscapes where GIS and cartographic data have been projected onto the terrain [10, 11]. Bruneton et al. [10] present an approach that adapts the DEM to vector features, such as roads and rivers. The most promising published work on landscape visualization is Vaaraniemi et al. [11], who use a multi-layered interface for different land features that are sequenced after each other and rendered into the height-map. Santos et al. [12] used multi-layered height-maps from different viewpoints to construct 3D geometry. The result of this approach was behavior similar to Constructive Solid Geometry, which makes it possible to do Boolean operations with the multi-layered height-maps. This type of representation is also suitable for modeling arbitrary solid geometry, such as scanned models. However, most of these methods are aimed at virtual landscapes that are viewed from above and are not as detailed as urban environments.

The drawback of the height-map rendering approach is that it is more suitable for large-scale scenes where the viewer is not close to the ground. In the context of 3D-city models used in urban planning, however, it is vital to see the model from a pedestrian point of view. In these cases the sharp boundaries in the urban environment have to be addressed, and therefore a very detailed height-map has to be used. In this context the TIN approach is very attractive, since it can contain the sharp boundaries of the urban environment and is more suitable for that reason. However, working with TINs places high demands on the end users, who have to solve conflicts possibly emerging from interactions between different terrain features. The TIN approach therefore increases the cost of manual content creation.
2 Our Approach

Our aim with the terrain-editing framework presented in this paper has been to semi-automatically solve conflicts emerging from interactions between the terrain and other features, such as newly added terrain or buildings. The framework supports joining two or more terrain models into one integrated terrain model. It also handles the common problem of integrating new buildings and their ground into the terrain of a 3D-city model. These types of problems are complex and not intuitive to solve using manual modeling systems. Furthermore, because manual modeling systems are complex and require extensive 3D modeling experience, they place high demands on the end user. Stitching these types of models together manually is also often a very time-consuming process that requires enormous effort.

Our terrain-editing framework is integrated into a Virtual Reality (VR) application that can visualize and edit 3D-city models. The
application, MrViz, has a patch- or tile-based system that handles paging of the 3D-city model. The system is arranged in a hierarchical structure with multiple resolutions of textures and geometries for the buildings and the terrain. Cutting the model into smaller sub-patches makes it possible to handle the 3D-city model interactively through the paging system. The paging system is used in combination with the level-of-detail system, which makes it possible to render huge 3D-city models that exceed the graphics memory. Each sub-patch in the paging system contains façade and terrain textures that are small enough to be displayable by the graphics hardware. The editing of the 3D-city model is done in an edit mode, in which a high-resolution sub-part of the 3D-city model is loaded through an aerial palette. The high-resolution sub-part model is then small enough to fit into memory and be displayed by the graphics hardware. Another advantage of this approach is that we have implemented a sub-version control system that tracks changes made by different users, which makes it possible for different users to work on and collaborate around the same 3D-city model [13]. Our strategy has been to implement the terrain-editing framework in the editing mode of the application. Our objectives have been to:

• Find a semi-automatic way of joining different terrain models, such as an existing 3D-city model with a 3D model of a planned building and its surroundings, see figure 1.
• Combine image-based and geometry-based modeling to give the 3D artist more control.
• Facilitate export of TIN models to other software.

2.1 Semi-automatic Joining of Different Terrain Models

By using a multi-layer approach, an almost automatic interface for joining different terrains can be achieved. The multi-layer approach sequences the terrain models after each other and uses the GPU to render them into the depth-buffer. The result is a height-map that represents the joint terrain. If the newly added terrain is detailed, as in figure 1, with pavements and curbs, it is also possible to project the 3D-city model towards it and use this geometry as the representation of the terrain in these areas. It is furthermore possible to render temporary geometries into height-maps and use this temporary 3D geometry to modify the height-map. If the height-maps for the different temporary geometries are sequenced after each other, a behavior similar to Constructive Solid Geometry is achieved, which makes it possible to perform Boolean operations on the terrain in 2.5D (a minimal sketch of this idea is given at the end of this sub-section). The limitation of this approach is the pixel resolution, which can cause artifacts such as jagged edges if the resolution is not adequate. The advantage of this method is that it is easy to implement and does not need solid or "water-tight" primitive shapes or geometries, which is the main problem with Constructive Solid Geometry implementations. The method is an efficient tool for modeling and fixing artifacts that can appear when the two terrains are merged. By combining image-based modeling with geometry-based modeling, it is possible to give the 3D artist more control. The most common problem in this context is when the different models have a gap and the 3D artist has to fill or stitch the gap with a slope of some sort.
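The following is a minimal CPU-side sketch of the layer-compositing idea, not the framework's GPU implementation. The types HeightMap and Layer, the function compositeLayers, the "no data" sentinel, and the three operations (replace, raise, carve) are our own assumptions about how sequenced layers could interact; the real system obtains the equivalent effect by rendering the sequenced models into the depth-buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU-side illustration of compositing sequenced terrain layers
// into one joint height-map (2.5D "CSG-like" operations).
struct HeightMap {
    int width = 0, height = 0;
    std::vector<float> z;                                     // row-major heights (meters)
    float at(int x, int y) const { return z[static_cast<std::size_t>(y) * width + x]; }
};

enum class LayerOp { Replace, Raise, Carve };                 // overwrite, union-like, subtraction-like

struct Layer {
    const HeightMap* map;                                     // terrain model or temporary geometry
    LayerOp op;
    float noData;                                             // sentinel where the layer has no coverage
};

HeightMap compositeLayers(int w, int h, const std::vector<Layer>& layers)
{
    HeightMap out;
    out.width = w;
    out.height = h;
    out.z.assign(static_cast<std::size_t>(w) * h, -std::numeric_limits<float>::infinity());

    for (const Layer& layer : layers)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                const float v = layer.map->at(x, y);
                if (v == layer.noData) continue;              // layer does not cover this pixel
                float& dst = out.z[static_cast<std::size_t>(y) * w + x];
                switch (layer.op) {
                    case LayerOp::Replace: dst = v; break;                  // later layer wins
                    case LayerOp::Raise:   dst = std::max(dst, v); break;   // keep the higher surface
                    case LayerOp::Carve:   dst = std::min(dst, v); break;   // keep the lower surface
                }
            }
    return out;
}
```

Because each layer only needs a height value per pixel, no "water-tight" solids are required, which is the simplification the pixel-based approach buys compared with classical Constructive Solid Geometry.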
2.2 Our Implementation

As mentioned earlier, our method utilizes the depth-buffer on the GPU to convert the TIN of a sub-patch terrain into a height-map. Figure 2 shows a simplified view of the presented terrain-editing framework.
Fig. 2. Our approach utilizes the depth-buffer on the GPU to convert the TIN of a sub-patch terrain into a height-map
Our implementation starts with the user selecting which terrain sub-patches are going to be editable. The next step for the user is to import the 3D model that should be integrated into the 3D-city model. The 3D-city model uses a coordinate system defined by the City Authority of Gothenburg, in which the positive x-direction represents true north. This coordinate system is used in almost all urban planning projects in Gothenburg, and the 3D model to be imported is often already defined in the same coordinate system. If that is not the case, the user can define the placement and rotation either interactively using the mouse or by entering translation and rotation data numerically. Using the bounding box of the terrain, an ortho-camera looking down on the terrain from above is set up. By finding the maximum and minimum height of the terrain, the height range can be calculated and used later when calibrating and matching the height-map terrain to the initial TIN terrain. The next step creates and renders the terrain into a 16-bit depth-buffer; a 16-bit buffer gives better precision when the value range is from 0 to 1. The height-map values are then converted and calibrated towards the initial TIN terrain height range (float z = (*HeightMapValue) / heightRange;). The 16 bits give an approximation error that depends on the terrain height range: a height range of 100 meters gives a height error of about 1.5 mm, which is adequate for 3D-city models, in which the measurement error of the initial terrain is often larger. In the context of visualization this amount of error is not noticeable. The x and y coordinates of the height-map image are also converted by calculating the scale of the pixels, so that the corresponding x and y location of each pixel is known. The next step is to create the terrain shader and load it onto the GPU for rendering. The GPU implementation was done in OpenGL and GLSL using vertex and pixel shaders.
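The readback and calibration step can be illustrated with a small, self-contained sketch. This is not the authors' code: it assumes the 16-bit depth image is read back as normalized values, that the ortho-camera's near and far planes are placed at the terrain's maximum and minimum heights (so normalized depth maps linearly back onto the height range), and that pixels map linearly to x and y; the names TerrainBounds, HeightSample and depthToHeights are ours.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hedged sketch: convert a 16-bit depth image, rendered by an ortho-camera
// looking straight down on the terrain, back into world-space height samples.
// Assumption (not from the paper): depth 0 corresponds to maxHeight (near plane)
// and depth 65535 to minHeight (far plane).
struct TerrainBounds {
    float minX, minY, minHeight;
    float maxX, maxY, maxHeight;
};

struct HeightSample { float x, y, z; };

std::vector<HeightSample> depthToHeights(const std::vector<uint16_t>& depth,
                                         int width, int height,
                                         const TerrainBounds& b)
{
    const float heightRange = b.maxHeight - b.minHeight;  // e.g. 100 m -> ~1.5 mm steps at 16 bits
    const float scaleX = (b.maxX - b.minX) / width;        // meters per pixel in x
    const float scaleY = (b.maxY - b.minY) / height;       // meters per pixel in y

    std::vector<HeightSample> samples;
    samples.reserve(depth.size());
    for (int py = 0; py < height; ++py) {
        for (int px = 0; px < width; ++px) {
            const float d = depth[static_cast<std::size_t>(py) * width + px] / 65535.0f; // [0, 1]
            HeightSample s;
            s.x = b.minX + (px + 0.5f) * scaleX;            // pixel center in world coordinates
            s.y = b.minY + (py + 0.5f) * scaleY;
            s.z = b.maxHeight - d * heightRange;            // near plane assumed at the maximum height
            samples.push_back(s);
        }
    }
    return samples;
}
```

With 16-bit quantization, a 100 m height range yields steps of roughly 100 / 65536 ≈ 1.5 mm, matching the error bound quoted above.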
The result of this method is an almost automatic joining of different terrain models. Another advantage of using the height-map is that it is possible to edit the terrain by painting with 3D brushes [1, 2, 3]; that part of our implementation is not described in this paper.

2.3 Applying Our Method in a Real Context in a 3D-City Model

The integration of 3D models containing ground with existing terrains is usually a very time-consuming process. The new ground has to be stitched together with the existing TIN terrain of the 3D-city model. Our framework uses the depth-buffer to render the new ground into the existing height-map of the 3D-city model, see figure 3.
Fig. 3. The figure shows how our framework uses the depth-buffer to render the new ground into the existing height-map. The result is a terrain merged from the two models.
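Purely as a toy illustration, and reusing the hypothetical compositeLayers sketch from Section 2.1 rather than the framework's actual API, the merge in figure 3 can be thought of as compositing two layers: the existing city height-map as the base, with the new ground written on top of it wherever it has coverage.

```cpp
// Hypothetical usage of the compositeLayers sketch from Section 2.1.
// loadHeightMapFromCityTIN and renderImportedModelToHeightMap are assumed
// helpers, not part of any real API.
HeightMap cityTerrain = loadHeightMapFromCityTIN();
HeightMap newGround   = renderImportedModelToHeightMap();

std::vector<Layer> layers = {
    { &cityTerrain, LayerOp::Replace, /*noData=*/-9999.0f }, // base terrain everywhere
    { &newGround,   LayerOp::Replace, /*noData=*/-9999.0f }  // new ground wins inside its footprint
};
HeightMap merged = compositeLayers(cityTerrain.width, cityTerrain.height, layers);
```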
The advantage of the height-map is that the terrain is continuous, with no gaps, which is a major problem when modeling with TINs. The limitation of this approach is that the resolution of the height-map sets the resolution of the terrain, which can give artifacts such as jagged edges. The size of a terrain patch in our 3D-city model is 200x200 meters, which gives a resolution of about 0.1 meters when the height-map is 2048 pixels in height and width. This issue can, however, be solved by using a higher resolution for the height-mapped terrain in such situations. Our system also supports cutting off sub-regions of TINs, which can be used for areas that have to be more detailed or when two detailed models are merged. It is also possible to project the 3D-city terrain against temporary geometry, as described in section 2.1, which gives an interface that uses both image-based and geometry-based modeling. Using 3D geometry to edit the image-based terrain gives more control in the modeling process and can be used when the boundary edges of the two
terrains do not match and have to be stitched together, see figure 4. The result is an efficient and useful interface where the user only has to create polygons that intersect the two surfaces, instead of finding and matching each corresponding vertex on the boundary, which would be the required approach in TIN modeling if artifacts such as gaps and cracks are to be avoided.
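A minimal sketch of how such a temporary stitching polygon could be written into the height-map is given below. It is not the paper's GPU code: the framework obtains this effect by rendering the geometry into the depth-buffer, whereas here one triangle is rasterized on the CPU with barycentric interpolation. Square pixels, a grid origin at (0, 0) and the name rasterizeTriangle are our assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Write interpolated heights of one triangle of a temporary "stitch" polygon
// into the height-map (CPU illustration of rendering it into the depth-buffer).
void rasterizeTriangle(std::vector<float>& heightMap, int w, int h,
                       float metersPerPixel, const Vec3& a, const Vec3& b, const Vec3& c)
{
    auto toPx = [&](float m) { return m / metersPerPixel; };
    const float ax = toPx(a.x), ay = toPx(a.y);
    const float bx = toPx(b.x), by = toPx(b.y);
    const float cx = toPx(c.x), cy = toPx(c.y);

    const int minX = std::max(0, static_cast<int>(std::floor(std::min({ax, bx, cx}))));
    const int maxX = std::min(w - 1, static_cast<int>(std::ceil(std::max({ax, bx, cx}))));
    const int minY = std::max(0, static_cast<int>(std::floor(std::min({ay, by, cy}))));
    const int maxY = std::min(h - 1, static_cast<int>(std::ceil(std::max({ay, by, cy}))));

    const float denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy);
    if (std::fabs(denom) < 1e-6f) return;                    // degenerate triangle

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            // Barycentric coordinates of the pixel center.
            const float px = x + 0.5f, py = y + 0.5f;
            const float w0 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / denom;
            const float w1 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / denom;
            const float w2 = 1.0f - w0 - w1;
            if (w0 < 0.f || w1 < 0.f || w2 < 0.f) continue;  // pixel lies outside the triangle
            heightMap[static_cast<std::size_t>(y) * w + x] = w0 * a.z + w1 * b.z + w2 * c.z;
        }
    }
}
```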
Fig. 4. Image 1: the initial two terrains that are going to be merged. Image 2: the result of the automatic merging of the two terrains. Image 3: our framework supports temporary geometry that can be rendered into the height-map and used to stitch or merge the two terrain surfaces together. Image 4: the final result of merging the two terrains.
The project presented in figure 4 is one of the student projects from an urban planning course. The students designed the buildings and their surrounding ground in a BIM system. The final task was to integrate the BIM model into the 3D-city model, in which the existing surrounding environment was represented. The students did not have the semi-automatic framework/tool presented in this paper and spent 6-8 hours (depending on the design of the BIM models) integrating the two terrains. Understanding the 3D terrains and how they interacted with each other was a cognitively demanding task. The edges and boundaries of these terrains also had to be stitched together with the existing terrain from the 3D-city model, which was time-consuming. We tested the same task with our semi-automatic framework/tool, and the entire integration of the two terrains took less than half an hour. For the final touch we used the temporary-geometries approach presented in this paper to stitch the two terrains together.

2.4 Finalizing the Terrain into a TIN

The finalization stage in our framework converts the height-mapped terrain into a triangle mesh, also called a TIN. The reason for doing this is to exploit the advantages of the TIN approach: regions with little variation in surface height can be generated with more widely spaced points and are therefore more compact. TINs can also represent sharp edges and boundaries in the terrain better, and are more flexible in this respect, than height-maps. Using height-maps throughout a 3D-city model would place too high demands on the resolution of the terrain, which is problematic when it comes to
real-time rendering in a VR application. Having the terrain as TINs also gives the flexibility to export the terrain geometry to other 3D modeling software. This is the main reason why we chose to keep our terrain patches as TINs in our system. The TIN construction is done by extracting the regular triangle grid into sub-patches and simplifying the grid using the edge-collapse algorithm [14]. The sub-patches are used to overcome memory issues and to make it possible to thread the simplification stage, thereby achieving a faster simplification process. The resulting TIN from the edge-collapse algorithm is good enough, but an algorithm that detects and handles edges and boundaries from the original 3D model would be a definitive solution. However, we have not found such an algorithm.
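As an illustration of the first step of this conversion, the sketch below builds the regular triangle grid (two triangles per height-map cell) for one sub-patch; the subsequent simplification relies on quadric-error edge collapse [14] and is not reproduced here. The Mesh type and the gridToMesh function are hypothetical names, not the system's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch: turn a height-map sub-patch into an indexed triangle mesh,
// the regular grid that the edge-collapse simplification [14] is then applied to.
struct Mesh {
    std::vector<float>    positions;  // x, y, z triples
    std::vector<uint32_t> indices;    // three indices per triangle
};

Mesh gridToMesh(const std::vector<float>& heights, int w, int h, float metersPerPixel)
{
    Mesh m;
    m.positions.reserve(static_cast<std::size_t>(w) * h * 3);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            m.positions.push_back(x * metersPerPixel);
            m.positions.push_back(y * metersPerPixel);
            m.positions.push_back(heights[static_cast<std::size_t>(y) * w + x]);
        }

    m.indices.reserve(static_cast<std::size_t>(w - 1) * (h - 1) * 6);
    for (int y = 0; y + 1 < h; ++y)
        for (int x = 0; x + 1 < w; ++x) {
            const uint32_t i0 = static_cast<uint32_t>(y * w + x);
            const uint32_t i1 = i0 + 1;
            const uint32_t i2 = static_cast<uint32_t>((y + 1) * w + x);
            const uint32_t i3 = i2 + 1;
            m.indices.insert(m.indices.end(), { i0, i2, i1 });  // lower-left triangle of the cell
            m.indices.insert(m.indices.end(), { i1, i2, i3 });  // upper-right triangle of the cell
        }
    return m;
}
```

Splitting the grid into sub-patches keeps each call small enough to fit in memory and lets several sub-patches be simplified in parallel threads, as described above.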
3 Conclusions

This paper has presented a framework that semi-automatically solves conflicts that emerge when different terrain features and terrains are merged. The framework supports merging two or more terrain models into one integrated terrain model. This is done by combining image-based and geometry-based terrains and the related modeling techniques. The advantage of the image-based approach, i.e. height-map terrain, is that the terrain is continuous and has no gaps that have to be fixed manually. Another advantage is that grid data is relatively easy to write algorithms for, since the grid has a fixed resolution and can therefore be stored in a simple indexed data structure. The framework also supports the use of different temporary geometries that can be sequenced after each other and rendered into a height-map. This approach behaves similarly to the Constructive Solid Geometry technique, which makes it possible to perform Boolean operations on the terrain in 2.5D. The limitation of this approach is that the resolution of the height-map sets the resolution of the terrain, which is pixel based and can give artifacts. However, these artifacts can be overcome by cutting the terrain into sub-patches that can support higher terrain resolutions. The modified terrain patches are later finalized into TINs, which represent the terrain more compactly. TINs can also represent sharp edges and boundaries in the terrain better than the height-map approach does; detailed areas represented with height-maps would not be manageable in real time in a 3D-city model because of the huge amount of data. TINs furthermore make it possible to export the model to other modeling software, and they therefore fit well into our 3D-city modeling system. We argue that TIN models are better for representing detailed areas in a 3D-city model, but the creation of such models requires enormous manual effort. Integrating both image-based and geometry-based manual editing operations therefore seems to be a very promising and powerful direction, combining the best of both worlds. Future work will therefore aim at finding a better algorithm that converts the height-map into a TIN by detecting and handling edges and boundaries from the original 3D model. The benefit of our approach in terms of productivity is that it almost automatically solves conflicts that emerge when different terrain features and terrains are merged, see figures 1, 3 and 4. Our approach was compared with the manual
modeling approach used in an urban planning course. The result indicated that our approach was about 12-16 times faster. In conclusion, we argue that the presented semi-automatic approach improves productivity and solves conflicts that emerge when different terrain features and terrains are joined together.
References
1. Atlan, S., Garland, M.: Interactive multiresolution editing and display of large terrains. Computer Graphics Forum 25(2), 211–223 (2006)
2. Schneider, J., Boldte, T., Westermann, R.: Real-time editing, synthesis, and rendering of infinite landscapes on GPUs. In: Conference on Vision, Modeling, and Visualization (VMV 2006), pp. 153–160 (2006)
3. de Carpentier, G.J.P., Bidarra, R.: Interactive GPU-based procedural heightfield brushes. In: Proceedings of the 4th International Conference on Foundations of Digital Games (FDG 2009), pp. 55–62. ACM, New York (2009)
4. Requicha, A.A.G., Voelcker, H.B.: Constructive Solid Geometry. Production Automation Project Technical Memorandum TM-25 (1980)
5. Döllner, J., Baumann, K., Hinrichs, K.: Texturing techniques for terrain visualization. In: Proc. of the 11th Ann. IEEE Visualization Conference (VIS 2000), pp. 227–234 (2000)
6. Cignoni, P., Ganovelli, F., Gobbetti, E., Marton, F., Ponchio, F., Scopigno, R.: BDAM — Batched Dynamic Adaptive Meshes for high performance terrain visualization. Computer Graphics Forum 22, 505–514 (2003)
7. Hua, W., Zhang, H., Lu, Y., Bao, H., Peng, Q.: Huge texture mapping for real-time visualization of large-scale terrain. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology (2004)
8. Lindstrom, P., Pascucci, V.: Visualization of large terrains made easy. In: Proceedings of the Conference on Visualization 2001, pp. 363–371. IEEE Computer Society, Washington (2001)
9. Amara, Y., Marsault, X.: A GPU Tile-Load-Map architecture for terrain rendering: theory and applications. The Visual Computer 25 (2009)
10. Bruneton, E., Neyret, F.: Real-time rendering and editing of vector-based terrains. Computer Graphics Forum, Special Issue: Eurographics 2008, vol. 27, pp. 311–320 (2008)
11. Vaaraniemi, M., Treib, M., Westermann, R.: High-Quality Cartographic Roads on High-Resolution DEMs. Journal of WSCG 19(1-3) (2011)
12. Santos, P., De Toledo, R., Gattass, M.: Solid height-map sets: modeling and visualization. In: SPM, ACM Solid and Physical Modeling Symposium 2008, Stony Brook, New York (2008)
13. Roupé, M., Johansson, M.: Supporting 3D City Modelling, Collaboration and Maintenance through an Open-Source Revision Control System. In: CAADRIA 2010 New Frontiers Conference, pp. 347–356 (2010)
14. Garland, M., Heckbert, P.S.: Surface Simplification Using Quadric Error Metrics. In: Proc. SIGGRAPH 1997, pp. 209–216 (1997)
Author Index
Aarabi, Parham I-768 Abdelrahman, Mostafa II-607 Abdul-Massih, Michel II-627 Abidi, Mongi A. I-291 Abushakra, Ahmad II-310 Akimaliev, Marlen II-588 Ali, Asem II-607 Allen, James II-148 Ambrosch, Kristian I-168 Andrysco, Nathan II-239 Arigela, Saibabu II-75 Asari, Vijayan K. II-75, II-428 Attakitmongcol, K. II-436 Ayala, Orlando II-669
Babu, G.R. II-526 Bagley, B. I-461 Bai, Li I-738 Baltzakis, H. II-104 Barman, S.A. I-410 Batryn, J. I-461 Bauer, Christian I-214 Bebis, George II-516 Beer, Thomas II-681 Beichel, Reinhard I-214 Belhadj, Ziad I-236 Beneš, Bedřich II-239, II-627 Ben-Shahar, Ohad I-180 Berezowski, John I-508 Berry, David M. I-653 Besbes, Olfa I-236 Bezawada Raghupathy, Phanidhar II-180 Bimber, Oliver I-54, I-66 Birklbauer, Clemens I-66 Blasch, Erik I-738 Borst, Christoph W. II-45, II-180 Bosaghzadeh, A. II-545 Bottleson, Jeremy I-530 Boujemaa, Nozha I-236 Bradley, Elizabeth I-619 Branzan Albu, Alexandra II-259, II-701 Bryden, Aaron I-518
Bui, Alex I-1 Burch, Michael I-301, I-641 Burlick, Matt I-718 Camponez, Marcelo O. II-338 Camps, Octavia I-347 Cance, William II-55 Cerutti, Guillaume I-202 Cham, Tat-Jen I-78 Chan, Kwok-Ping I-596 Chau, Dennis II-13 Chavez, Aaron II-358 Cheesman, Tom I-653 Chelberg, David II-219, II-637 Chen, Genshe I-738 Chen, George I-551 Chen, Guang-Peng II-328 Chen, Jia II-408 Chen, Jianwen I-1 Chen, Xiankai I-551 Chen, Xiao I-431 Chen, Yang II-126, II-536 Chen, Yingju II-310 Cheng, Erkang II-486 Cheng, Irene I-508 Cheng, Shinko Y. II-126, II-536 Cheng, Ting-Wei II-190 Chien, Aichi I-392 Cho, Jason I-699 Cho, Sang-Hyun I-748 Cho, Woon I-291 Choe, Yoonsuck I-371, I-400 Choi, Byung-Uk II-578 Clark, C.M. I-461 Coming, Daniel S. II-33 Cong, Jason I-1 Coquin, Didier I-202 Cordes, Kai I-156 Danch, Daniel I-54 da Silva dos Santos, Carlos II-659 Demirci, M. Fatih II-588 Deng, Fuqin II-408 Denker, K. II-158
Doretto, Gianfranco I-573 Dornaika, F. II-545 du Buf, J.M. Hans II-136 Duchaineau, Mark A. I-359 Ducrest, David L. II-45 Ehrmann, Alison I-653 Elgammal, Ahmed I-246 Elhabian, Shireen II-607 Fang, Hui I-102 Farag, Aly A. II-607 Febretti, Alessandro II-13 Fehr, Janis I-90, I-758 Feng, Zhan-Shen II-398 Fiorio, Christophe II-377 Forney, C. I-461 Forrester, J. I-461 Franz, M. II-158 Fraz, M.M. I-410 Fröhlich, Bernd I-269 Fukuda, Hisato II-116 Fukui, Kazuhiro II-555 Gambin, T. I-461 Gao, Yang II-328 Garbe, Christoph S. I-337, I-758 Garbereder, Gerrit II-681 García, Edwin R. II-627 Garg, Supriya I-629 Gat, Christopher II-701 Gaura, Jan II-567 Gdawiec, Krzysztof II-691 Geng, Zhao I-653 German, Daniel II-701 Getreuer, Pascal I-686 Ghandi, Nikhil II-219 Gibson, Christopher J. I-441 Gleicher, Michael I-518 Godin, Guy I-325 Gong, Yi I-281 Gonzalez, A. I-461 Goodman, Dean II-229 Gottfried, Jens-Malte I-758 Grammenos, D. II-104 Griguer, Yoram I-518 Grosse, Max I-54, I-66 Gruchalla, Kenny I-619 Grundhöfer, Anselm I-54, I-66
Gschwandtner, Michael II-199 Gurney, Kevin Robert II-239 Gustafson, David II-358 Gutierrez, Marco Antonio II-659 Hamann, Bernd I-530 Hammal, Zakia I-586 Harris Jr., Frederick C. II-33 Hart, John C. I-102, I-699, II-85 Hatori, Yoshinori II-348 Haxhimusa, Yll II-280 Heidari, Amin I-768 Heinrich, Julian I-641 Heinrichs, Richard I-347 Hensler, J. II-158 Herout, Adam I-421 Hess-Flores, Mauricio I-359 Higgs, Eric II-701 Hirata Jr., Roberto II-659 Hoeppner, Daniel J. I-381 Hoppe, A. I-410 Hou, Jian II-398, II-597 Hsieh, Yu-Cheng II-190 Hu, Jing II-486 Huber, David II-126 Humenberger, Martin I-674 Husmann, Kyle I-709 Hussain, Muhammad II-516 Hwang, Sae II-320 Iandola, Forrest N. I-102, II-85 Imiya, Atsushi I-23, II-270 Inomata, Ryo I-325 Itoh, Hayato I-23 Jähne, Bernd I-90 Jamal, Iqbal I-508 Jeong, Je-Chang II-95 Jeong, Jechang I-147 Jin, Liu I-551 Johansson, Mikael II-725 Johnson, Andrew II-13 Jones, M.D. II-249 Joy, Kenneth I. I-359 Jung, Kyungboo II-578 Kamali, Mahsa I-102, I-699, II-85 Kamberov, George I-718 Kamberova, Gerda I-718 Kambhamettu, Chandra II-669
Author Index Kampel, Martin II-446 Kang, Chaerin II-617 Kang, Hang-Bong I-748 Karkera, Nikhil II-148 Karydas, Lazaros I-718 Kashu, Koji II-270 Khosla, Deepak II-126, II-536 Kidsang, W. II-436 Kim, Jinwook II-715 Kim, Jonghwan II-387 Kim, Kyungnam II-126, II-536 Kim, Sung-Yeol I-291 Kim, Taemin I-709 Kim, Yonghoon I-147 Kim, Yoon-Ah II-617 Klima, Martin II-647 Knoblauch, Daniel I-359 Ko, Dong Wook II-578 Kobayashi, Yoshinori II-116, II-418 Kocamaz, Mehmet Kemal II-506 Koepnick, Steven II-33 Kogler, J¨ urgen I-674 Kolawole, Akintola II-496 Koneru, Ujwal II-209 Koschan, Andreas I-291 Kotarski, Wieslaw II-691 Koutlemanis, P. II-104 Koutsopoulos, Nikos I-259 Krumnikl, Michal II-567 Kuester, Falko I-359 Kuhlen, Torsten II-681 Kuijper, Arjan II-367 Kumar, Praveen II-526 Kumsawat, P. II-436 Kuno, Yoshinori II-116, II-418 Kuo, Yu-Tung I-484 Kurenov, Sergei II-55 Kwitt, Roland II-199 Kwon, Soon II-387 Lam, Roberto II-136 Lancaster, Nicholas II-33 Laramee, Robert S. I-653 Larsen, C. I-451 Leavenworth, William II-627 Lederman, Carl I-392 Lee, Byung-Uk II-617 Lee, Chung-Hee II-387 Lee, Dah-Jye I-541 Lee, Do-Kyung II-95
Lee, Dokyung I-147 Lee, Jeongkyu II-310 Lee, Sang Hwa II-578 Lehmann, Anke I-496 Lehr, J. I-461 Leigh, Jason II-13 Lenzen, Frank I-337 Lewandowski, Michal II-290 Li, Feng II-486 Li, Ya-Lin II-64 Liang, Zhiwen II-627 Lillywhite, Kirt I-541 Lim, Ser-Nam I-573 Lim, Young-Chul II-387 Lin, Albert Yu-Min II-229 Lin, Chung-Ching II-456 Ling, Haibin I-738, II-486 Lisowska, Agnieszka II-691 Liu, Damon Shing-Min II-190 Liu, Tianlun I-66 Liu, Zhuiguang I-530 Lowe, Richard J. I-192 Lu, Shuang I-23 Lu, Yan II-506 Luboschik, Martin I-472 Luczynski, Bart I-718 Luo, Gang I-728 Lux, Christopher I-269 Ma, Yingdong I-551 Macik, Miroslav II-647 Maier, Josef I-168 Mallikarjuna Rao, G. II-526 Mancas, Matei I-135 Mannan, Md. Abdul II-116 Mart´ınez, Francisco II-290 Mateevitsi, Victor A. II-13 McGinnis, Brad II-13 McVicker, W. I-461 Meisen, Tobias II-681 Mennillo, Laurent I-43 Mercat, Christian II-377 Mininni, Pablo I-619 Mishchenko, Ales II-476 Misterka, Jakub II-13 Mocanu, Bogdan I-607 Moeslund, T.B. I-451 Mohr, Daniel I-112 Mok, Seung Jun II-578 Montoya–Franco, Felipe I-664
Moody-Davis, Asher I-43 Moratto, Zachary I-709 Morelli, Gianfranco II-229 Mourning, Chad II-219, II-637 Moxon, Jordan I-518 Mueller, Klaus I-629 Muhamad, Ghulam II-516 Muhammad, Najah II-516 M¨ uller, Oliver I-156 Navr´ atil, Jan I-421 Nebel, Jean-Christophe II-290 Nefian, Ara V. I-709 Ni, Karl I-347 Nishimoto, Arthur II-13 Nixon, Mark S. I-192 Novo, Alexandre II-229 Nykl, Scott II-219, II-637 Ofek, Eyal II-85 Ohkawa, Yasuhiro II-555 Omer, Ido II-85 Ostermann, J¨ orn I-156 Owen, Christopher G. I-410 Owen, G. Scott I-431 Padalkar, Kshitij I-629 Pan, Guodong I-596 Pan, Ling-Yan II-328 Pande, Amit II-526 Pankov, Sergey II-168 Parag, Toufiq I-246 Parishani, Hossein II-669 Pasqual, Ajith I-313 Passarinho, Corn´elia Janayna P. II-466 Peli, Eli I-728 Pereira de Paula, Luis Roberto II-659 Perera, Samunda I-313 Peskin, Adele P. I-381 Petrauskiene, V. II-300 Phillips Jr., George N. I-518 Pirri, Fiora I-135 Pizzoli, Matia I-135 Platzer, Christopher II-627 Popescu, Voicu II-239 Prachyabrued, Mores II-45 Pree, Wolfgang II-199 Prieto, Flavio I-664
Punak, Sukitti II-55 Pundlik, Shrinivas I-728 Qi, Nai-Ming II-398, II-597 Qureshi, Haroon I-54 Radloff, Axel I-472 Ragulskiene, J. II-300 Ragulskis, M. II-300 Rasmussen, Christopher II-506 Rast, Mark I-619 Razdan, Anshuman II-209 Redkar, Sangram II-209 Reimer, Paul II-259 Reinhard, Rudolf II-681 Remagnino, P. I-410 Rieux, Fr´ed´eric II-377 Rilk, Markus I-563 Rittscher, Jens I-573 Rohith, M.V. II-669 Rosebrock, Dennis I-563 Rosen, Paul II-239 Rosenbaum, Ren´e I-530 Rosenhahn, Bodo I-156 Rossol, Nathaniel I-508 Rotkin, Seth II-24 Roup´e, Mattias II-725 Rudnicka, Alicja R. I-410 Rusdi Syamsuddin, Muhammad II-715 Sablatnig, Robert II-280 Sakai, Tomoya I-23, II-270 Sakyte, E. II-300 Salles, Evandro Ottoni T. II-338, II-466 Sarcinelli-Filho, M´ ario II-338, II-466 Savidis, Anthony I-259 Sch¨ afer, Henrik I-337 Schmauder, Hansj¨ org I-301 Schulten, Klaus II-1 Schulze, J¨ urgen P. II-24, II-229 Schumann, Heidrun I-472, I-496 Seifert, Robert I-641 Serna–Morales, Andr´es F. I-664 Sgambati, Matthew R. II-33 Shaffer, Eric I-699 Shalunts, Gayane II-280 Shang, Lifeng I-596 Shang, Lin II-328 Shemesh, Michal I-180 Singh, Rahul I-43
Author Index Sips, Mike I-472 Skelly, Luke J. I-347 Slavik, Pavel II-647 Smith, T. I-461 Sojka, Eduard II-567 Spehr, Jens I-563 Srikaew, A. II-436 Staadt, Oliver I-496 Stone, John E. II-1 Stroila, Matei I-699 Stuelten, Christina H. I-381 Sulzbachner, Christoph I-674 Sun, Shanhui I-214 Suryanto, Chendra Hadi II-555 Tapu, Ruxandra I-224 Tavakkoli, Alireza II-496 Teng, Xiao I-78 Terabayashi, Kenji I-325 Th´evenon, J´erˆ ome II-290 Tomari, Razali II-418 Tominski, Christian I-496 Tong, Melissa I-686 Tougne, Laure I-202 Tsai, Wen-Hsiang I-484, II-64 Tzanetakis, George II-259 Uhl, Andreas II-199 Umeda, Kazunori I-325 Umlauf, G. II-158 Uyyanonvara, B. I-410 Vacavant, Antoine I-202 Vandivort, Kirby L. II-1 Vanek, Juraj I-421 Vasile, Alexandru N. I-347 Vassilieva, Natalia II-476 Velastin, Sergio II-290 Vese, Luminita A. I-1, I-392, I-686
Vijaya Kumari, G. II-526 Villasenor, John I-1 Wahl, Friedrich M. I-563 Wang, Cui II-348 Wang, Lian-Ping II-669 Wang, Michael Yu II-408 Wang, Yuan-Fang I-281 Wanner, Sven I-90 Weber, Philip P. II-229 Weiskopf, Daniel I-301, I-641 White, J. I-461 Wolf, Marilyn II-456 Wood, Zo¨e J. I-441, I-461 Wu, Xiaojun II-408 Wu, Yi I-738, II-486 Xiang, Xiang
I-11, I-124
Yan, Ming I-1, I-33 Yang, Huei-Fang I-371, I-400 Yang, Sejung II-617 Yang, Yong II-398, II-597 Yang, Yu-Bin II-328 Yin, Lijun II-148 Yoon, Sang Min II-367 Yu, Jingyi II-486 Zabulis, X. II-104 Zachmann, Gabriel I-112 Zaharia, Titus I-224, I-607 Zemˇc´ık, Pavel I-421 Zhang, Bo-Ping II-597 Zhang, Mabel Mengzi II-24 Zhang, Tong II-627 Zhang, Yao II-328 Zhou, Minqi II-428 Zhu, Ying I-431 Zhuo, Huilong II-627 Zweng, Andreas II-446