Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6493
Ron Kimmel Reinhard Klette Akihiro Sugimoto (Eds.)
Computer Vision – ACCV 2010 10th Asian Conference on Computer Vision Queenstown, New Zealand, November 8-12, 2010 Revised Selected Papers, Part II
Volume Editors Ron Kimmel Department of Computer Science Technion – Israel Institute of Technology Haifa 32000, Israel E-mail:
[email protected] Reinhard Klette The University of Auckland Private Bag 92019, Auckland 1142, New Zealand E-mail:
[email protected] Akihiro Sugimoto National Institute of Informatics Chiyoda, Tokyo 1018430, Japan E-mail:
[email protected]
ISSN 0302-9743 ISBN 978-3-642-19308-8 DOI 10.1007/978-3-642-19309-5
e-ISSN 1611-3349 e-ISBN 978-3-642-19309-5
Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011921594 CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2 LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Cover picture: Lake Wakatipu and The Remarkables, from ‘Skyline Queenstown’ where the conference dinner took place. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The 2010 Asian Conference on Computer Vision took place in the southern hemisphere, in “The Land of the Long White Cloud” (as New Zealand is called in the Maori language), in the beautiful town of Queenstown. If we try to segment the world we realize that New Zealand does not belong officially to any continent. Similarly, in computer vision we often try to define outliers while attempting to segment images, separating them into well-defined “continents” that we refer to as objects. Thus, the ACCV Steering Committee consciously chose this remote and pretty island as a perfect location for ACCV2010, to host the computer vision conference of the most populated and largest continent, Asia. Here, on South Island we studied and exchanged ideas about the most recent advances in image understanding and processing sciences. Scientists from all well-defined continents (as well as ill-defined ones) submitted high-quality papers on subjects ranging from algorithms that attempt to automatically understand the content of images, through optical methods coupled with computational techniques that enhance and improve images, to capturing and analyzing the world’s geometry while preparing for higher-level image and shape understanding. Novel geometry techniques, statistical-learning methods, and modern algebraic procedures rapidly propagate their way into this fascinating field as we witness in many of the papers one can find in this collection. For this 2010 edition of ACCV, we had to select a relatively small part of all the submissions and did our best to solve the impossible ranking problem in the process. We had three keynote speakers (Sing Bing Kang lecturing on modeling of plants and trees, Sebastian Sylwan talking about computer vision in production of visual effects, and Tim Cootes lecturing about modelling deformable objects), eight workshops (Computational Photography and Esthetics, Computer Vision in Vehicle Technology, e-Heritage, Gaze Sensing and Interactions, Subspace, Video Event Categorization, Tagging and Retrieval, Visual Surveillance, and Application of Computer Vision for Mixed and Augmented Reality), and four tutorials. Three Program Chairs and 38 Area Chairs finalized the decision about the selection of 35 oral presentations and 171 posters, voted for out of 739 submissions, so far the highest number in the history of ACCV. During the reviewing process we made sure that each paper was reviewed by at least three reviewers, we added a rebuttal phase for the first time in ACCV, and held a three-day AC meeting in Tokyo to finalize the non-trivial acceptance decision-making process. Our sponsors were the Asian Federation of Computer Vision Societies (AFCV), NextWindow–Touch-Screen Technology, NICTA–Australia’s Information and Communications Technology (ICT), Microsoft Research Asia, Areograph–Interactive Computer Graphics, Adept Electronic Solutions, and 4D View Solutions.
Finally, the International Journal of Computer Vision (IJCV) sponsored the Best Student Paper Award. We wish to acknowledge a number of people for their invaluable help in putting this conference together. Many thanks to the Organizing Committee for their excellent logistical management, the Area Chairs for their rigorous evaluation of papers, the Program Committee members as well as external reviewers for their considerable time and effort, and the authors for their outstanding contributions. We also wish to acknowledge the following individuals for their tremendous service: Yoshihiko Mochizuki for support in Tokyo (in particular for the Area Chair meeting), Gisela Klette, Konstantin Schauwecker, and Simon Hermann for processing the 200+ LaTeX submissions for these proceedings, Kaye Saunders for running the conference office at Otago University, and the student volunteers during the conference from Otago University and the .enpeda.. group at The University of Auckland. We also thank all the colleagues listed on the following pages who contributed to this conference in their specified roles, led by Brendan McCane who took the main responsibilities. ACCV2010 was a very enjoyable conference. We hope that the next ACCV meetings will attract even more high-quality submissions. November 2010
Ron Kimmel Reinhard Klette Akihiro Sugimoto
Organization
Steering Committee Katsushi Ikeuchi Tieniu Tan Chil-Woo Lee Yasushi Yagi
University of Tokyo, Japan Institute of Automation, Chinese Academy of Science, China Chonnam National University, Korea Osaka University, Japan
Honorary Chairs P. Anandan Richard Hartley
Microsoft Research India Australian National University, NICTA
General Chairs Brendan McCane Hongbin Zha
University of Otago, New Zealand Peking University, China
Program Chairs Ron Kimmel Reinhard Klette Akihiro Sugimoto
Israel Institute of Technology University of Auckland, New Zealand National Institute of Informatics, Japan
Local Organization Chairs Brendan McCane John Morris
University of Otago, New Zealand University of Auckland, New Zealand
Workshop Chairs Fay Huang Reinhard Koch
Ilan University, Yi-Lan, Taiwan University of Kiel, Germany
Tutorial Chair Terrence Sim
National University of Singapore
Demo Chairs Kenji Irie Alan McKinnon
Lincoln Ventures, New Zealand Lincoln University, New Zealand
Publication Chairs Michael Cree Keith Unsworth
University of Waikato, New Zealand Lincoln University, New Zealand
Publicity Chairs John Barron Domingo Mery Ioannis Pitas
University of Western Ontario, Canada Pontificia Universidad Católica de Chile Aristotle University of Thessaloniki, Greece
Area Chairs Donald G. Bailey Horst Bischof Alex Bronstein Michael S. Brown Chu-Song Chen Hui Chen Laurent Cohen Daniel Cremers Eduardo Destefanis Hamid Krim Chil-Woo Lee Facundo Memoli Kyoung Mu Lee Stephen Lin Kai-Kuang Ma Niloy J. Mitra P.J. Narayanan Nassir Navab Takayuki Okatani Tomas Pajdla Nikos Paragios Robert Pless Marc Pollefeys Mariano Rivera Antonio Robles-Kelly Hideo Saito
Massey University, Palmerston North, New Zealand TU Graz, Austria Technion, Haifa, Israel National University of Singapore Academia Sinica, Taipei, Taiwan Shandong University, Jinan, China University Paris Dauphine, France Bonn University, Germany Technical University Cordoba, Argentina North Carolina State University, Raleigh, USA Chonnam National University, Gwangju, Korea Stanford University, USA Seoul National University, Korea Microsoft Research Asia, Beijing, China Nanyang Technological University, Singapore Indian Institute of Technology, New Delhi, India International Institute of Information Technology, Hyderabad, India TU Munich, Germany Tohoku University, Sendai City, Japan Czech Technical University, Prague, Czech Republic Ecole Centrale de Paris, France Washington University, St. Louis, USA ETH Z¨ urich, Switzerland CIMAT Guanajuato, Mexico National ICT, Canberra, Australia Keio University, Yokohama, Japan
Yoichi Sato Nicu Sebe Stefano Soatto Nir Sochen Peter Sturm David Suter Robby T. Tan Toshikazu Wada Yaser Yacoob Ming-Hsuan Yang Hong Zhang Mengjie Zhang
The University of Tokyo, Japan University of Trento, Italy University of California, Los Angeles, USA Tel Aviv University, Israel INRIA Grenoble, France University of Adelaide, Australia University of Utrecht, The Netherlands Wakayama University, Japan University of Maryland, College Park, USA University of California, Merced, USA University of Alberta, Edmonton, Canada Victoria University of Wellington, New Zealand
Program Committee Members Abdenour, Hadid Achard, Catherine Ai, Haizhou Aiger, Dror Alahari, Karteek Araguas, Gaston Arica, Nafiz Ariki, Yasuo Arslan, Abdullah Astroem, Kalle August, Jonas Aura Vese, Luminita Azevedo-Marques, Paulo Bagdanov, Andy Bagon, Shai Bai, Xiang Baloch, Sajjad Baltes, Jacky Bao, Yufang Bar, Leah Barbu, Adrian Barnes, Nick Barron, John Bartoli, Adrien Baust, Maximilian Ben Hamza, Abdessamad BenAbdelkader, Chiraz Ben-ari, Rami Beng-Jin, AndrewTeoh
Benosman, Ryad Berkels, Benjamin Berthier, Michel Bhattacharya, Bhargab Biswas, Prabir Bo, Liefeng Boerdgen, Markus Bors, Adrian Boshra, Michael Bouguila, Nizar Boyer, Edmond Bronstein, Michael Bruhn, Andres Buckley, Michael Cai, Jinhai Cai, Zhenjiang Calder´ on, Jes´ us Camastra, Francesco Canavesio, Luisa Cao, Xun Carlo, Colombo Carlsson, Stefan Caspi, Yaron Castellani, Umberto Celik, Turgay Cham, Tat-Jen Chan, Antoni Chandran, Sharat Charvillat, Vincent
Chellappa, Rama Chen, Bing-Yu Chen, Chia-Yen Chen, Chi-Fa Chen, Haifeng Chen, Hwann-Tzong Chen, Jie Chen, Jiun-Hung Chen, Ling Chen, Xiaowu Chen, Xilin Chen, Yong-Sheng Cheng, Shyi-Chyi Chia, Liang-Tien Chien, Shao-Yi Chin, Tat-Jun Chuang, Yung-Yu Chung, Albert Chunhong, Pan Civera, Javier Coleman, Sonya Cootes, Tim Costeira, JoaoPaulo Cristani, Marco Csaba, Beleznai Cui, Jinshi Daniilidis, Kostas Daras, Petros Davis, Larry De Campos, Teofilo Demirci, Fatih Deng, D. Jeremiah Deng, Hongli Denzler, Joachim Derrode, Stephane Diana, Mateus Didas, Stephan Dong, Qiulei Donoser, Michael Doretto, Gianfranco Dorst, Leo Duan, Fuqing Dueck, Delbert Duric, Zoran Dutta Roy, Sumantra
Ebner, Marc Einhauser, Wolfgang Engels, Christopher Eroglu-Erdem, Cigdem Escolano, Francisco Esteves, Claudia Evans, Adrian Fang, Wen-Pinn Feigin, Micha Feng, Jianjiang Ferri, Francesc Fite Georgel, Pierre Flitti, Farid Frahm, Jan-Michael Francisco Giro Mart´ın, Juan Fraundorfer, Friedrich Frosini, Patrizio Fu, Chi-Wing Fuh, Chiou-Shann Fujiyoshi, Hironobu Fukui, Kazuhiro Fumera, Giorgio Furst, Jacob Fusiello, Andrea Gall, Juergen Gallup, David Gang, Li Gasparini, Simone Geiger, Andreas Gertych, Arkadiusz Gevers, Theo Glocker, Ben Godin, Guy Goecke, Roland Goldluecke, Bastian Goras, Bogdan Gross, Ralph Gu, I Guerrero, Josechu Guest, Richard Guo, Guodong Gupta, Abhinav Gur, Yaniv Hajebi, Kiana Hall, Peter
Hamsici, Onur Han, Bohyung Hanbury, Allan Harit, Gaurav Hartley, Richard HassabElgawi, Osman Havlena, Michal Hayes, Michael Hayet, Jean-Bernard He, Junfeng Hee Han, Joon Hiura, Shinsaku Ho, Jeffrey Ho, Yo-Sung Ho Seo, Yung Hollitt, Christopher Hong, Hyunki Hotta, Kazuhiro Hotta, Seiji Hou, Zujun Hsu, Pai-Hui Hua, Gang Hua, Xian-Sheng Huang, Chun-Rong Huang, Fay Huang, Kaiqi Huang, Peter Huang, Xiangsheng Huang, Xiaolei Hudelot, Celine Hugo Sauchelli, V´ıctor Hung, Yi-Ping Hussein, Mohamed Huynh, Cong Phuoc Hyung Kim, Soo Ichimura, Naoyuki Ik Cho, Nam Ikizler-Cinbis, Nazli Il Park, Jong Ilic, Slobodan Imiya, Atsushi Ishikawa, Hiroshi Ishiyama, Rui Iwai, Yoshio Iwashita, Yumi
Jacobs, Nathan Jafari-Khouzani, Kourosh Jain, Arpit Jannin, Pierre Jawahar, C.V. Jenkin, Michael Jia, Jiaya Jia, JinYuan Jia, Yunde Jiang, Shuqiang Jiang, Xiaoyi Jin Chung, Myung Jo, Kang-Hyun Johnson, Taylor Joshi, Manjunath Jurie, Frederic Kagami, Shingo Kakadiaris, Ioannis Kale, Amit Kamberov, George Kanatani, Kenichi Kankanhalli, Mohan Kato, Zoltan Katti, Harish Kawakami, Rei Kawasaki, Hiroshi Keun Lee, Sang Khan, Saad-Masood Kim, Hansung Kim, Kyungnam Kim, Seon Joo Kim, TaeHoon Kita, Yasuyo Kitahara, Itaru Koepfler, Georges Koeppen, Mario Koeser, Kevin Kokiopoulou, Effrosyni Kokkinos, Iasonas Kolesnikov, Alexander Koschan, Andreas Kotsiantis, Sotiris Kown, Junghyun Kruger, Norbert Kuijper, Arjan
Kukenys, Ignas Kuno, Yoshinori Kuthirummal, Sujit Kwolek, Bogdan Kwon, Junseok Kybic, Jan Kyu Park, In Ladikos, Alexander Lai, Po-Hsiang Lai, Shang-Hong Lane, Richard Langs, Georg Lao, Shihong Lao, Zhiqiang Lauze, Francois Le, Duy-Dinh Le, Triet Lee, Jae-Ho Lee, Soochahn Leistner, Christian Leonardo, Bocchi Leow, Wee-Kheng Lepri, Bruno Lerasle, Frederic Li, Chunming Li, Hao Li, Hongdong Li, Stan Li, Yongmin Liao, T.Warren Lie, Wen-Nung Lien, Jenn-Jier Lim, Jongwoo Lim, Joo-Hwee Lin, Huei-Yung Lin, Weisi Lin, Wen-Chieh(Steve) Ling, Haibin Lipman, Yaron Liu, Cheng-Lin Liu, Jingen Liu, Ligang Liu, Qingshan Liu, Qingzhong Liu, Tianming
Liu, Tyng-Luh Liu, Xiaoming Liu, Yuncai Loog, Marco Lu, Huchuan Lu, Juwei Lu, Le Lucey, Simon Luo, Jiebo Macaire, Ludovic Maccormick, John Madabhushi, Anant Makris, Dimitrios Manabe, Yoshitsugu Marsland, Stephen Martinec, Daniel Martinet, Jean Martinez, Aleix Masuda, Takeshi Matsushita, Yasuyuki Mauthner, Thomas Maybank, Stephen McHenry, Kenton McNeill, Stephen Medioni, Gerard Mery, Domingo Mio, Washington Mittal, Anurag Miyazaki, Daisuke Mobahi, Hossein Moeslund, Thomas Mordohai, Philippos Moreno, Francesc Mori, Greg Mori, Kensaku Morris, John Mueller, Henning Mukaigawa, Yasuhiro Mukhopadhyay, Jayanta Muse, Pablo Nagahara, Hajime Nakajima, Shin-ichi Nanni, Loris Neshatian, Kourosh Newsam, Shawn
Niethammer, Marc Nieuwenhuis, Claudia Nikos, Komodakis Nobuhara, Shohei Norimichi, Ukita Nozick, Vincent Ofek, Eyal Ohnishi, Naoya Oishi, Takeshi Okabe, Takahiro Okuma, Kenji Olague, Gustavo Omachi, Shinichiro Ovsjanikov, Maks Pankanti, Sharath Paquet, Thierry Paternak, Ofer Patras, Ioannis Pauly, Olivier Pavlovic, Vladimir Peers, Pieter Peng, Yigang Penman, David Pernici, Federico Petrou, Maria Ping, Wong Ya Prasad Mukherjee, Dipti Prati, Andrea Qian, Zhen Qin, Xueyin Raducanu, Bogdan Rafael Canali, Luis Rajashekar, Umesh Ramalingam, Srikumar Ray, Nilanjan Real, Pedro Remondino, Fabio Reulke, Ralf Reyes, EdelGarcia Ribeiro, Eraldo Riklin Raviv, Tammy Roberto, Tron Rosenhahn, Bodo Rosman, Guy Roth, Peter
Roy Chowdhury, Amit Rugis, John Ruiz Shulcloper, Jose Ruiz-Correa, Salvador Rusinkiewicz, Szymon Rustamov, Raif Sadri, Javad Saffari, Amir Saga, Satoshi Sagawa, Ryusuke Salzmann, Mathieu Sanchez, Jorge Sang, Nong Sang Hong, Ki Sang Lee, Guee Sappa, Angel Sarkis, Michel Sato, Imari Sato, Jun Sato, Tomokazu Schiele, Bernt Schikora, Marek Schoenemann, Thomas Scotney, Bryan Shan, Shiguang Sheikh, Yaser Shen, Chunhua Shi, Qinfeng Shih, Sheng-Wen Shimizu, Ikuko Shimshoni, Ilan Shin Park, You Sigal, Leonid Sinha, Sudipta So Kweon, In Sommerlade, Eric Song, Andy Souvenir, Richard Srivastava, Anuj Staiano, Jacopo Stein, Gideon Stottinge, Julian Strecha, Christoph Strekalovskiy, Evgeny Subramanian, Ramanathan
Sugaya, Noriyuki Sumi, Yasushi Sun, Weidong Swaminathan, Rahul Tai, Yu-Wing Takamatsu, Jun Talbot, Hugues Tamaki, Toru Tan, Ping Tanaka, Masayuki Tang, Chi-Keung Tang, Jinshan Tang, Ming Taniguchi, Rinichiro Tao, Dacheng Tavares, Jo˜ao Manuel R.S. Teboul, Olivier Terauchi, Mutsuhiro Tian, Jing Tian, Taipeng Tobias, Reichl Toews, Matt Tominaga, Shoji Torii, Akihiko Tsin, Yanghai Turaga, Pavan Uchida, Seiichi Ueshiba, Toshio Unger, Markus Urtasun, Raquel van de Weijer, Joost Van Horebeek, Johan Vassallo, Raquel Vasseur, Pascal Vaswani, Namrata Wachinger, Christian Wang, Chen Wang, Cheng Wang, Hongcheng Wang, Jue Wang, Yu-Chiang Wang, Yunhong Wang, Zhi-Heng
Wang, Zhijie Wolf, Christian Wolf, Lior Wong, Kwan-Yee Woo, Young Wook Lee, Byung Wu, Jianxin Xue, Jianru Yagi, Yasushi Yan, Pingkun Yan, Shuicheng Yanai, Keiji Yang, Herbert Yang, Jie Yang, Yongliang Yi, June-Ho Yilmaz, Alper You, Suya Yu, Jin Yu, Tianli Yuan, Junsong Yun, Il Dong Zach, Christopher Zelek, John Zha, Zheng-Jun Zhang, Cha Zhang, Changshui Zhang, Guofeng Zhang, Hongbin Zhang, Li Zhang, Liqing Zhang, Xiaoqin Zheng, Lu Zheng, Wenming Zhong, Baojiang Zhou, Cathy Zhou, Changyin Zhou, Feng Zhou, Jun Zhou, S. Zhu, Feng Zou, Danping Zucker, Steve
Additional Reviewers Bai, Xiang Collins, Toby Compte, Benot Cong, Yang Das, Samarjit Duan, Lixing Fihl, Preben Garro, Valeria Geng, Bo Gherardi, Riccardo Giusti, Alessandro Guo, Jing-Ming Gupta, Vipin Han, Long Korchev, Dmitriy Kulkarni, Kaustubh Lewandowski, Michal Li, Xin Li, Zhu Lin, Guo-Shiang Lin, Wei-Yang
Liu, Damon Shing-Min Liu, Dong Luo, Ye Magerand, Ludovic Molineros, Jose Rao, Shankar Samir, Chafik Sanchez-Riera, Jordy Suryanarayana, Venkata Tang, Sheng Thota, Rahul Toldo, Roberto Tran, Du Wang, Jingdong Wu, Jun Yang, Jianchao Yang, Linjun Yang, Kuiyuan Yuan, Fei Zhang, Guofeng Zhuang, Jinfeng
ACCV2010 Best Paper Award Committee Alfred M. Bruckstein Larry S. Davis Richard Hartley Long Quan
Technion, Israel Institute of Technology, Israel University of Maryland, USA Australian National University, Australia The Hong Kong University of Science and Technology, Hong Kong
Sponsors of ACCV2010 Main Sponsor
The Asian Federation of Computer Vision Societies (AFCV)
Gold Sponsor
NextWindow – Touch-Screen Technology
Silver Sponsors
Areograph – Interactive Computer Graphics Microsoft Research Asia Australia’s Information and Communications Technology (NICTA) Adept Electronic Solutions
Bronze Sponsor
4D View Solutions
Best Student Paper Sponsor
The International Journal of Computer Vision (IJCV)
Best Paper Prize ACCV 2010 Context-Based Support Vector Machines for Interconnected Image Annotation Hichem Sahbi, Xi Li.
Best Student Paper ACCV 2010 Fast Spectral Reflectance Recovery Using DLP Projector Shuai Han, Imari Sato, Takahiro Okabe, Yoichi Sato
Best Application Paper ACCV 2010 Network Connectivity via Inference Over Curvature-Regularizing Line Graphs Maxwell Collins, Vikas Singh, Andrew Alexander
Honorable Mention ACCV 2010 Image-Based 3D Modeling via Cheeger Sets Eno Toeppe, Martin Oswald, Daniel Cremers, Carsten Rother
Outstanding Reviewers ACCV 2010 Philippos Mordohai Peter Roth Matt Toews Andres Bruhn Sudipta Sinha Benjamin Berkels Mathieu Salzmann
Table of Contents – Part II
Posters on Day 1 of ACCV 2010 Generic Object Class Detection Using Boosted Configurations of Oriented Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Oscar Danielsson and Stefan Carlsson
1
Unsupervised Feature Selection for Salient Object Detection . . . . . . . . . . . Viswanath Gopalakrishnan, Yiqun Hu, and Deepu Rajan
15
MRF Labeling for Multi-view Range Image Integration . . . . . . . . . . . . . . . Ran Song, Yonghuai Liu, Ralph R. Martin, and Paul L. Rosin
27
Wave Interference for Pattern Description . . . . . . . . . . . . . . . . . . . . . . . . . . . Selen Atasoy, Diana Mateus, Andreas Georgiou, Nassir Navab, and Guang-Zhong Yang
41
Colour Dynamic Photometric Stereo for Textured Surfaces . . . . . . . . . . . . Zsolt Jankó, Amaël Delaunoy, and Emmanuel Prados
55
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Li, Wei Chen, Kaiqi Huang, and Tieniu Tan
67
Modeling Complex Scenes for Accurate Moving Objects Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianwei Ding, Min Li, Kaiqi Huang, and Tieniu Tan
82
Online Learning for PLSA-Based Visual Recognition . . . . . . . . . . . . . . . . . Jie Xu, Getian Ye, Yang Wang, Wei Wang, and Jun Yang Emphasizing 3D Structure Visually Using Coded Projection from Multiple Projectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Nakamura, Fumihiko Sakaue, and Jun Sato Object Class Segmentation Using Reliable Regions . . . . . . . . . . . . . . . . . . . Vida Vakili and Olga Veksler Specular Surface Recovery from Reflections of a Planar Pattern Undergoing an Unknown Pure Translation . . . . . . . . . . . . . . . . . . . . . . . . . . Miaomiao Liu, Kwan-Yee K. Wong, Zhenwen Dai, and Zhihu Chen Medical Image Segmentation Based on Novel Local Order Energy . . . . . . LingFeng Wang, Zeyun Yu, and ChunHong Pan
95
109 123
137 148
Geometries on Spaces of Treelike Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . Aasa Feragen, Francois Lauze, Pechin Lo, Marleen de Bruijne, and Mads Nielsen
160
Human Pose Estimation Using Exemplars and Part Based Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanchao Su, Haizhou Ai, Takayoshi Yamashita, and Shihong Lao
174
Full-Resolution Depth Map Estimation from an Aliased Plenoptic Light Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tom E. Bishop and Paolo Favaro
186
Indoor Scene Classification Using Combined 3D and Gist Features . . . . . Agnes Swadzba and Sven Wachsmuth
201
Closed-Form Solutions to Minimal Absolute Pose Problems with Known Vertical Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla
216
Level Set with Embedded Conditional Random Fields and Shape Priors for Segmentation of Overlapping Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuqing Wu and Shishir K. Shah
230
Optimal Two-View Planar Scene Triangulation . . . . . . . . . . . . . . . . . . . . . . Kenichi Kanatani and Hirotaka Niitsuma
242
Pursuing Atomic Video Words by Information Projection . . . . . . . . . . . . . Youdong Zhao, Haifeng Gong, and Yunde Jia
254
A Direct Method for Estimating Planar Projective Transform . . . . . . . . . Yu-Tseh Chi, Jeffrey Ho, and Ming-Hsuan Yang
268
Spatial-Temporal Motion Compensation Based Video Super Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yaozu An, Yao Lu, and Ziye Yan Learning Rare Behaviours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jian Li, Timothy M. Hospedales, Shaogang Gong, and Tao Xiang Character Energy and Link Energy-Based Text Extraction in Scene Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jing Zhang and Rangachar Kasturi
282
293
308
A Novel Representation of Palm-Print for Recognition . . . . . . . . . . . . . . . . G.S. Badrinath and Phalguni Gupta
321
Real-Time Robust Image Feature Description and Matching . . . . . . . . . . . Stephen J. Thomas, Bruce A. MacDonald, and Karl A. Stol
334
A Biologically-Inspired Theory for Non-axiomatic Parametric Curve Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guy Ben-Yosef and Ohad Ben-Shahar
346
Geotagged Image Recognition by Combining Three Different Kinds of Geolocation Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keita Yaegashi and Keiji Yanai
360
Modeling Urban Scenes in the Spatial-Temporal Space . . . . . . . . . . . . . . . . Jiong Xu, Qing Wang, and Jie Yang
374
Family Facial Patch Resemblance Extraction . . . . . . . . . . . . . . . . . . . . . . . . M. Ghahramani, W.Y. Yau, and E.K. Teoh
388
3D Line Segment Detection for Unorganized Point Clouds from Multi-view Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tingwang Chen and Qing Wang
400
Multi-View Stereo Reconstruction with High Dynamic Range Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Feng Lu, Xiangyang Ji, Qionghai Dai, and Guihua Er
412
Feature Quarrels: The Dempster-Shafer Evidence Theory for Image Segmentation Using a Variational Framework . . . . . . . . . . . . . . . . . . . . . . . . Björn Scheuermann and Bodo Rosenhahn
426
Gait Analysis of Gender and Age Using a Large-Scale Multi-view Gait Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasushi Makihara, Hidetoshi Mannami, and Yasushi Yagi
440
A System for Colorectal Tumor Classification in Magnifying Endoscopic NBI Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Toru Tamaki, Junki Yoshimuta, Takahishi Takeda, Bisser Raytchev, Kazufumi Kaneda, Shigeto Yoshida, Yoshito Takemura, and Shinji Tanaka A Linear Solution to 1-Dimensional Subspace Fitting under Incomplete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hanno Ackermann and Bodo Rosenhahn
452
464
Efficient Clustering Earth Mover’s Distance . . . . . . . . . . . . . . . . . . . . . . . Jenny Wagner and Björn Ommer
477
One-Class Classification with Gaussian Processes . . . . . . . . . . . . . . . . . . . . Michael Kemmler, Erik Rodner, and Joachim Denzler
489
A Fast Semi-inverse Approach to Detect and Remove the Haze from a Single Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Codruta O. Ancuti, Cosmin Ancuti, Chris Hermans, and Philippe Bekaert
501
Salient Region Detection by Jointly Modeling Distinctness and Redundancy of Image Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yiqun Hu, Zhixiang Ren, Deepu Rajan, and Liang-Tien Chia
515
Unsupervised Selective Transfer Learning for Object Recognition . . . . . . . Wei-Shi Zheng, Shaogang Gong, and Tao Xiang
527
A Heuristic Deformable Pedestrian Detection Method . . . . . . . . . . . . . . . . Yongzhen Huang, Kaiqi Huang, and Tieniu Tan
542
Gradual Sampling and Mutual Information Maximisation for Markerless Motion Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yifan Lu, Lei Wang, Richard Hartley, Hongdong Li, and Dan Xu Temporal Feature Weighting for Prototype-Based Action Recognition . . . Thomas Mauthner, Peter M. Roth, and Horst Bischof PTZ Camera Modeling and Panoramic View Generation via Focal Plane Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Karthik Sankaranarayanan and James W. Davis Horror Image Recognition Based on Emotional Attention . . . . . . . . . . . . . Bing Li, Weiming Hu, Weihua Xiong, Ou Wu, and Wei Li Spatial-Temporal Affinity Propagation for Feature Clustering with Application to Traffic Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Yang, Yang Wang, Arcot Sowmya, Jie Xu, Zhidong Li, and Bang Zhang Minimal Representations for Uncertainty and Estimation in Projective Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang F¨ orstner Personalized 3D-Aided 2D Facial Landmark Localization . . . . . . . . . . . . . . Zhihong Zeng, Tianhong Fang, Shishir K. Shah, and Ioannis A. Kakadiaris
554 566
580 594
606
619 633
A Theoretical and Numerical Study of a Phase Field Higher-Order Active Contour Model of Directed Networks . . . . . . . . . . . . . . . . . . . . . . . . . Aymen El Ghoul, Ian H. Jermyn, and Josiane Zerubia
647
Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Zhu, Xu Zhao, Yun Fu, and Yuncai Liu
660
Multi-illumination Face Recognition from a Single Training Image per Person with Sparse Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Die Hu, Li Song, and Cheng Zhi
672
Human Detection in Video over Large Viewpoint Changes . . . . . . . . . . . . Genquan Duan, Haizhou Ai, and Shihong Lao Adaptive Parameter Selection for Image Segmentation Based on Similarity Estimation of Multiple Segmenters . . . . . . . . . . . . . . . . . . . . . . . . Lucas Franek and Xiaoyi Jiang
683
697
Cosine Similarity Metric Learning for Face Verification . . . . . . . . . . . . . . . Hieu V. Nguyen and Li Bai
709
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
721
Generic Object Class Detection Using Boosted Configurations of Oriented Edges Oscar Danielsson and Stefan Carlsson School of Computer Science and Communications Royal Inst. of Technology, Stockholm, Sweden {osda02,stefanc}@csc.kth.se
Abstract. In this paper we introduce a new representation for shapebased object class detection. This representation is based on very sparse and slightly flexible configurations of oriented edges. An ensemble of such configurations is learnt in a boosting framework. Each edge configuration can capture some local or global shape property of the target class and the representation is thus not limited to representing and detecting visual classes that have distinctive local structures. The representation is also able to handle significant intra-class variation. The representation allows for very efficient detection and can be learnt automatically from weakly labelled training images of the target class. The main drawback of the method is that, since its inductive bias is rather weak, it needs a comparatively large training set. We evaluate on a standard database [1] and when using a slightly extended training set, our method outperforms state of the art [2] on four out of five classes.
1 Introduction
Generic shape-based detection of object classes is a difficult problem because there is often significant intra-class variation. This means that the class cannot be well described by a simple representation, such as a template or some sort of mean shape. If templates are used, typically a very large number of them is required [3]. Another way of handling intra-class variation is to use more meaningful distance measures than the Euclidean or Chamfer distances typically used to match templates to images. Then, the category can potentially be represented by a substantially smaller number of templates/prototypes [4]. A meaningful distance measure, for example based on thin-plate spline deformation, generally requires correspondences between the model and the image [4–6]. Although many good papers have been devoted to computing shape based correspondences [4,7], this is still a very difficult problem under realistic conditions. Existing methods are known to be sensitive to clutter and are typically computed using an iterative process of high computational complexity. Another common way of achieving a compact description of an object category is to decompose the class into a set of parts and then model each part independently. For shape classes, the parts are typically contour segments [5,8,9], making R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 1–14, 2011. c Springer-Verlag Berlin Heidelberg 2011
such methods practical mainly for object classes with locally discriminative contours. However, many object classes lack such contour features. These classes might be better described by global contour properties or by image gradient structures that are not on the contour at all, as illustrated in figure 1.
Fig. 1. Intra-class shape variation too large for efficient template-based representation. No discriminative local features are shared among the exemplars of the class. However, exemplars share discriminative edge configurations.
In this paper we present an object representation that achieves a compact description of the target class without assuming the presence of discriminative local shape or appearance features. The main components of this representation are sparse and slightly flexible configurations of oriented edge features, built incrementally in a greedy error minimization scheme. We will refer to these configurations as weak classifiers. The final strong classifier is a linear combination of weak classifiers learnt using AdaBoost [10]. We also stress that the method can be trivially extended to include other types of features (for example corners or appearance features) when applicable. We use the attentional cascade of Viola and Jones for selection of relevant negative training examples and for fast rejection of negative test examples [11]. For detection we evaluate the sliding-window approach and we also point out the possibility of a hierarchical search strategy. Feature values are in both cases computed efficiently using the distance transform [12]. We achieve detection performance comparable to state-of-the-art methods. Our main contributions are: (i) a generic, shape-based object class representation that can be learnt automatically in a discriminative framework from weakly labelled training examples, (ii) an efficient detection algorithm that combines an attentional cascade with efficient feature value computation and (iii) an evaluation showing very promising results on a standard dataset. In the following section we review other methods that use similar object representations. The rest of the paper is organized as follows. In section 3 we describe how our method represents the object category. In section 4 we describe the edge features and their computation. In sections 5 and 6 we describe the algorithms for learning and detection in detail. In sections 7 and 8 we present an experimental evaluation. Finally, in sections 9, 10 and 11, we analyze the results, suggest directions for future study and draw conclusions.
2 Related Work
In the proposed approach the target object category is represented by sparse, flexible configurations of oriented edges. Several other works have relied on similar representations. For example, Fleuret and Geman learn a disjunction of oriented edge configurations, where all oriented edges of at least one configuration are required to be present within a user-specified tolerance in order for detection to occur [13]. Wu et al. learn a configuration of Gabor basis functions, where each basis function is allowed to move within a user-specified distance from its preferred position in order to maximize its response [14]. The most similar representation is probably from Danielsson et al. [15], who use a combination of sparse configurations of oriented edges in a voting scheme for object detection. In their work, however, the number of configurations used to represent a class, the number of edges in each configuration and the flexibility tolerance of edges are all defined by the user and not learned automatically. They also use different methods for learning and detection. We argue that our representation essentially generalizes these representations. For example, by constraining the weak classifiers to use only a single edge feature we get a representation similar to that of Wu et al. [14]. The proposed approach also extends the mentioned methods because the tolerance in edge location is learnt rather than defined by the user. Furthermore, being discriminative rather than generative, it can naturally learn tolerance thresholds with negative as well as positive parity, i.e. it can also encode that an edge of a certain orientation is unlikely to be within a certain region. The use of a cascade of increasingly complex classifiers to quickly reject obvious negatives was pioneered by Viola and Jones [11,16]. The cascade also provides a means of focusing the negative training set on relevant examples during training. Each stage in the cascade contains an AdaBoost classifier, which is a linear combination of weak classifiers. Viola and Jones constructed weak classifiers by thresholding single Haar features, whereas our weak classifiers are flexible configurations of oriented edge features.
3 Object Class Representation
The key component of the object representation is the weak classifiers. A weak classifier can be regarded as a conjunction of a set of single feature classifiers, where a single feature classifier is defined by an edge feature (a location and orientation) along with a tolerance threshold and its parity. A single feature classifier returns true if the distance from the specified location to the closest edge with the specified orientation is within tolerance (i.e. it should be sufficiently small if the parity is positive and sufficiently large if the parity is negative). A weak classifier returns true if all its constituent single feature classifiers return true. A strong classifier is then formed by boosting the weak classifiers. Figure 2 illustrates how the output of the strong classifier is computed for two different examples of the target class. The output of the strong classifier is thresholded
Fig. 2. An object class is represented by a strong classifier, which is a linear combination of weak classifiers. Each weak classifier is a conjunction of single feature classifiers. Each single feature classifier is described by a feature (bars), a distance threshold (circles) and a parity of the threshold (green = pos. parity, red = neg. parity). The output of the strong classifier is the sum of the weights corresponding to the “active” weak classifiers, as illustrated for two different examples in the figure.
to determine class membership. The edge features will be described in the next section and the weak classifiers in section 5.1.
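To make the representation concrete, the sketch below shows one way the classifiers of this section could be held in code, assuming the feature values f_k(I, t, s) of Section 4 have already been computed for the window under test. All class and variable names are illustrative and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SingleFeatureClassifier:
    feature: int      # index k of the oriented edge feature F_k = (x_k, theta_k)
    threshold: float  # tolerance threshold t
    parity: int       # +1: distance must be small enough, -1: distance must be large enough

    def fires(self, f):
        # f = f_k(I, t, s), the normalized distance for this feature
        return self.parity * f <= self.parity * self.threshold

@dataclass
class WeakClassifier:
    tests: List[SingleFeatureClassifier]

    def fires(self, feature_values):
        # conjunction: every single feature classifier must accept
        return all(t.fires(feature_values[t.feature]) for t in self.tests)

@dataclass
class StrongClassifier:
    weak: List[WeakClassifier]
    alpha: List[float]   # boosting weights
    threshold: float

    def score(self, feature_values):
        # sum of the weights of the weak classifiers that fire (cf. Fig. 2)
        return sum(a for a, h in zip(self.alpha, self.weak) if h.fires(feature_values))

    def classify(self, feature_values):
        return self.score(feature_values) >= self.threshold
```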
4 Edge Features
An edge feature defines a single feature classifier together with a threshold and its parity. In this section we define the edge feature and the feature value. We also describe how to compute feature values efficiently. An edge feature Fk = (xk , θk ) is defined by a location xk and orientation θk in a normalized coordinate system. The feature value, fk , is the (normalized) distance to the closest image edge with similar orientation (two orientations are defined as similar if they differ by less than a threshold, tθ ). In order to emphasize that feature values are computed by aligning the normalized feature coordinate system with an image, I, using translation t and scaling s, we write fk (I, t, s). See figure 3 for an illustration.
Fig. 3. Feature values fk (I, t, s) are computed by translating (t) and scaling (s) features Fk = (xk , θk ) and taking the (normalized) distances to the closest edges with similar orientations in the image
Fig. 4. Images are preprocessed as follows: (1) extract a set of oriented edges E (I), (2) split E (I) into (overlapping) subsets Eθ (I) and (3) compute the distance transforms dθ (I, p).
In order to define f_k(I, t, s), let E(I) = {…, (p′, θ′), …} be the set of oriented edge elements (or edgels) in image I (E(I) is simply computed by taking the output of any edge detector and appending the edge orientation at each edge point) and let E_θ(I) = {(p′, θ′) ∈ E(I) : |cos(θ′ − θ)| ≥ cos(t_θ)} be the set of edgels with orientation similar to θ. We can then define f_k(I, t, s) as:

f_k(I, t, s) = (1/s) · min_{(p′, θ′) ∈ E_{θ_k}(I)} ‖s · x_k + t − p′‖        (1)
Efficient Computation of Feature Values
The feature value fk (I, t, s) defined in the previous section can be computed using only a single array reference and a few elementary operations. The required preprocessing is as follows. Start by extracting edges E(I) from the input image I. This involves running an edge detector on the image and then computing the orientation of each edge point. The orientation is orthogonal to the image gradient and unsigned, i.e. defined modulo π. For edge points belonging to an edge segment, we compute the orientation by fitting a line to nearby points on the segment. We call the elements of E(I) edgels (edge elements). The second step of preprocessing involves splitting the edgels into several overlapping subsets Eθ (I), one subset for each orientation θ ∈ Θ (as defined in the previous section). Then compute the distance transform dθ (I, p) on each subset [12]: dθ (I, p) =
d_θ(I, p) = min_{(p′, θ′) ∈ E_θ(I)} ‖p − p′‖        (2)
This step is illustrated in figure 4; in the first column of the figure the subsets corresponding to horizontal and vertical orientations are displayed as feature maps and in the last column the corresponding distance transforms dθ (I, p) are shown. Feature values can be efficiently computed as: fk (I, t, s) = dθk (I, s · xk + t)/s. This requires only one array reference and a few elementary operations.
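A possible realization of this preprocessing with off-the-shelf routines is sketched below. The Sobel-based edge detection, the gradient-magnitude threshold and the choice of eight orientation bins are our assumptions; the paper only requires some edge detector together with a set of orientations Θ and a tolerance t_θ.

```python
import numpy as np
from scipy.ndimage import sobel, distance_transform_edt

def oriented_distance_maps(gray, n_orient=8, edge_thresh=0.1):
    """Compute one distance transform d_theta(I, p) per orientation bin (Sec. 4.1)."""
    gx, gy = sobel(gray, axis=1), sobel(gray, axis=0)
    mag = np.hypot(gx, gy)
    # unsigned edge orientation (orthogonal to the gradient), defined modulo pi
    orient = np.mod(np.arctan2(gy, gx) + np.pi / 2.0, np.pi)
    edges = mag > edge_thresh * mag.max()

    t_theta = np.pi / n_orient  # orientation similarity tolerance (assumed value)
    maps = []
    for b in range(n_orient):
        theta = b * np.pi / n_orient
        # edgels with similar orientation: |cos(theta' - theta)| >= cos(t_theta)
        subset = edges & (np.abs(np.cos(orient - theta)) >= np.cos(t_theta))
        if subset.any():
            maps.append(distance_transform_edt(~subset))
        else:  # no edgel of this orientation: use a large constant distance
            maps.append(np.full(gray.shape, float(np.hypot(*gray.shape))))
    return maps

def feature_value(maps, feature, t, s):
    """f_k(I, t, s) = d_{theta_k}(I, s*x_k + t) / s -- one array reference, Eq. (1)."""
    (xk, yk), theta_bin = feature  # normalized location and orientation bin index
    col = int(np.clip(round(s * xk + t[0]), 0, maps[theta_bin].shape[1] - 1))
    row = int(np.clip(round(s * yk + t[1]), 0, maps[theta_bin].shape[0] - 1))
    return maps[theta_bin][row, col] / s
```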
5 Learning
The basic component of our object representation is the weak classifier (section 5.1). It is boosted to form a strong classifier (section 5.2). Finally, a cascade of strong classifiers is built (section 5.3).

5.1 Learning the Weak Classifiers
In this section we describe how to learn a weak classifier given a weighted set of training examples. The goal of the learner is to find a weak classifier with minimal (weighted) classification error on the training set. The input to the training algorithm is a set of training examples {I_j | j ∈ J} with target class c_j ∈ {0, 1} and a weight distribution {d_j | j ∈ J}, with Σ_{j∈J} d_j = 1, over these examples. Each training image is annotated with the centroid t_j and scale s_j of the object in the image. A weak classifier is the conjunction of a set of single feature classifiers. A single feature classifier is defined by a feature, F_k, a threshold, t, and a parity, p ∈ {−1, 1}. The single feature classifier has the classification function g(I, t, s) = (p · f_k(I, t, s) ≤ p · t) and the weak classifier thus has the classification function h(I, t, s) = ∧_{i=1}^{N} g_i(I, t, s) = ∧_{i=1}^{N} (p_i · f_{k_i}(I, t, s) ≤ p_i · t_i). The number of single feature classifiers, N, is learnt automatically. We want to find the weak classifier that minimizes the classification error, e, w.r.t. the current importance distribution:

e = P_{j∼d}(h(I_j, t_j, s_j) ≠ c_j) = Σ_{j∈J : h(I_j, t_j, s_j) ≠ c_j} d_j        (3)
We use a greedy algorithm to incrementally construct a weak classifier. In order to define this algorithm, let h_n(I, t, s) = ∧_{i=1}^{n} g_i(I, t, s) be the conjunction of the first n single feature classifiers. Then let J_n = {j ∈ J | h_n(I_j, t_j, s_j) = 1} be the indices of all training examples classified as positive by h_n and let e_n be the classification error of h_n:

e_n = Σ_{j∈J : h_n(I_j, t_j, s_j) ≠ c_j} d_j        (4)
We can write e_n in terms of e_{n−1} as:

e_n = e_{n−1} + Σ_{j∈J_{n−1} : g_n(I_j, t_j, s_j) ≠ c_j} d_j − Σ_{j∈J_{n−1} : c_j = 0} d_j        (5)
This gives us a recipe for learning the nth single feature classifier:

g_n = arg min_g Σ_{j∈J_{n−1} : g(I_j, t_j, s_j) ≠ c_j} d_j        (6)
We keep g_n if e_n < e_{n−1}, which can be evaluated easily using equation (5); otherwise we let N = n − 1 and stop. The process is illustrated in figure 5: in a) the weak classifier is not sufficiently specific to delineate the class manifold well, in b) the weak classifier is too specific and thus inefficient and prone to overfitting, whereas in c) the specificity of the weak classifier is tuned to the task at hand.
Fig. 5. The number of single feature classifiers controls the specificity of a weak classifier. a) Using too few yields weak classifiers that are too generic to delineate the class manifold (gray). b) Using too many yields weak classifiers that are prone to overfitting and that are inefficient at representing the class. c) We learn the optimal number of single feature classifiers for each weak classifier (the notation is defined in the text).
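A sketch of this greedy construction, using equation (5) to update the error and stopping as soon as an additional single feature classifier no longer helps, is given below. The helper best_single_feature_classifier is assumed to return the classifier minimizing equation (6) on the currently active examples; one possible implementation is sketched after the next paragraph. Function and variable names are ours.

```python
import numpy as np

def learn_weak_classifier(fvals, labels, weights, best_single_feature_classifier):
    """Greedy construction of a weak classifier (Sec. 5.1, Eqs. 4-6).

    fvals   -- (n_examples, n_features) matrix of feature values f_k(I_j, t_j, s_j)
    labels  -- array of target classes c_j in {0, 1}
    weights -- importance distribution d_j over the examples (sums to one)
    """
    active = np.ones(len(labels), dtype=bool)   # J_0: the empty conjunction accepts everything
    error = weights[labels == 0].sum()          # e_0: all negatives are misclassified
    tests = []
    while True:
        g, g_out = best_single_feature_classifier(fvals, labels, weights, active)
        # Eq. (5): error change caused by appending g to the conjunction
        new_error = (error
                     + weights[active & (g_out != labels)].sum()
                     - weights[active & (labels == 0)].sum())
        if new_error >= error:                  # no improvement: stop, N = n - 1
            break
        tests.append(g)
        error = new_error
        active &= (g_out == 1)                  # J_n = {j in J_{n-1} | g_n fires}
    return tests, error
```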
Learning a Single Feature Classifier. Learning a single feature classifier involves selecting a feature F_k from a “dictionary” F, a threshold t from R+ and a parity p from {−1, 1}. An optimal single feature classifier, that minimizes equation (6), can be found by evaluating a finite set of candidate classifiers [17]. However, typically a large number of candidate classifiers have to be evaluated, leading to a time consuming learning process. Selecting the unconstrained optimum might also lead to overfitting. We have observed empirically that our feature values (being nonnegative) tend to be exponentially distributed. This suggests selecting the threshold t for a particular feature as the intersection of two exponential pdfs, where μ+ is the (weighted) average of the feature values from the positive examples and μ− is the (weighted) average from the negative examples:

t = ln(μ−/μ+) · (μ+ μ−)/(μ− − μ+)        (7)

The parity is 1 if μ+ ≤ μ− and −1 otherwise. We thus have only one possible classifier for each feature and can simply loop over all features and select the classifier yielding the smallest value for equation (6). This method for single feature classifier selection yielded good results in evaluation and we used it in our experiments.
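The threshold rule of equation (7) and the exhaustive loop over features could look as follows; the small epsilons guarding against degenerate weighted means are our addition, not part of the paper.

```python
import numpy as np

def best_single_feature_classifier(fvals, labels, weights, active):
    """Threshold each feature at the intersection of two exponential pdfs (Eq. 7)
    and keep the feature with the smallest weighted error on the active set (Eq. 6)."""
    w = weights * active                              # zero weight outside J_{n-1}
    pos, neg = (labels == 1), (labels == 0)
    wp, wn = max(w[pos].sum(), 1e-12), max(w[neg].sum(), 1e-12)
    best_g, best_out, best_err = None, None, np.inf
    for k in range(fvals.shape[1]):
        f = fvals[:, k]
        mu_p = max((w * pos * f).sum() / wp, 1e-9)    # weighted mean over positives
        mu_n = max((w * neg * f).sum() / wn, 1e-9)    # weighted mean over negatives
        if np.isclose(mu_p, mu_n):
            continue
        t = np.log(mu_n / mu_p) * mu_p * mu_n / (mu_n - mu_p)   # Eq. (7)
        parity = 1 if mu_p <= mu_n else -1
        out = (parity * f <= parity * t).astype(int)
        err = w[out != labels].sum()                  # weighted error, Eq. (6)
        if err < best_err:
            best_g, best_out, best_err = (k, t, parity), out, err
    return best_g, best_out
```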
5.2 Learning the Strong Classifier
We use asymmetric AdaBoost for learning the strong classifiers at each stage of the cascade [16]. This requires setting a parameter k specifying that false negatives cost k times more than false positives. We empirically found k = 3n−/n+ to be a reasonable choice. Finally, the detection threshold of the strong classifier is adjusted to yield a true positive rate (tpr) close to one.
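The paper relies on the asymmetric AdaBoost of [16] but does not reproduce the algorithm, so the sketch below is only a rough stand-in: a standard discrete AdaBoost loop in which the asymmetry parameter k is approximated by up-weighting the positives once at initialization. This is a simplification, not the scheme of [16]; the helper names train_weak and evaluate are ours.

```python
import numpy as np

def boost_strong_classifier(train_weak, evaluate, labels, n_rounds, k):
    """Boosted linear combination of weak classifiers (Sec. 5.2, simplified).

    train_weak(weights) -> a weak classifier h (e.g. via learn_weak_classifier)
    evaluate(h)         -> 0/1 outputs of h on all training examples
    k                   -> false negatives cost k times more than false positives
    """
    y = labels.astype(int)
    w = np.where(y == 1, float(k), 1.0)     # asymmetry via initial weights (simplification)
    w /= w.sum()
    weak, alphas = [], []
    for _ in range(n_rounds):
        h = train_weak(w)
        out = evaluate(h)
        err = np.clip(w[out != y].sum(), 1e-9, 1.0 - 1e-9)
        alpha = 0.5 * np.log((1.0 - err) / err)             # weak classifier weight
        w *= np.exp(alpha * np.where(out != y, 1.0, -1.0))  # re-weight the examples
        w /= w.sum()
        weak.append(h)
        alphas.append(alpha)
    return weak, alphas
```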
5.3 Learning the Cascade
The cascade is learnt according to Viola and Jones [11]. After the construction of each stage we run the current cascade on a database of randomly selected negative images to acquire a negative training set for the next stage. It is important to use a large set of negative images and in our experiments we used about 300.
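A schematic version of this bootstrapping loop is given below. The helpers train_stage (e.g. the boosting sketch above) and sample_negatives (which scans the background images with the cascade built so far and keeps windows that still pass) are assumed, and the target true positive rate is an illustrative value rather than one taken from the paper.

```python
def train_cascade(positives, background_images, n_stages,
                  train_stage, sample_negatives, target_tpr=0.995):
    """Attentional cascade (Sec. 5.3): each stage is trained against negatives
    that survive all earlier stages, bootstrapped from background images."""
    stages = []                                               # (strong_classifier, threshold) pairs
    negatives = sample_negatives(background_images, stages)   # random windows at first
    for _ in range(n_stages):
        strong = train_stage(positives, negatives)
        # lower the stage threshold until (almost) all positive windows are accepted
        scores = sorted(strong.score(x) for x in positives)   # x = feature values of a window
        threshold = scores[int((1.0 - target_tpr) * len(scores))]
        stages.append((strong, threshold))
        # mine new negatives: background windows that still pass the current cascade
        negatives = sample_negatives(background_images, stages)
        if not negatives:
            break
    return stages
```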
6 Detection

6.1 Sliding Window
The simplest and most direct search strategy is the sliding window approach in scale-space. Since the feature values are computed quickly and we are using a cascaded classifier, this approach yields an efficient detector. Typically we use a multiplicative step-size of 1.25 in scale, so that s_{i+1} = 1.25 · s_i, and an additive step-size of 0.01 · s in position, so that x_{i+1} = x_i + 0.01 · s and y_{i+1} = y_i + 0.01 · s. Non-maximum suppression is used to remove overlapping detections. There are some drawbacks with the sliding window approach: (i) all possible windows have to be evaluated, (ii) the user is required to select a step-size for each search space dimension, (iii) the detector has to be made invariant to translations and scalings of at least half a step-size (typically by translating and scaling all training exemplars) and (iv) the output detections have an error of about half a step-size in each dimension of the search space (if the step-sizes are sufficiently large to yield an efficient detector these errors typically translate to significant errors in the estimated bounding box). Since the feature values f_k represent distances, we can easily compute bounds f_k^(l) and f_k^(u), given a region S in search space, such that f_k^(l) ≤ f_k(I, t, s) ≤ f_k^(u) ∀ (t, s) ∈ S. This can be used for efficient search space culling and should enable detectors with better precision at the same computational cost. We will explore the issue of hierarchical search in future work.
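The scan itself is straightforward; the sketch below uses the step sizes quoted above and leaves edge detection, feature computation and non-maximum suppression to assumed helpers (feature_values, cascade_predict). The minimum window size is our assumption.

```python
import numpy as np

def sliding_window_detect(maps, image_shape, feature_values, cascade_predict,
                          s_min=32.0, scale_step=1.25, pos_step=0.01):
    """Scale-space sliding window scan (Sec. 6.1) with the quoted step sizes."""
    height, width = image_shape
    detections = []
    s = s_min
    while s <= min(height, width):
        step = max(1.0, pos_step * s)                    # additive position step of 0.01 * s
        for ty in np.arange(0.0, height - s, step):
            for tx in np.arange(0.0, width - s, step):
                fv = feature_values(maps, (tx, ty), s)   # all f_k for this window
                if cascade_predict(fv):
                    detections.append((tx, ty, s))
        s *= scale_step                                  # multiplicative scale step of 1.25
    return detections                                    # followed by non-maximum suppression
```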
6.2 Estimating the Aspect Ratio
The detector scans the image over position and scale, but in order to produce a good estimate of the bounding box of a detected object we also need the aspect ratio (which typically varies significantly within an object class). We
estimate the aspect ratio of a detected object by finding one or a few similar training exemplars and taking the average aspect ratio of these exemplars as the estimate. We retrieve similar training exemplars using the weak classifiers that are turned “on” by the detected object. Each weak classifier, h_k, is “on” for a subset of the positive training exemplars, T_k = {x | h_k(x) = 1}, and by requiring several weak classifiers to be “on” we get the intersection of the corresponding subsets: T_{k1} ∩ T_{k2} = {x | h_{k1}(x) = 1 ∧ h_{k2}(x) = 1}. Thus we focus on a successively smaller subset of training exemplars by selecting weak classifiers that fire on the detected object, as illustrated in figure 6. We are searching for a non-empty subset of minimal cardinality and the greedy approach at each iteration is to specify the weak classifier that gives maximal reduction in the cardinality of the “active” subset (under the constraint that the new “active” subset is non-empty).
Fig. 6. Retrieving similar training exemplars for a detected object (shown as a star). a) Weak classifiers h2 and h3 fire on the detection. b) Weak classifier h3 is selected, since it gives a smaller “active” subset than h2 . c) Weak classifier h2 is selected. Since no more weak classifiers can be selected, training exemplars “g” and “h” are returned.
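The greedy retrieval can be written directly in terms of the exemplar subsets T_k. The sketch below assumes these subsets are available as Python sets, one per weak classifier that fires on the detection; the function name is ours.

```python
def retrieve_similar_exemplars(firing_subsets, all_positives):
    """Greedy subset intersection of Sec. 6.2 (cf. Fig. 6).

    firing_subsets -- list of sets T_k, one per weak classifier that fires on the detection
    all_positives  -- set of all positive training exemplars (e.g. their indices)
    """
    active = set(all_positives)
    remaining = list(firing_subsets)
    while remaining:
        # weak classifiers whose subset keeps the active set non-empty
        candidates = [t for t in remaining if active & t]
        if not candidates:
            break
        # pick the one giving maximal reduction in cardinality
        best = min(candidates, key=lambda t: len(active & t))
        if len(active & best) == len(active):   # no further reduction possible
            break
        active &= best
        remaining.remove(best)
    return active   # average the aspect ratios of these exemplars for the estimate
```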
7 Experiments and Dataset
We have evaluated object detection performance on the ETHZ Shape Classes dataset [1]. This dataset is challenging due to large intra-class variation, clutter and varying scales. Several other authors have evaluated their methods on this dataset: in [5] a deformable shape model learnt from images is evaluated, while hand drawn models are used in [1,18–20]. More recent results are [2,21,22], where [2] reports the best performance. Experiments were performed on all classes in the dataset and results are produced as in [5], using 5-fold cross-validation. We build 5 different detectors for each class by randomly sampling 5 subsets of half the images of that class. All other images in the dataset are used for testing. The dataset contains a total of 255 images and the number of class images varies from 32 to 87. Thus, the number of training images will vary from 16 to 43 and the test set will consist of about 200 background images and 16 to 44 images containing occurrences of the target object.
Fig. 7. Detection rate (DR) plotted versus false positives per image (FPPI). The results from the initial experiment are shown by the blue (lower) curves and the results from the experiment using extended training sets are shown by the red (upper) curves.
Quantitative results are presented as the detection rate (the number of true positives divided by the number of occurrences of the target object in the test set) versus the number of false positives per image (FPPI). We prefer using FPPI, instead of precision, as a measure of error rate, since it is not biased by the number of positive and negative examples in the test set. Precision would for example be unsuitable for comparison to methods evaluated using a different cross-validation scheme, since this might affect the number of positive test examples. As in [5], a detection is counted as correct if the detected bounding box overlaps more than 20 % with the ground truth bounding box. Bounding box overlap is defined as the area of intersection divided by the area of union of the bounding boxes.
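For completeness, the overlap criterion can be computed as below; the corner-coordinate box convention is our assumption.

```python
def bounding_box_overlap(a, b):
    """Overlap used for evaluation: area of intersection divided by area of union.
    Boxes are (x_min, y_min, x_max, y_max); a detection counts as correct if the
    overlap with the ground truth bounding box exceeds 0.2."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```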
8 Results
Quantitative results are plotted in figure 7. The results of the initial experiment are shown by the blue (lower) curves. The problem is that there are very few training examples (from 16 to 43). Since the presented method is statistical in nature it is not expected to perform well when training on too small a training set. Therefore we downloaded a set of extra training images from Google Images. These images contained a total of 40 applelogos, 65 bottles, 160 giraffes, 67 mugs and 103 swans, which were used to extend the training sets.
Fig. 8. Example detections (true and false positives) for each class.
The red (upper) curves are the results using the extended training set and the detection performance is now almost perfect. In table 1 we compare the detection performance of our system to previously presented methods. Figure 8 shows some sample detections (true and false positives).
Table 1. Comparison of detection performance. We state the average detection rate at 0.3 and 0.4 FPPI and its standard deviation in parentheses. We compare to the systems of [2,5].
                  A. logos   Bottles    Giraffes    Mugs       Swans
Ours @ 0.3 FPPI:  95.5(3.2)  91.9(4.1)  92.9(1.9)   96.4(1.4)  98.8(2.8)
Ours @ 0.4 FPPI:  95.5(3.2)  92.6(3.7)  93.3(1.6)   97.0(2.1)  100(0)
[5]  @ 0.4 FPPI:  83.2(1.7)  83.2(7.5)  58.6(14.6)  83.6(8.6)  75.4(13.4)
[2]  @ 0.3 FPPI:  95.0       92.9       89.6        93.6       88.2
[2]  @ 0.4 FPPI:  95.0       96.4       89.6        96.7       88.2
Our currently unoptimized MATLAB/mex implementation of the sliding window detector, running (a single thread) on a 2.8 GHz Pentium D desktop computer, requires on the order of 10 seconds for scanning a 640 x 480 image (excluding the edge detection step). We believe that this could be greatly improved by making a more efficient implementation. This algorithm is also very simple to parallelize, since the computations done for one sliding window position are independent of the computations for other positions.
9 Discussion
Viola and Jones constructed their weak classifiers by thresholding single Haar features in order to minimize the computational cost of weak classifier evaluation [16]. In our case, it is also a good idea to put an upper limit on the number of features used by the weak classifiers in the first few stages of the cascade for two reasons: (1) it reduces the risk of overfitting and (2) it improves the speed of the learnt detector. However, a strong classifier built using only single-feature weak classifiers should only be able to achieve low false positive rates for object classes with small intra-class variability. Therefore later stages of the cascade should not limit the number of features used by each weak classifier (rather, the learner should be unconstrained).
10
Future Work
As mentioned, search space culling and hierarchical search strategies will be explored, ideally with improvements in precision and speed as a result. Another straightforward extension of the current work is to use other features in addition to oriented edges. We only need to be able to produce a binary feature map indicating the occurrences of the feature in an image. For example, we could use corners, blobs or other interest points. We could also use the visual words in a visual code book (bag-of-words) as features. This would allow modeling of object appearance and shape in the same framework, but would only be practical for small code books.
Object Class Detection Using Boosted Configurations of Oriented Edges
13
We will investigate thoroughly how the performance of the algorithm is affected by constraining the number of features per weak classifier and we will also investigate the parameter estimation method presented in section 6.2 and the possibility of using it for estimating other parameters of a detected object, like the pose or viewpoint.
11
Conclusions
In this paper we have presented a novel shape-based object class representation. We have shown experimentally that this representation yields very good detection performance on a commonly used database. The advantages of the presented method compared to its competitors are that (1) it is generic and does not make any strong assumptions about, for example, the existence of discriminative parts or the detection of connected edge segments, (2) it is fast, and (3) it is able to represent categories with significant intra-class variation. The method can also be extended to use other types of features in addition to the oriented edge features. The main drawback of the method is that, since its inductive bias is rather weak, it needs a comparatively large training set (at least about 100 training exemplars for the investigated dataset).
Acknowledgement This work was supported by the Swedish Foundation for Strategic Research (SSF) project VINST.
References 1. Ferrari, V., Tuytelaars, T., Van Gool, L.: Object detection by contour segment networks. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 14–28. Springer, Heidelberg (2006) 2. Maji, S., Malik, J.: Object detection using a max-margin. In: Proc. of the IEEE Computer Vision and Pattern Recognition (2009) 3. Gavrila, D.M.: A bayesian, exemplar-based approach to hierarchical shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 4. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509–522 (2002) 5. Ferrari, V., Jurie, F., Schmid, C.: Accurate object detection with deformable shape models learnt from images. In: Proc. of the IEEE Computer Vision and Pattern Recognition (2007) 6. Thuresson, J., Carlsson, S.: Finding object categories in cluttered images using minimal shape prototypes. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 1122–1129. Springer, Heidelberg (2003) 7. Carlsson, S.: Order structure, correspondence, and shape based categories. In: Forsyth, D., Mundy, J.L., Di Ges´ u, V., Cipolla, R. (eds.) Shape, Contour, and Grouping 1999. LNCS, vol. 1681, p. 58. Springer, Heidelberg (1999)
8. Shotton, J., Blake, A., Cipolla, R.: Contour-based learning for object detection. In: Proc. of the International Conference of Computer Vision (2005) 9. Opelt, A., Pinz, A., Zisserman, A.: A boundary-fragment-model for object detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 575–588. Springer, Heidelberg (2006) 10. Freund, Y., Schapire, R.E.: A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14, 771–780 (1999) 11. Viola, P.A., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57, 137–154 (2004) 12. Breu, H., Gil, J., Kirkpatrick, D., Werman, M.: Linear time euclidean distance transform algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 17, 529–533 (1995) 13. Fleuret, F., Geman, D.: Coarse-to-fine face detection. International Journal of Computer Vision 4, 85–107 (2001) 14. Wu, Y.N., Si, Z., Fleming, C., Zhu, S.C.: Deformable template as active basis. In: Proc. of the International Conference on Computer Vision (2007) 15. Danielsson, O., Carlsson, S., Sullivan, J.: Automatic learning and extraction of multi-local features. In: Proc. of the International Conference on Computer Vision (2009) 16. Viola, P.A., Jones, M.J.: Fast and robust classification using asymmetric adaboost and a detector cascade. In: Proc. of Neural Information Processing Systems, pp. 1311–1318 (2001) 17. Fayyad, U.M.: On the Induction of Decision Trees for Multiple Concept Learning. PhD thesis, The University of Michigan (1991) 18. Ravishankar, S., Jain, A., Mittal, A.: Multi-stage contour based detection of deformable objects. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 483–496. Springer, Heidelberg (2008) 19. Zhu, Q., Wang, L., Wu, Y., Shi, J.: Contour context selection for object detection: A set-to-set contour matching approach. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 774–787. Springer, Heidelberg (2008) 20. Schindler, K., Suter, D.: Object detection by global contour shape. Pattern Recognition 41, 3736–3748 (2008) 21. Stark, M., Goesele, M., Schiele, B.: A shape-based object class model for knowledge transfer. In: Proc. of International Conference on Computer Vision (2009) 22. Ommer, B., Malik, J.: Multi-scale object detection by clustering lines. In: Proc. of International Conference on Computer Vision (2009)
Unsupervised Feature Selection for Salient Object Detection
Viswanath Gopalakrishnan, Yiqun Hu, and Deepu Rajan
School of Computer Engineering, Nanyang Technological University, Singapore
Abstract. Feature selection plays a crucial role in deciding the salient regions of an image, as in any other pattern recognition problem. However, the problem of identifying the relevant features that play a fundamental role in the saliency of an image has not received much attention so far. We introduce an unsupervised feature selection method to improve the accuracy of salient object detection. The noisy irrelevant features in the image are identified by maximizing the mixing rate of a Markov process running on a linear combination of various graphs, each representing a feature. The global optimum of this convex problem is achieved by maximizing the second smallest eigen value of the graph Laplacian via semi-definite programming. The enhanced image graph model, after the removal of irrelevant features, is shown to improve the salient object detection performance on a large image database with annotated ‘ground truth’.
1
Introduction
The detection of salient objects in images has been a challenging problem in computer vision for more than a decade. The task of identification of salient parts of an image is essentially that of choosing a subset of the input data which is considered as most relevant by the human visual system (HVS). The ‘relevance’ of such a subset is closely linked to two different task processing mechanisms of the HVS - a stimulus driven ‘bottom-up’ processing and a context dependent ‘top-down’ processing. The information regarding the salient parts in an image finds application in many computer vision tasks like content based image retrieval, intelligent image cropping to adapt to small screen devices, and content aware image and video resizing [1, 5, 20, 21]. Salient region detection techniques include local contrast based approaches modeling the ‘bottom-up’ processing of HVS [15, 25], information theory based saliency computation [4], salient feature identification using discriminative object class features [7], saliency map computation by modeling color and orientation distributions [9] and using the Fourier spectrum of images [11, 14]. Graph based approaches for salient object detection have not been explored much although they have the advantage of capturing the geometric structure of the image when compared to feature-based clustering techniques. The limited major works in this regard include the random walk based methods in [12] and
[8]. In [12] edge strengths are used to represent the dissimilarity between two nodes and the most frequently visited node during the random walk is labeled as the most dissimilar node in a local context. In [8], random walks are performed on a complete and a sparse k-regular graph to capture the global and local properties of the most salient node. In [17] Liu et al. proposed a supervised learning approach in which 1000 annotated images were used to train a salient region detector which was tested on another 4000 images. We use the same database in this work to demonstrate the effectiveness of the proposed method. Feature selection, which is essentially selecting a subset of features from the original feature set (consequently representing an image in lower dimension), is quite important in many classification problems like face detection that involve high dimensional data. There has not been much work reported on unsupervised feature selection in contrast to supervised feature selection as the former is a far more challenging problem due to the lack of any label information. Feature selection methods can be generally classified as filter, embedded and wrapper methods depending on how the feature selection strategy is related to the classification algorithm that follows. He et.al[13] and Mitra et.al [18] discuss unsupervised ‘filter’ methods of feature selection in which the feature selection strategy is independent of the classification algorithm. The problem of simultaneously finding a classification and a feature subset in an unsupervised manner is discussed in the ‘embedded’ feature selection methods in [16] and [26]. Law et.al [16] use a clustering algorithm based on Gaussian Mixture Models along with the feature selection, while Zhao et.al [26] optimize an objective function to find a maximum margin classifier on the given data along with a set of feature weights that favor the features which most ‘respect’ the data manifold. In [24], a ‘wrapper’ strategy for feature selection is discussed in which the features are selected by optimizing the discriminative power of the classification algorithm used. Feature selection plays a vital role in salient object detection as well, like in many other pattern recognition problems. However, choosing the right mixture of features for identifying the salient parts of an image accurately is one problem that has not received much attention so far. For example, graph based methods like [8] give equal weight to all features while the saliency factor may be attributed only to a smaller subset of features. The relevancy of different features in different images is illustrated in Figure 1. The saliency of the image in Figure 1(a) is mainly derived from the color feature, in Figure 1(b) from texture features while the salient parts in Figure 1(c) is due to some combination of color and texture features. Hence it can be concluded that some of the features used in [8] are irrelevant and only add noise to the graph structure of the image thereby reducing the accuracy of the salient object detection. In this context, we propose an unsupervised filter method for identifying irrelevant features to improve the performance of salient region detection algorithms.
Unsupervised Feature Selection for Salient Object Detection
(a)
(b)
17
(c)
Fig. 1. Example images illustrating the relevancy of different features to saliency. The relevant salient features are (a) color (b) texture (c) some combination of color and texture features.
The major contributions of this paper can be summarized as follows: – We relate the problem of identifying irrelevant features for salient object detection to maximizing the mixing rate of a Markov process running on a weighted linear combination of ‘feature graphs’. – A semi-definite programming formulation is used to maximize the second smallest eigen value of the graph Laplacian to maximize the Markov process mixing rate. – The effectiveness of the proposed unsupervised filter method of feature selection (elimination) method is demonstrated by using it in conjunction with a graph based salient object detection method [8] resulting in improved performance on a large annotated dataset. The paper is organized as follows. Section 2 reviews some of the theory related to the graph models, Markov process equilibrium and the mixing rate of Markov processes. In Section 3 we discuss the relation between irrelevant feature identification and maximization of mixing rate of Markov process. The detailed optimization algorithm is also provided in Section 3. Section 4 details the experiments done and the comparative results showing performance improvement. Finally conclusions are given in Section 5.
2
Graph Laplacian and Mixing Rate of Markov Process
In this section we review some of the theory related to graph Laplacians and the mixing rate of a Markov process running on the graph. Consider an undirected graph, G(V, E), where V is the set of n vertices and E is the set of edges in the graph. The graph G is characterized by the affinity matrix A, where

A_{ij} = \begin{cases} w_{ij}, & i \neq j \\ 0, & i = j, \end{cases}    (1)

and w_ij is the weight between node i and node j based on the affinity of features on the respective nodes.
The Laplacian matrix L of the graph G is defined as

L = D - A,    (2)

where

D = \mathrm{diag}\{d_1, d_2, \ldots, d_n\}, \qquad d_i = \sum_j w_{ij},    (3)

and d_i is the degree of node i, i.e. the total weight connected to node i in the graph G. Now, consider a symmetric Markov process running on the graph G(V, E). The state space of the Markov process is the vertex (node) set V, and the edge set E represents the transition strengths. The edge e_ij connecting nodes i and j is associated with a transition rate w_ij, which is the respective weight entry in the affinity matrix A of the graph G(V, E). This implies that the transition from node i to node j is faster when the affinities between the respective nodes are larger. Consequently, a Markov process entering node i at any specific time t will take more time to move into the nodes that possess lesser affinities with node i. The mixing time of a Markov process is the time taken to reach the equilibrium distribution π starting from some initial distribution π(0). Let π(t) ∈ R^n be the n-dimensional vector giving the probability distribution over the n states at time t for a symmetric Markov process starting from any given distribution on the graph. The evolution of π(t) is given by [19]

\frac{d\pi(t)}{dt} = Q\,\pi(t),    (4)

where Q is the transition rate matrix of the Markov process, defined as

Q_{ij} = \begin{cases} q_{ij}, & i \neq j \\ -\sum_{k \neq i} q_{ik}, & i = j. \end{cases}    (5)

q_ij is the transition rate, or rate of going from state i to state j. As explained previously, we equate this rate to the affinity between nodes i and j, w_ij. Hence the Laplacian matrix L and the transition rate matrix Q are related as Q = -L. Consequently, the solution for the distribution π(t) at time t starting from the initial condition π(0) is given as

\pi(t) = e^{-tL}\,\pi(0).    (6)
The Laplacian L is a symmetric and positive semi-definite matrix. The eigen values of L are such that λ1 = 0 ≤ λ2 ≤ · · · ≤ λn. λ2 is also known as the algebraic connectivity of the graph G. If the graph G is connected, i.e. any node of the graph is accessible from any other node through the edge set E, then λ2 > 0 [6]. The probability distribution π(t) will tend to the equilibrium distribution π as t → ∞. The total variation distance (maximum difference
in probabilities assigned by two different probability distributions to the same event) between the distributions π(t) and π is upper bounded via λ2 as [22]

\sup \|\pi(t) - \pi\|_{tv} \le \tfrac{1}{2}\, n^{0.5}\, e^{-\lambda_2 t},    (7)

where ‖·‖_tv is the total variation distance. Equation (7) implies that a larger value of λ2 will result in a tighter bound on the total variation distance between π(t) and π, and consequently in a faster mixing of the Markov process. The second smallest eigen value λ2 can be expressed as a function of the Laplacian matrix L as [3]

\lambda_2(L) = \inf \{\, x^T L x \;|\; \|x\| = 1,\ \mathbf{1}^T x = 0 \,\},    (8)

where 1 is the vector with all ones. As L1 = 0, 1 is the eigen vector of the smallest eigen value, λ1 = 0, of L. Equation (8) implies that λ2 is a concave function of L, as it is the pointwise infimum of a family of linear functions of L. Hence maximization of λ2 as a function of the Laplacian matrix L is a convex optimization problem for which a global maximum is guaranteed.
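As a concrete illustration of the quantities above, the short Python sketch below (our own, not part of the paper, whose implementation is in MATLAB) builds the Laplacian from an affinity matrix and reads off the algebraic connectivity λ2 with a dense eigen-solver.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric affinity matrix A (Eqs. 2-3)."""
    D = np.diag(A.sum(axis=1))
    return D - A

def algebraic_connectivity(A):
    """Second smallest eigen value of L; positive iff the graph is connected."""
    eigvals = np.linalg.eigvalsh(laplacian(A))  # sorted ascending, first is ~0
    return eigvals[1]

# Tiny example: two 3-node clusters joined by one weak edge (a 'bottleneck').
A = np.array([[0, 1, 1, .01, 0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0,   0, 0],
              [.01, 0, 0, 0, 1, 1],
              [0, 0, 0, 1,   0, 1],
              [0, 0, 0, 1,   1, 0]], dtype=float)
print(algebraic_connectivity(A))  # small value: the bottleneck slows mixing
```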
3
Feature Graphs and Irrelevant Features
We now relate the maximization of the Markov process mixing rate to the identification of irrelevant features for salient object detection. In salient region detection methods like [8, 12] that use multiple features, all features are given equal importance. Our method for unsupervised feature selection aims to identify the important features by representing the image with several ‘feature graphs’, which implies that each feature is encoded into a separate image graph model. The nodes in the ‘feature graphs’ represent 8 × 8 patches in the image, similar to [8]. The affinity between nodes in a feature graph depends on the respective features. If there are M features available for every node in the image, we have M different feature graphs in which the affinity between node i and node j for the mth feature is defined as in equation (1) with

w_{ij}^m = e^{-\|f_i^m - f_j^m\|^2},    (9)

where f_i^m and f_j^m are the values of the mth feature on nodes i and j respectively. It is evident that the connectivity in the M feature graphs will be different. The relevant features, or the features that contribute to the salient parts of the image, will have feature graphs with ‘bottlenecks’. The term ‘bottleneck’ implies a large group of edges which show very low affinity between two groups (clusters) of nodes. Such bottlenecks arise from the fact that the feature values for a certain feature on the salient part of an image will be different from the feature values on a non-salient part like the background. For example, in Figure 2(a), if the feature graph is attributed to the color feature, then the nodes connected by thick lines (indicating strong affinity) fall into two groups, each having a different color value. One group might belong to the salient object and the other to the
background, and both of them are connected by thin edges (indicating weak affinity). This group of thin edges signifies the presence of a bottleneck. This phenomenon of clustering of nodes does not occur in feature graphs attributed to irrelevant features, and hence such graphs will not have bottlenecks. This is illustrated in Figure 2(b) for the same set of nodes as in Figure 2(a), but for a different feature. Here the affinities for the irrelevant feature vary smoothly (for example, consider the shading variations of color in Figure 1(b)) and hence do not result in a bottleneck in its respective feature graph. Therefore, a Markov process running on an irrelevant feature graph mixes faster than a Markov process on a relevant feature graph, since in the relevant feature graph the transition time between two groups of nodes connected by very weak edge strengths will always be high. Now consider all possible graph realizations resulting from a linear combination of the feature graphs. When we maximize the Markov mixing rate of such a linear combination, what we are essentially doing is identifying the irrelevant features, which do not form any kind of bottleneck.
Fig. 2. Feature graphs with the same set of nodes showing (a) bottleneck between two groups of nodes (b) no bottleneck. The thicker edges indicate larger affinity between nodes.
3.1
Irrelevant Feature Identification
We adopt a semi-definite programming method similar to [22] for maximizing the second smallest eigen value of the Laplacian. However, in [22] a convex optimization method is proposed to find the fastest mixing Markov process among all possible edge weights of the graph realization G(V, E) to give a generic solution. In our proposed method, searching the entire family of graph realizations that gives the fastest mixing rate will not give us the required specific solution. Hence we restrict our search to all possible linear combinations of the feature graphs. The maximization problem of the second smallest eigen value is formulated as

maximize   \lambda_2(L)
subject to   L = \sum_k a_k L_k, \quad \sum_k a_k = 1, \quad a_k \ge 0 \ \forall k,    (10)
where L_k is the Laplacian matrix of the kth feature graph and a_k is the weight associated with the respective feature graph. Now we establish that the constraints on the matrix L in Equation (10) make it a valid candidate for a Laplacian matrix.

Lemma 1. If L_1, L_2, ..., L_M are Laplacian matrices of M feature graphs and if L = \sum_k a_k L_k under the conditions \sum_k a_k = 1 and a_k \ge 0, then L is also a Laplacian matrix.

Proof. \sum_j L_{ij} = \sum_j \sum_k a_k L^k_{ij} = \sum_k a_k \sum_j L^k_{ij} = \sum_k a_k \cdot 0 = 0. Also, L_{ij} = \sum_k a_k L^k_{ij} is non-positive for i \neq j, as a_k \ge 0 and L^k_{ij} \le 0 for all k and for i \neq j. Similarly, all diagonal elements of L are non-negative, proving that L represents the Laplacian matrix of some graph resulting from the linear combination of feature graphs.

The optimization problem in Equation (10) can be equivalently expressed as

maximize   s
subject to   sI \preceq L + \beta \mathbf{1}\mathbf{1}^T,
             L = \sum_k a_k L_k, \quad \sum_k a_k = 1, \quad a_k \ge 0 \ \forall k, \quad \beta > 0,    (11)

where the symbol \preceq represents matrix inequality. The equivalent optimization problem maximizes a variable s subject to a matrix inequality constraint and some linear equality constraints. The term \beta \mathbf{1}\mathbf{1}^T removes the space of matrices with the eigen vector 1 and eigen value 0 from the space of matrices L [2]. Hence maximizing the second smallest eigen value of L is equivalent to maximizing the smallest eigen value of L + \beta \mathbf{1}\mathbf{1}^T.

Lemma 2. The maximum s that satisfies the matrix inequality constraint in Equation (11) is the minimum eigen value of the matrix variable L + \beta \mathbf{1}\mathbf{1}^T.

Proof. Let P = L + \beta \mathbf{1}\mathbf{1}^T. The smallest eigen value of P can be expressed as \lambda_{min} = \inf_x (x^T P x / x^T x). Assume P \prec \lambda_{min} I. This implies that P - \lambda_{min} I is negative definite, so x^T (P - \lambda_{min} I) x < 0 for any x, i.e. x^T P x < \lambda_{min}\, x^T x, i.e. x^T P x / x^T x < \lambda_{min}, which is false since the minimum value of x^T P x / x^T x is \lambda_{min}. Hence P \succeq \lambda_{min} I, and the maximum s that satisfies the constraint P \succeq sI is \lambda_{min}.

The inequality constraint in Equation (11) can also be expressed as

sI - (L + \beta \mathbf{1}\mathbf{1}^T) \preceq 0.    (12)

Since L = \sum_k a_k L_k, the matrix inequality in Equation (12) can be expressed as

sI + a_1(-L_1) + a_2(-L_2) + \cdots + a_M(-L_M) + \beta(-\mathbf{1}\mathbf{1}^T) \preceq 0,    (13)
which is a standard Linear Matrix Inequality (LMI) constraint in an SDP problem formulation [3], as all the matrices in Equation (13) are symmetric. Equation (11) can be seen as the minimization of a linear function subject to an LMI and linear equality constraints, and can be effectively solved by standard SDP solvers like CVX or SeDuMi. Our proposed algorithm for finding the weights of different features can be summarized as follows:
1) Calculate the feature graph of every feature, with the n × n affinity matrix A_m for the mth feature given by Equations (9) and (1).
2) Calculate the Laplacian matrices L_1, L_2, ..., L_M of the M features according to Equations (2) and (3).
3) Formulate the SDP problem in Equation (11) with input data L_1, L_2, ..., L_M and variables s, β, L, a_1, a_2, ..., a_M, and solve for a_1, a_2, ..., a_M using a standard SDP solver.
4) Find the feature weights as r = 1 − a and normalize the weights.
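The SDP above is small (M + 2 scalar variables and one n × n LMI), so a generic solver handles it directly. The sketch below is our own Python illustration with cvxpy, not the authors' MATLAB/CVX code; it assumes the feature Laplacians have already been built from Equations (1)-(3) and (9), and it models β as merely non-negative.

```python
import numpy as np
import cvxpy as cp

def feature_weights(laplacians):
    """Solve Eq. (11): maximize s subject to sI <= sum_k a_k L_k + beta*11^T,
    sum_k a_k = 1, a_k >= 0.  Returns the normalized weights r = 1 - a (step 4)."""
    M = len(laplacians)
    n = laplacians[0].shape[0]
    a = cp.Variable(M, nonneg=True)
    s = cp.Variable()
    beta = cp.Variable(nonneg=True)
    L = sum(a[k] * laplacians[k] for k in range(M))
    ones = np.ones((n, n))
    G = L + beta * ones - s * np.eye(n)
    constraints = [cp.sum(a) == 1,
                   0.5 * (G + G.T) >> 0]   # LMI of Eqs. (12)-(13), symmetrised
    cp.Problem(cp.Maximize(s), constraints).solve()
    r = 1.0 - a.value                      # irrelevant features get small weight
    return r / r.sum()
```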
4
Experiments
The effectiveness of the proposed algorithm for unsupervised feature selection is experimentally verified by using it in conjunction with the random walk based method for salient object detection described in [8], where a random walk is performed on a simple graph model for the image with equal weight for all features. The proposed method will provide the feature weights according to feature relevancy, resulting in a much more accurate graph model in the context of salient object detection. The final image graph model is decided by the graph affinity matrix A, which is evaluated as

A = \sum_i r_i A_i    (14)
where r1 , r2 , ...rM are the final feature weights evaluated in Section 3.1 and A1 , A2 , ...AM are the feature affinity matrices. The proposed feature selection method falls in the class of unsupervised filter method of feature selection where the feature selection is independent of the classification algorithm that follows. Hence we compare the proposed method with two unsupervised filter methods of feature selection explained in [13] and [18].The Laplacian score based feature selection method [13] assigns a score to every feature denoting the extent to which each feature ‘respects’ the data manifold. In [18], an unsupervised filter method is proposed to remove redundant features by evaluating a feature correlation measure for every pair of features. The features are clustered according to their similarity and the representative feature of each set is selected starting from the most dissimilar feature set. However, as our proposed method weighs different features according to their importance, we use the feature correlation measure in [18] to weigh the features while comparing with the proposed method. We used the same dataset of 5000 images used in [8] which was originally provided by [17]. The dataset contains user annotations as rectangular boxes
around the most salient region in the image. The performance of the algorithm is objectively evaluated with the ground truth available in the data set. We use the same seven dimensional feature set used in [8]. The features are color (Cb and Cr) and texture (orientation complexity values at five different scales). According to our algorithm, there will be seven different feature graphs (transition matrices), one for each feature. The optimal weights of features are obtained as explained in Section 3.1 with semi-definite programming in Matlab using the cvx package [10]. Using the optimal feature weights, a new graph model for the image is designed according to Equation (14), which is used with [8] to detect the salient regions.
Fig. 3. Comparison of saliency maps for [8] with the proposed feature selection. (a), (d) Original images. (b), (e) Saliency maps using [8]. (c), (f) Saliency maps of [8] with the proposed feature selection.
Figure 3 shows saliency maps for some example images for which the original algorithm in [8] has produced wrong or only partially correct saliency maps. The original images are shown in Figures 3(a) and 3(d); Figures 3(b) and 3(e) show the saliency maps generated by [8], in which the image graph model is designed with equal importance to all features, whether relevant or irrelevant. Figures 3(c) and 3(f) show the saliency maps resulting from the proposed feature selection method used with [8]. It can be seen that the saliency maps are more accurate in Figures 3(c) and 3(f), where the features are weighted according to their relevance. For example, consider the first image in Figure 3(a). The color feature plays a major role in saliency for this image and the proposed algorithm
Table 1. F-measure comparison for different methods

Method                                                    F-measure
Random walk method [8]                                    0.63
[8] with Laplacian score [13] based feature selection     0.68
[8] with correlation based feature selection [18]         0.68
[8] with proposed unsupervised feature selection          0.70
detects the salient region more accurately by giving more weightage to color feature. On the contrary, for the eagle image in the second row of 3(d), the color features do not have much role in deciding the salient region and the proposed algorithm gets a more accurate saliency map by giving more priority to the texture feature. In most of the real life images it is some specific mixture of various features ( e.g. a combination of color and orientation entropy feature at some scale) that proves to be relevant in the context of salient object detection. The proposed algorithm intends to model this rightly weighted combination of features as can be testified from the results for other images in Figure 3. It has to be also noted that, though our proposed method assigns soft weights to different features according to their relevance, a hard decision on the features as in a selection of a feature subset can be made by applying some heuristics to the soft weights (for example, select the minimum number of features that satisfies 90% of the total soft weights assigned). The ground truth available for the dataset allows us to objectively compare various algorithms using precision, recall and F-measure. Precision is calculated as ratio of the total saliency in the saliency map captured inside the user annotated rectangle (sum of intensities inside user box) to the total saliency computed for the image (sum of intensities for full image). Recall is calculated as the ratio of the total saliency captured inside the user annotated window to the area of the user annotated window. F-measure, which is the harmonic mean of precision and recall, is the final metric used for comparison. Table 1 lists the F-measure computed for various feature selection methods on the entire dataset. We compare the proposed unsupervised feature selection algorithm with the Laplacian score based feature selection algorithm [13] and the correlation based feature selection algorithm [18]. It can be seen from the table that the proposed method significantly improves the performance of the random walk method and is better than the methods in [13] and [18]. It has to be noted that the methods proposed by [17] and [23] claims better fmeasure than the feature selection combined graph based approach on the same data set. However [17] is a supervised method in which 1000 annotated images were used to train a salient region detector which was tested on the rest of 4000 images. In [23], after obtaining the saliency map, a graph cut based segmentation algorithm is used to segment the image. Some segments with higher saliency are combined to form final saliency map. The accuracy of [23] depends heavily on the accuracy of segmentation algorithm and on the techniques used to combine different segments to obtain the final salient object. Hence as an unsupervised
and independent method (like [25] and [11]) that provides a good f-measure on a large dataset, the graph based approach is still attractive and demands attention for future enhancements.
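The precision, recall and F-measure defined above are straightforward to reproduce given a saliency map and the annotated rectangle. The following Python sketch is our own illustration, not the authors' code; it follows the definitions literally, with the rectangle given as (x_min, y_min, x_max, y_max) in pixel coordinates (our assumption) and saliency intensities assumed normalised to [0, 1] so that recall stays in [0, 1].

```python
import numpy as np

def saliency_scores(saliency_map, box):
    """Precision, recall and F-measure as defined in the text.

    saliency_map: 2-D array of non-negative saliency intensities.
    box: user-annotated rectangle (x_min, y_min, x_max, y_max).
    """
    x0, y0, x1, y1 = box
    inside = saliency_map[y0:y1, x0:x1].sum()   # saliency captured in the box
    total = saliency_map.sum()                  # saliency over the full image
    box_area = (x1 - x0) * (y1 - y0)
    precision = inside / total if total > 0 else 0.0
    recall = inside / box_area if box_area > 0 else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)   # harmonic mean of P and R
    return precision, recall, f
```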
5
Conclusion
We have proposed an unsupervised feature selection method for salient object detection by identifying the irrelevant features. The problem formulation is convex and is guaranteed to achieve the global optimum. The proposed feature selection method can be used as a filter method to improve the performance of salient object detection algorithms. The power of the proposed method is demonstrated by using it in conjunction with a graph based algorithm for salient object detection, where it yields improved performance.
References 1. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Trans. Graph. 26(3) (2007) ISSN: 0301–0730 2. Boyd, S.: Convex Optimization of Graph Laplacian Eigenvalues. In: Proceedings International Congress of Mathematicians, vol. 3, pp. 1311–1319 (2006) 3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 4. Bruce, N.D., Tsotsos, J.K.: Saliency Based on Information Maximization. In: NIPS, pp. 155–162 (2005) 5. Chen, L.Q., Xie, X., Fan, X., Ma, W.Y., Zhang, H.J., Zhou, H.Q.: A visual attention model for adapting images on small displays. Multimedia Syst. 9, 353–364 (2003) 6. Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society, Providence (1997) 7. Gao, D., Vasconcelos, N.: Discriminant Saliency for Visual Recognition from Cluttered Scenes. In: NIPS, pp. 481–488 (2004) 8. Gopalakrishnan, V., Hu, Y., Rajan, D.: Random walks on graphs to model saliency in images. In: CVPR, pp. 1698–1705 (2009) 9. Gopalakrishnan, V., Hu, Y., Rajan, D.: Salient Region Detection by Modeling Distributions of Color and Orientation. IEEE Transactions on Multimedia 11(5), 892–905 (2009) 10. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming (web page and software) (2009), http://stanford.edu/~ boyd/cvx 11. Guo, C., Ma, Q., Zhang, L.: Spatio-temporal Saliency Detection Using Phase Spectrum of Quaternion Fourier Transform. In: CVPR (2008) 12. Harel, J., Koch, C., Perona, P.: Graph-Based Visual Saliency. In: NIPS, pp. 545– 552 (2006) 13. He, X., Cai, D., Niyogi, P.: Laplacian Score for Feature Selection. In: NIPS (2002) 14. Hou, X., Zhang, L.: Saliency Detection: A Spectral Residual Approach. In: CVPR (2007) 15. Itti, L., Koch, C., Niebur, E.: A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
16. Law, M.H.C., Figueiredo, M.A.T., Jain, A.K.: Simultaneous Feature Selection and Clustering Using Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell. (2004) 17. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to Detect A Salient Object. In: CVPR (2007) 18. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised Feature Selection Using Feature Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 301–312 (2002) 19. Norris, J.: Markov Chains. Cambridge University Press, Cambridge (1997) 20. Santella, A., Agrawala, M., DeCarlo, D., Salesin, D., Cohen, M.F.: Gaze-based interaction for semi-automatic photo cropping. In: CHI, pp. 771–780 (2006) 21. Stentiford, F.: Attention based Auto Image Cropping. In: The 5th International Conference on Computer Vision Systems (2007) 22. Sun, J., Boyd, S., Xiao, L., Diaconis, P.: The Fastest Mixing Markov Process on a Graph and a Connection to a Maximum Variance Unfolding Problem. SIAM Review 48, 681–699 (2004) 23. Valenti, R., Sebe, N., Gevers, T.: Image Saliency by Isocentric Curvedness and Color. In: ICCV (2009) 24. Volker, R., Tilman, L.: Feature selection in clustering problems. In: NIPS (2004) 25. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Network 19, 1395–1407 (2006) 26. Zhao, B., Kwok, J., Wang, F., Zhang, C.: Unsupervised Maximum Margin Feature Selection with Manifold Regularization. In: CVPR (June 2009)
MRF Labeling for Multi-view Range Image Integration
Ran Song1, Yonghuai Liu1, Ralph R. Martin2, and Paul L. Rosin2
1 Department of Computer Science, Aberystwyth University, UK
{res,yyl}@aber.ac.uk
2 School of Computer Science & Informatics, Cardiff University, UK
{Ralph.Martin,Paul.Rosin}@cs.cardiff.ac.uk
Abstract. Multi-view range image integration focuses on producing a single reasonable 3D point cloud from multiple 2.5D range images for the reconstruction of a watertight manifold surface. However, registration errors and scanning noise usually lead to a poor integration and, as a result, the reconstructed surface cannot have topology and geometry consistent with the data source. This paper proposes a novel method cast in the framework of Markov random fields (MRF) to address the problem. We define a probabilistic description of a MRF labeling based on all input range images and then employ loopy belief propagation to solve this MRF, leading to a globally optimised integration with accurate local details. Experiments show the advantages and superiority of our MRF-based approach over existing methods.
1
Introduction
3D surface model reconstruction from multi-view 2.5D range images has a wide range of applications in many fields such as reverse engineering, CAD, medical imagery and the film industry. Its goal is to estimate a manifold surface that approximates an unknown object surface using multi-view range images, each of which essentially represents a sample of points in 3D Euclidean space, combined with knowledge about the scanning resolution and the measurement confidence. These samples of points are usually described in local, system centred, coordinate systems and cannot offer a full coverage of the object surface. Therefore, to build up a complete 3D surface model, we usually need to register a set of overlapped range images into a common coordinate frame and then integrate them to fuse the redundant data contained in overlapping regions while retaining enough data to sufficiently represent the correct surface details. However, achieving both is challenging due to the ad hoc nature of the problem. On the one hand, registered range images are actually 3D unstructured point clouds downgraded from the original 2.5D images. On the other hand, scanning noise such as unwanted outliers and data loss typically caused by self-occlusion, large registration errors and connectivity relationship loss among sampled points in acquired data often lead to a poor reconstruction. As a result, the reconstructed surface may include holes, false connections, thick and non-smooth or over-smooth patches, and artefacts.
Hence, a good integration should be robust to registration errors and scanning noise introduced in the stages of registration and data acquisition. Once multiple registered range images have been fused into a single reasonable point cloud, many techniques [1–4] can be employed to reconstruct a watertight surface.
2
Related Work
Existing integration methods can be classified into four categories: volumetric method, mesh-based method, point-based method and clustering-based method. The volumetric method [5–8] first divides the space around objects into voxels and then fuse the data in each voxel. But the comparative studies [9, 10] show they are time-consuming, memory-hungry and not robust to registration errors and scanning noise. The mesh-based method [11–14] first employs a step discontinuity constrained triangulation and then detects the overlapping regions between the triangular meshes derived from successive range images. Finally, it reserves the most accurate triangles in the overlapping regions and reconnects all remaining triangles subject to a certain objective function such as maximising the product of the interior angles of triangles. However, since the number of triangles is usually much larger than that of the sampled points, the mesh-based methods are computationally expensive. Thus some of the mesh-based methods just employ a 2D triangulation in the image plane to estimate the local surface connectivity in the first step as computation in a 2D sub-space is more efficient. But projection from 3D to 2D may lead to ambiguities if the projection is not injective. The mesh-based methods are thus highly likely to fail in non-flat regions where no unique projection plane exists. This speeded-up strategy cannot deal with 3D unstructured point clouds either. The point-based method [15, 16] produces a set of new points with optimised locations. Due to the neglect of local surface topology, its integration result is often over-smooth and cannot retain enough surface details. The latest clustering-based method [9, 10] employs classical clustering methods to minimise the objective dissimilarity functions. It surpasses previous methods since it is more robust to scanning noise and registration errors. Nonetheless, the clustering, which measures Euclidean distances to find closest centroids, does not consider local surface details, leading to some severe errors in non-flat areas. For instance, in Fig. 1, although point A is closer to point B and thus the clustering-based method will wrongly group them together, we would rather group A with C or D to maintain the surface topology.
Fig. 1. Local topology has a significant effect on the point clustering in non-flat areas
To overcome the drawbacks of the existing integration methods, we propose a novel MRF-based method. We first construct a network by a point shifting strategy in section 3. In section 4, the MRF labeling is then configured on this network, where each node is formally labeled with an image number. In section 5, we employ loopy belief propagation (LBP) to solve the MRF and find the optimal label assignment. Each node will then simply select its closest point from the image with its assigned image number. These selected points, directly from the original data, are used for surface reconstruction. The integration achieves a global optimisation as the MRF labeling considers all image numbers for each node. Also, according to the experimental results shown in section 6, it effectively preserves local surface details since, in the output point cloud, the points representing a local patch of the reconstructed surface are usually from the same input image due to the neighbourhood consistency of MRF. More importantly, our method can also cope with the more general input of multiple 3D unstructured point clouds.
3
MRF Network Construction for Integration
MRF describes a system based on a network or a graphical model and allows many features of the system of interest to be captured by simply adding appropriate terms representing contextual dependencies into it. Once the network has been constructed, contextual dependencies can be cast in the manipulations within a neighbourhood. In a MRF network, the nodes are usually pixels. For example, in [17–20], the network is the 2D image lattice and 4 neighbouring pixels are selected to form a neighbourhood for each pixel. But it is difficult to define a network for multiple registered range images since the concept ‘pixel’ does not exist. The simplest method is to produce a point set by k-means clustering and the neighbours of a point can be found by Nearest Neighbours (NNs) algorithm. Since both k-means clustering and NNs are based only on Euclidean distance, in such a network neither the delivery of surface topological information nor the compensation for pairwise registration errors can be achieved. In this paper, we propose a novel scheme to produce a MRF network Inet . Given a set of consecutive range images I1 , I2 , . . . , Im , we first employ the pairwise registration method proposed in [21] to obtain a transform H12 mapping I1 into the coordinate system of I2 . The transformed image I1 and the reference image I2 supply not only the redundant surface information (overlapping area) but also the new surface information (non-overlapping area) from their own viewpoints for the fused surface. To integrate I1 and I2 , the overlapping area and non-overlapping area have to be accurately and efficiently detected. In the new scheme, we declare that a point from one viewpoint belongs to the overlapping area if its distance to its corresponding point from the other viewpoint is within a threshold, otherwise it belongs to the non-overlapping area. The corresponding point is found by a closest point search with k-D tree speedup. The threshold is set as 3R where R is the scanning resolution of the input range images. After the overlapping area detection, we initialise Inet as: Inet = Snon−overlap ,
Snon−overlap = S1 + S2
(1)
where S1, from I1, and S2, from I2, are the points in the non-overlapping area. Then, each point in the overlapping area is shifted by half of its distance to its corresponding point, along its normal N and towards its corresponding point:

P' = P + 0.5\, d \cdot \vec{N}, \qquad d = \Delta \vec{P} \cdot \vec{N}, \qquad \Delta \vec{P} = P_{cor} - P,    (2)

where P_cor is the corresponding point of P. After such a shift, the corresponding points are closer to each other. For each point shifted from a point in the overlapping region of the reference image I2, a sphere with a radius r = M × R (M is a parameter controlling the density of the output point cloud) is defined. If some points fall into this sphere, then their original points, without shifting, are retrieved. Their averages Soverlap are then computed and Inet is updated as

I_{net} = S_{non\text{-}overlap} + S_{overlap}.    (3)
So, this point shifting (1) compensates the pairwise registration errors, (2) does not change the size of overlapping area as the operation of points is along the normal, and (3) still keeps the surface topology as the shift is along the normal. Now, we consider the third input image I3 . We map the current Inet into the coordinate system of I3 through the transformation H23 . Similarly, we then do the overlapping area detection between Inet transformed from Inet and the current reference image I3 . We still update Inet by Eq. (3). In this update, Snon−overlap contains the points from Inet and I3 in non-overlapping regions and Soverlap are produced by the aforementioned point shifting strategy. We keep running this updating scheme over all input range images. The final Inet will be consistent with the coordinate system of the last input image Im . Although it is not usual, if some points in Inet only appear in Snon−overlap , we delete them from Inet . Fig. 2 shows 18 range images of a bird mapped into a single global coordinate system and the MRF network Inet produced by our new scheme. In this paper, for clarity, we use surfaces instead of point clouds when we show range data. The triangulation method employed here for surface reconstruction is the ball-pivoting algorithm [1]. In Fig. 2, it can be seen that the registered images contain a massive amount of noise such as thick patches, false connections and fragments caused by poor scanning and registration errors. The point shifting scheme compensates the errors a bit whereas the resultant point set Inet is still noisy, leading to an over-smoothed surface without enough local details. But it offers us a roughly correct manifold for estimating the ground truth surface. We also need to define a neighbourhood for each node in Inet , consistent with the MRF theory. Instead of performing a NNs algorithm to search the neighbours for a point i in such a network, we find the neighbouring triangles of i in the triangular mesh. The vertices of these triangles, excluding i, are defined as the neighbours of i. The collection of all neighbours of i is defined as the neighbourhood of i, written as N (i). In this work, to strengthen the neighbourhood consistency, we define a larger neighbourhood N (i) consisting of neighbours and neighbours’ neighbours as shown in Fig. 3. Doing so is allowed since the condition i ∈ N (j) ⇔ j ∈ N (i) is still satisfied. The superiority of this definition over the NNs method is that it reflects the surface topology and is parameter-insensitive.
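A minimal sketch of the overlap test and point shifting described above, using scipy's k-D tree, is given below. This is our own Python illustration under the stated 3R threshold, not the authors' implementation, and it assumes per-point normals are already available.

```python
import numpy as np
from scipy.spatial import cKDTree

def shift_overlap(points_a, normals_a, points_b, R):
    """Split points_a into overlap / non-overlap w.r.t. points_b and shift
    overlapping points halfway towards their correspondences along the normal."""
    tree = cKDTree(points_b)
    dists, idx = tree.query(points_a)            # closest-point search
    overlap = dists < 3.0 * R                    # 3R threshold from the text
    delta = points_b[idx[overlap]] - points_a[overlap]
    d = np.einsum('ij,ij->i', delta, normals_a[overlap])   # Eq. (2): d = dP . N
    shifted = points_a[overlap] + 0.5 * d[:, None] * normals_a[overlap]
    return shifted, points_a[~overlap]           # shifted overlap, non-overlap set
```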
Fig. 2. Left: 18 range images are registered (Please note that a patch from a certain range image is visible if and only if it is the outermost surface); Right: The surface representing point set Inet produced by point shifting is usually noisy and over-smooth
Fig. 3. Left: A neighbourhood and the normalised normal vectors attached to the neighbours. Right: the expanded neighbourhood used in this work. Please note that the number of neighbours of each point in such a network can vary.
4
MAP-MRF Labeling for Integration
We denote s = {1, . . . , n} representing the points in Inet and define the label assignment x = {x1, . . . , xn} to all points as a realisation of a family of random variables defined on s. We also define the feature space that describes the points and understand it as a random observation field with its realisation y = {y1, . . . , yn}. Finally, we denote a label set L = {1, . . . , m}, where each label corresponds to an image number. Thus, we have {xi, yj} ∈ L, {i, j} ∈ s. An optimal labeling should satisfy the maximum a posteriori probability (MAP) criterion [22], which requires the maximisation of the posterior probability P(x|y). Considering the Bayes rule, this maximisation can be written as

x^* = \arg\max_x P(x|y) = \arg\max_x P(x)\, p(y|x).    (4)
In a MRF, the prior probability of a certain value at any given point depends only upon neighbouring points. This Markov property leads to the Hammersley-Clifford theorem [22]: the prior probability P(x) satisfies a Gibbs distribution with respect to a neighbourhood N, expressed as P(x) = Z^{-1} × e^{-U(x)/Q}, where Z is a normalising constant and U(x) is the prior energy. Q, a global control parameter, is set to 1 in most computer vision and graphics problems [22]. Let p(y|x) be expressed in the exponential form p(y|x) = Z_y^{-1} × e^{-U(y|x)}, where U(y|x) is the likelihood energy. Then P(x|y) is a Gibbs distribution
P(x|y) = Z_U^{-1} \times e^{-(U(x)+U(y|x))}.    (5)
The MAP-MRF problem is then converted to a minimisation problem for the posterior energy U(x|y): x^* = \arg\min_x U(x|y) = \arg\min_x \{U(x) + U(y|x)\}. Under the MRF assumption that the observations are mutually independent, U(y|x) can be computed as the sum of the likelihood potentials V(y_i|x_i) at all points. In this work, we estimate V(y_i|x_i) by a truncation function:

U(y|x) = \sum_{i\in s} V(y_i|x_i) = \sum_{i\in s} \sum_{y_i \in L\setminus x_i} \min(D_i(y_i, x_i), F),    (6)
where F is a constant, L \setminus x_i denotes the labels in L other than x_i, and

D_i(y_i, x_i) = \| C_i(y_i) - C_i(x_i) \|    (7)
is a distance function, where C_i(l), l ∈ L, denotes i's closest point in the lth input image. In this way, we convert the estimation of the likelihood potential of different labels at point i (or the single label cost) into the measurement of the distances between i's closest points from the input images with different labels. The truncation parameter F plays an important role here. It eliminates the effect from the input images which do not cover the area around i. The prior energy U(x) can be expressed as the sum of clique energies:

U(x) = \sum_i V_1(x_i) + \sum_i \sum_{j\in N(i)} V_2(x_i, x_j) + \sum_i \sum_{j\in N(i)} \sum_{k\in N(i)} V_3(x_i, x_j, x_k) + \cdots    (8)
We only consider the unary clique energy and the binary one. Here, the single-point clique potentials V_1(x_i) are set to 0, as we have no preference for which label should be better. We hope that in the output point cloud neighbouring points are from the same input image, so that the local surface details can be well preserved. Therefore, we utilize the neighbourhood consistency of MRF to ‘discourage’ the routine from assigning different labels to two neighbouring points:

U(x) = \sum_i \sum_{j\in N(i)} V_2(x_i, x_j) = \sum_i \sum_{j\in N(i)} \lambda A_{ij}(x_i, x_j),    (9)
j∈N (i)
i
j∈N (i)
where λ is a weighting parameter and here we use the Potts model [23] to measure the clique energy (or the pairwise label cost): 1, xi = xj , Aij (xi , xj ) = (10) 0, otherwise Once we find the optimal x∗ which assigns a label xi to point i by minimising the posterior energy, i will be replaced with its closest point from the input image with the number xi (xi th image). Please note that in this step we do not need to run a closest point search again as this point Ci (xi ) has already been searched for (see Eq. (7)). The final output of the integration is thus a single point cloud composed of points directly from different registered input range images.
MRF Labeling for Multi-view Range Image Integration
33
Fig. 4. Left: The idea of point shifting to generate Inet . Right: The illustration of a 2-image integration based on the MAP-MRF labeling.
The labeling is actually a segmentation for the point set s. Due to Eq. (9) and (10), neighbouring points in s tend to be labeled with the same image number. The surface reconstructed from the output point cloud will consist of patches directly from different registered input images. Therefore, one advantage of the new integration method is the preservation for surface details, which avoids the error illustrated in Fig. 1. Another significant advantage is its high robustness to possible moving objects. The image patches containing moving objects are less probably selected. On the one hand, the single label cost for labeling the points in the area corresponding to a moving object with the image containing the moving object must be high in terms of Eq. (6). On the other hand, the points on the boundaries among patches always have high pairwise label costs according to Eq. (9) and (10). As a result of minimising the posterior energy, the boundaries usually bypass the areas where the moving objects appear. Here we also add a robustness parameter β in order to make the algorithm more robust to noise. When we compute the single label cost V (yi |xi ), we always carry out the robustness judgment as below: • if V (yi |xi ) < β, Go through the routine; Terminate the calculation of V (yi |xi ) and delete the • if V (yi |xi ) > β, point i from the list of s. And, β = (m − q) × F
(11)
where m is the number of input images (m = max(L)) and q is the number of input images that contain the roughly same noise if known. Generally, we set q as 2 so that random noise can be effectively reduced. We can also adjust q experimentally. If the reconstructed surface still has some undesirable noise, q should be increased. If there are some holes in the surface, we use a smaller q. In summary, Fig. 4 illustrates the idea of point shifting which is the key step to construct the MRF network in section 3 and a 2-image integration based on the MAP-MRF labeling described in details in section 4.
5
Energy Minimisation Using Loopy Belief Propagation
There are several methods which can minimise the posterior energy U (x|y). The comparative study in [24] shows that two methods, Graph-Cuts(GC) [23, 25] and LBP [19, 26], have proven to be efficient and powerful. GC is usually regarded fast
34
R. Song et al.
and has some desirable theoretical guarantee on the optimality of the solution it can find. And since we use the ‘metric’ Potts model to measure the difference between labels, the energy function can be minimised by GC indeed. However, in the future research, we plan to extend our technique by exploring different types of functions for the measurement of pairwise label cost such as the linear model and quadratic model [19]. Thereby, LBP becomes a natural choice for us. Max-product and sum-product are the two variants of LBP. The algorithm proposed in [19] can efficiently reduce their computational complexity from O(nk 2 T ) to O(nkT ) and O(nk log kT ) respectively, where n is the number of points or pixels in the image, k is the number of possible labels and T is the number of iterations. Obviously, max-product is suitable here for the minimisation. It takes O(k) time to compute each message and there are O(n) messages to be computed in each iteration. In our work, each input range image has about 104 points and typically a set of ranges images have 18 images. To reflect the true surface details of the object, the number of points in the output point cloud must be large enough. Thus if we use a point set as the label set, the number of labels would be tens of thousands. In that case LBP would be extremely time-consuming and too memory-hungry to be tractable. A MATLAB implementation would immediately go out of memory because of the huge n × k matrix needed to save the messages passing through the network. But the final output of integration must be a single point cloud. So, our MRF labeling using image numbers as labels makes it feasible to employ LBP for the minimisation. The LBP works by passing messages in a MRF network, briefed as follows: 1. For all point pairs (i, j) ∈ N , initialising message m0ij to zero, where mtij is a vector of dimension given by the number of possible labels k and denotes the message that point i sends to a neighbouring point j at iteration t. 2. For t = 1, 2, . . . , T , updating the messages as t−1 t mij (xj ) = min λAij (xi , xj ) + min(Di (yi , xi ), F ) + mhi (xi ) (12) xi
yi ∈L\xi
h∈N (i)\j
3. After T iterations, computing a belief vector for each point, bj (xj ) = min(Dj (yj , xj ), F ) + mTij (xj ) yj ∈L\xj
(13)
i∈N (j)
and then determining labels as:
x∗j = arg min bj (xj ) xj
(14)
According to Eq. (12), the computational complexity of computing one message vector is O(k^2), because we need to minimise over $x_i$ for each choice of $x_j$. A simple analysis reduces this to O(k), as follows:
1. Rewrite Eq. (12) as
$m^t_{ij}(x_j) = \min_{x_i}\big(\lambda A_{ij}(x_i, x_j) + g(x_i)\big)$   (15)
where
$g(x_i) = \min_{y_i \in L\setminus x_i}\big(D_i(y_i, x_i), F\big) + \sum_{h \in N(i)\setminus j} m^{t-1}_{hi}(x_i).$
2. Consider two cases: $x_i = x_j$ and $x_i \neq x_j$. (1) If $x_i = x_j$, then $\lambda A_{ij}(x_i, x_j) = 0$ and thus $m^t_{ij}(x_j) = g(x_j)$. (2) If $x_i \neq x_j$, then $\lambda A_{ij}(x_i, x_j) = \lambda$ and thus $m^t_{ij}(x_j) = \min_{x_i} g(x_i) + \lambda$.
3. Combining the two cases:
$m^t_{ij}(x_j) = \min\big(g(x_j),\ \min_{x_i} g(x_i) + \lambda\big)$   (16)
According to Eq. (16), the minimisation over $x_i$ needs to be performed only once, independent of the value of $x_j$. In other words, Eq. (12) requires two nested FOR loops to compute the messages, whereas Eq. (16) only needs two independent FOR loops. Therefore, the computational complexity is reduced from O(k^2) to O(k), and the memory demand in a typical MATLAB implementation is reduced from a k × k matrix to a 1 × k vector. As for implementation, we use a multi-scale version to further speed up the LBP algorithm, as suggested in [19], which significantly reduces the total number of iterations needed to reach convergence or an acceptable solution. Briefly speaking, the message-updating process is coarse-to-fine: we start LBP on the coarsest network, after a few iterations transfer the messages to a finer scale and continue LBP there, and so on.
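To make the message-update steps above concrete, the following Python/NumPy sketch (an illustration written for this text, not the authors' MATLAB implementation) applies the min-sum (max-product) updates of Eqs. (12), (15) and (16) with a Potts pairwise cost; the truncated per-point data costs and a symmetric neighbourhood structure are assumed to be precomputed and passed in.

```python
import numpy as np

def lbp_potts(data_cost, neighbours, lam, n_iters):
    """Min-sum loopy belief propagation with a Potts pairwise cost.

    data_cost  : (n, k) array; data_cost[i, x] plays the role of the truncated
                 term min(D_i(y_i, x), F) in Eq. (12), assumed precomputed.
    neighbours : list of lists; neighbours[i] holds the indices adjacent to i
                 (assumed symmetric: j in neighbours[i] iff i in neighbours[j]).
    lam        : Potts penalty (lambda) for neighbours taking different labels.
    """
    n, k = data_cost.shape
    # messages[(i, j)]: length-k message that point i sends to its neighbour j
    messages = {(i, j): np.zeros(k) for i in range(n) for j in neighbours[i]}

    for _ in range(n_iters):
        updated = {}
        for i in range(n):
            incoming = data_cost[i] + sum(messages[(h, i)] for h in neighbours[i])
            for j in neighbours[i]:
                g = incoming - messages[(j, i)]        # g(x_i) of Eq. (15)
                msg = np.minimum(g, g.min() + lam)     # Potts trick of Eq. (16), O(k)
                updated[(i, j)] = msg - msg.min()      # normalise for numerical stability
        messages = updated

    # beliefs (Eq. 13) and label assignment (Eq. 14)
    beliefs = np.array([data_cost[j] + sum(messages[(h, j)] for h in neighbours[j])
                        for j in range(n)])
    return beliefs.argmin(axis=1)
```

Note how the single minimisation `g.min()` inside the inner loop realises the O(k) update discussed above, instead of a nested loop over all label pairs.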
6 Experimental Results and Performance Analysis
The input range images are all downloaded from the OSU Range Image Database (http://sampl.ece.ohio-state.edu/data/3DDB/RID/index.htm). On average, each 'Bird' image has 9022 points and each 'Frog' image has 9997 points. We employed the algorithm proposed in [21] for pairwise registration. Inevitably, the images are not accurately registered: we found that the average registration error (0.30 mm for the 'Bird' images and 0.29 mm for the 'Frog' images) is as high as half the scanning resolution of the input data. Fig. 5 shows that the input range data are registered into the same coordinate system, rendering a noisy 3D surface model with many thick patches, false connections and fragments due to local and global registration errors. A large registration error causes corresponding points in the overlapping region to move away from each other, making overlapping-area detection more difficult, and the final integration is then likely to be inaccurate. In our tests, the truncation parameter F is set to 4 and the weighting parameter λ is set to 10. Figs. 6 and 7 show the integration results produced by existing methods and by our MRF-based integration, showing that our method is more robust to registration errors and thus performs best. Now we analyse some details of the new algorithm. Fig. 8 shows the labeling result. On the integrated surface, we use different colours to mark the points assigned with
Fig. 5. Range images (coloured differently) are registered into the same coordinate system as the input of the integration. A patch from some range image is visible if and only if it is the outermost surface. Left: 18 ‘Bird’ images; Right: 18 ‘Frog’ images.
Fig. 6. Integration results of 18 ‘Bird’ images. Top left: volumetric method [6]. Top middle: mesh-based method [14]. Top right: fuzzy-c means clustering [10]. Bottom left: k-means clustering [9]. Bottom right: the new method.
Fig. 7. Integration results of 18 ‘Frog’ images. Top left: volumetric method [6]. Top middle: mesh-based method [14]. Top right: fuzzy-c means clustering [10]. Bottom left: k-means clustering [9]. Bottom right: the new method.
Fig. 8. Left: labeling result of ‘Bird’ images. Right: labeling result of ‘Frog’ images.
Fig. 9. Convergence performance of the new method using different input images. Left: The ‘Bird’ image set; Right: The ‘Frog’ image set.
Fig. 10. Integration results of 18 ‘Bird’ images produced by different clustering algorithms. From left to right (including the top row and the bottom row): k-means clustering [9], fuzzy-c means clustering [10], the new algorithm.
Fig. 11. The triangular meshes of integrated surfaces using different methods. From left to right: volumetric method [6], mesh-based method [14], point-based method [16], our method.
different labels, corresponding to different input range images. The boundary areas coloured red represent the triangles whose three vertices are not assigned the same label (note that the surface is essentially a triangular mesh). If a group of points are all assigned the same label, the surface they define is marked with the same colour. We find that not all input images contribute to the final integrated surface: for the 'Bird' and the 'Frog' images, the reconstructed surfaces are composed of points from 10 and 11 source images, respectively. We plot in Fig. 9 how the LBP algorithm converges. To observe the convergence behaviour, we evaluate Eq. (13) to determine the label assignment in each iteration of message passing, and then count the number of points whose labels differ from those assigned in the previous iteration. When this number is less than 2% of the total number of initialisation points (I_net), the iteration is terminated. Fig. 9 shows that the new method achieves high computational efficiency in terms of the number of iterations required for convergence. Due to the different objective functions, it is difficult to define a uniform metric such as the integration error [9, 10] for a comparative study: to maintain local surface topology, we reject the idea that the smaller the average distance from the fused points to the closest points on the overlapping surfaces, the more stable and accurate the final fused surface; thus we cannot use such a metric in this paper. Nonetheless, Fig. 10 highlights the visual difference between the integration results produced by classical clustering methods (which, according to Figs. 6 and 7, are superior to the other existing methods) and the new method. It can be seen that our algorithm performs better, particularly in non-flat regions such as the neck and the tail of the bird. Even so, for a fair comparison, we adopt some widely used measurement parameters that are not tied to the objective function: 1) the distribution of interior angles of the triangles, which reflects the overall quality of the triangles: the closer the interior angles are to 60°, the more similar the triangles are to equilateral ones; 2) the average distortion metric [27]: the distortion metric of a triangle is defined as its area divided by the sum of the squares of the lengths of its edges, normalised by a factor of 2√3; its value lies in [0, 1], and the higher the average distortion metric, the higher the quality of
Fig. 12. Different performance measures of integration algorithms. Left: distortion metric. Right: computational time.
a surface; and 3) the computational time. Figs. 11 and 12 show that the new method performs better in terms of the distribution of interior angles of triangles, the distortion metric and the computational time. All experiments were done on a Pentium IV 2.40 GHz computer. Additionally, because no specific segmentation scheme is involved, our MRF-labeling-based method saves computational time compared with techniques that use a segmentation algorithm as preprocessing before the integration [10].
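To illustrate the mesh-quality measures used above, the following sketch (a hypothetical Python/NumPy example, not the authors' code) computes the interior angles and the distortion metric of every triangle in a mesh; the 2√3 normalisation factor simply follows the description in the text.

```python
import numpy as np

def triangle_quality(vertices, faces, scale=2 * np.sqrt(3)):
    """Interior angles (degrees) and distortion metric per triangle.

    vertices : (V, 3) array of 3D points; faces : (F, 3) integer array.
    The distortion metric is the triangle area divided by the sum of its
    squared edge lengths, scaled by `scale` (2*sqrt(3) as stated in the text).
    """
    p0, p1, p2 = (vertices[faces[:, i]] for i in range(3))
    e0, e1, e2 = p1 - p0, p2 - p1, p0 - p2          # directed edges

    def angle(u, v):
        # angle between two edge vectors sharing a vertex, in degrees
        cosang = np.einsum('ij,ij->i', u, v) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
        return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

    angles = np.stack([angle(-e2, e0),    # angle at p0
                       angle(-e0, e1),    # angle at p1
                       angle(-e1, e2)],   # angle at p2
                      axis=1)

    area = 0.5 * np.linalg.norm(np.cross(e0, -e2), axis=1)
    sq_edges = (e0 ** 2).sum(1) + (e1 ** 2).sum(1) + (e2 ** 2).sum(1)
    distortion = scale * area / sq_edges
    return angles, distortion
```

Comparing the histograms of `angles` (closeness to 60°) and the mean of `distortion` across methods reproduces the kind of evaluation summarised in Figs. 11 and 12.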
7 Conclusion and Future Work
Clustering-based methods have proved superior to other existing methods for integrating multi-view range images. It has, however, been shown that classical clustering methods lead to significant misclassification in non-flat areas, as the local surface topology is neglected. In this paper, we developed an MRF model to describe the integration problem and solved it by LBP, producing better results as local surface details are well preserved. Since the novel MRF-based method considers all registered input images for each point belonging to the network I_net, the integration is globally optimised and robust to noise and registration errors. The reconstructed surface is thus geometrically realistic. The method is also applicable to many other data sources, such as 3D unstructured point clouds. However, several improvements can still be made in the future. As mentioned above, the Potts model used in this paper is merely the simplest choice, so we plan to explore different types of MRF models. We also hope to reduce the accumulated registration errors. Each pairwise registration produces one equation constraining the transformation matrices. If we have more locally registered pairs than the total number of images in the sequence, the system of equations is overdetermined; solving this linear system in a least-squares sense produces a set of global registrations with minimal deviation from the calculated pairwise registrations. In future work, we will develop a scheme that balances the extra cost of additional pairwise registrations against the reduction in error accumulation.
Acknowledgments. Ran Song is supported by HEFCW/WAG on the RIVIC project. This support is gratefully acknowledged.
References 1. Bernardini, F., Mittleman, J., Rushmeier, H., Silva, C., Taubin, G.: The ball-pivoting algorithm for surface reconstruction. IEEE Trans. Visual. Comput. Graph. 5, 349–359 (1999) 2. Kutulakos, K., Seitz, S.: A theory of shape by space carving. Int. J. Comput. Vision 38, 199–218 (2000) 3. Dey, T., Goswami, S.: Tight cocone: a watertight surface reconstructor. J. Comput. Inf. Sci. Eng. 13, 302–307 (2003) 4. Mederos, B., Velho, L., de Figueiredo, L.: Smooth surface reconstruction from noisy clouds. J. Braz. Comput. Soc. 9, 52–66 (2004) 5. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proc. SIGGRAPH, pp. 303–312 (1996)
6. Dorai, C., Wang, G.: Registration and integration of multiple object views for 3d model construction. IEEE Trans. Pattern Anal. Mach. Intell. 20, 83–89 (1998) 7. Rusinkiewicz, S., Hall-Holt, O., Levoy, M.: Real-time 3d model acquisition. In: Proc. SIGGRAPH, pp. 438–446 (2002) 8. Sagawa, R., Nishino, K., Ikeuchi, K.: Adaptively merging large-scale range data with reflectance properties. IEEE Trans. Pattern Anal. Mach. Intell. 27, 392–405 (2005) 9. Zhou, H., Liu, Y.: Accurate integration of multi-view range images using k-means clustering. Pattern Recognition 41, 152–175 (2008) 10. Zhou, H., Liu, Y., Li, L., Wei, B.: A clustering approach to free form surface reconstruction from multi-view range images. Image and Vision Computing 27, 725–747 (2009) 11. Hilton, A.: On reliable surface reconstruction from multiple range images. Technical report (1995) 12. Turk, G., Levoy, M.: Zippered polygon meshes from range images. In: Proc. SIGGRAPH, pp. 311–318 (1994) 13. Rutishauser, M., Stricker, M., Trobina, M.: Merging range images of arbitrarily shaped objects. In: Proc. CVPR, pp. 573–580 (1994) 14. Sun, Y., Paik, J., Koschan, A., Abidi, M.: Surface modeling using multi-view range and color images. Int. J. Comput. Aided Eng. 10, 137–150 (2003) 15. Li, X., Wee, W.: Range image fusion for object reconstruction and modelling. In: Proc. First Canadian Conference on Computer and Robot Vision, pp. 306–314 (2004) 16. Zhou, H., Liu, Y.: Incremental point-based integration of registered multiple range images. In: Proc. IECON, pp. 468–473 (2005) 17. Wang, X., Wang, H.: Markov random field modelled range image segmentation. Pattern Recognition Letters 25, 367–375 (2004) 18. Suliga, M., Deklerck, R., Nyssen, E.: Markov random field-based clustering applied to the segmentation of masses in digital mammograms. Computerized Medical Imaging and Graphics 32, 502–512 (2008) 19. Felzenszwalb, P., Huttenlocher, D.: Efficient belief propagation for early vision. International Journal of Computer Vision 70, 41–54 (2006) 20. Diebel, J., Thrun, S.: An application of Markov random fields to range sensing. In: Proc. NIPS (2005) 21. Liu, Y.: Automatic 3d free form shape matching using the graduated assignment algorithm. Pattern Recognition 38, 1615–1631 (2005) 22. Li, S.: Markov random field modelling in computer vision, 2nd edn. Springer, Heidelberg (1995) 23. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 1222–1239 (2001) 24. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1068–1080 (2008) 25. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26, 147–159 (2004) 26. Yedidia, J., Freeman, W., Weiss, Y.: Understanding belief propagation and its generalizations. Technical Report TR-2001-22, MERL (2002) 27. Lee, C., Lo, S.: A new scheme for the generation of a graded quadrilateral mesh. Computers and Structures 52, 847–857 (1994)
Wave Interference for Pattern Description
Selen Atasoy 1,2, Diana Mateus 2, Andreas Georgiou 1, Nassir Navab 2, and Guang-Zhong Yang 1
1 Visual Information Processing Group, Imperial College London, UK {catasoy,a.georgiou,g.z.yang}@imperial.ac.uk
2 Computer Aided Medical Procedures and Augmented Reality, Technical University of Munich, Germany {atasoy,mateus,navab}@in.tum.de
Abstract. This paper presents a novel compact description of a pattern based on the interference of circular waves. The proposed approach, called “interference description”, leads to a representation of the pattern, where the spatial relations of its constituent parts are intrinsically taken into account. Due to the intrinsic characteristics of the interference phenomenon, this description includes more information than a simple sum of individual parts. Therefore it is suitable for representing the interrelations of different pattern components. We illustrate that the proposed description satisfies some of the key Gestalt properties of human perception such as invariance, emergence and reification, which are also desirable for efficient pattern description. We further present a method for matching the proposed interference descriptions of different patterns. In a series of experiments, we demonstrate the effectiveness of our description for several computer vision tasks such as pattern recognition, shape matching and retrieval.
1 Introduction
Many tasks in computer vision require describing a structure by the contextual relations of its constituents. Recognition of patterns whose dominant structure is due to the global layout rather than individual texture elements (such as the triangle images in Figure 1a-c and the peace symbols in Figure 10a), patterns with missing contours (such as the Kanizsa triangle in Figure 1d and Figure 5b) and patterns with large homogeneous regions (such as the shape images in Figure 9) necessitates describing the contextual relations within a structure in an efficient yet distinct manner. The captured information has to be discriminative enough to distinguish between structures with small but contextually important differences, and robust enough to missing information. In this paper, we introduce a method based on wave interference that efficiently describes the contextual relations between the constituent parts of a structure. Interference of several waves that originate from different parts of a structure leads to a new wave profile, namely the interference pattern (Figure 2). This interference profile is computed as the superposition of all constituent circular waves and has two properties relevant for describing a structure. Firstly, the
Fig. 1. Interference patterns created from input images (shown in the first row) with several geometric transformations (a-c) and missing contours (d) lead to the emergence of similar patterns (shown in the second row), whereas shapes with only very small variations yield very distinct interference patterns (e-j)
interference profile depends on the relative spatial arrangement of the sources that create the constituent waves. Secondly, it intrinsically captures all relationships of the constituent waves. Thus, the interference phenomenon provides a natural mechanism to describe the structure of a pattern by representing the interrelations of its components. The resulting interference patterns are discriminative enough to separate different structures with only very small variations, such as the examples shown in Figure 1e-j, but are not sensitive to a loss of information due to missing contours or to geometric transformations, as illustrated in Figure 1a-d. Recognition of patterns with missing contours is an important property of human perception which has been extensively studied by Gestalt psychology [1]. This constructive property, known as “reification”, is demonstrated in Figures 1d and 5b, where a triangle is perceived although it is not explicitly delineated. This is due to the spatial configuration of the constituent contours. Gestalt theory also points out two other relevant properties of human vision, i.e., invariance and emergence. “Invariance” analyses how objects are recognized invariant to their location, orientation and illumination. It is also an active area of research in computer vision. The “emergence” property states that a pattern is not recognized by first detecting its parts and combining them into a meaningful whole, but rather emerges as a complete object; i.e. “the whole is more than the sum of its parts”. In fact, the interference phenomenon intrinsically exhibits the emergence property. The interference pattern is produced by the superposition of two or more waves. This leads to the occurrence of new structures which are not present in any of the constituent waves nor in their simple sum but are created by the interrelations of the parts, as shown in Figure 2. The contribution of the work presented in this paper is twofold: firstly, we introduce the notion of wave interference for pattern description. To this end, we create interference patterns from an input image. This leads to an emergent description of the structure in the input pattern by representing the interrelations of its constituent parts. We demonstrate that our method, called “interference description” (ID), exhibits some of the key Gestalt properties including emergence, reification and invariance. Secondly, we use these interference patterns to
create compact descriptor vectors and show in a series of experiments how the interference patterns can be applied to several computer vision tasks such as pattern recognition, shape matching and retrieval.
2 Related Work
In general, pattern description techniques can be grouped into three main categories. Local methods extract and describe discriminative local information such as features and patches. The descriptors are then used for object matching or pattern recognition without using their contextual relations (see [2] and references therein). Global methods, such as descriptions based on the spectrum of the Laplacian [3, 4] or finite element modal analysis [5], capture distinctive information using the whole content of the pattern. In a third category, contextual methods, the local information is analyzed within the global context. To this end, relevant local information of the parts is combined with the contextual relationships between them. For instance, Shechtman and Irani create a descriptor using patches with a certain degree of self-similarity [6], and Belongie et al. compute shape context descriptors using distance histograms for each point [8]. These descriptors are then usually combined with a suitable matching method such as star-graph [6] or hyper-graph matching [7], or nearest-neighbour matching after aligning two shapes [8]. One of the major challenges for modelling the contextual relations in the image domain is the computational complexity, which increases exponentially with the order of relationships considered. Therefore, most methods include relations up to second order (pair-wise) [6, 8] or third order [7]. Furthermore, for applications such as retrieval, shape comparison or pattern recognition, a more compact description of the whole structure is desired, as the explicit computation of feature correspondences is actually not required. Another form of global description methods, the frequency analysis techniques, on the other hand, allows the consideration of more complex relations of a pattern by encoding all global information. Fourier descriptors [9], for example, project the shape contour onto a set of orthogonal basis functions. Wavelet-based methods such as [10] analyze the response to wave functions that are more localized in space and frequency. Spectral analysis using spherical harmonics [11] and eigenfunctions of the Laplace-Beltrami operator (vibration modes) has also been successfully applied for the representation and matching of images [4] and 3D shapes [3]. In this paper, we introduce the notion of wave interference in order to create a compact pattern description. The proposed method intrinsically captures n-order relations¹ without increasing the computational complexity. The novelty of our method lies in the fact that it analyses the interference, and thus the interaction, of circular waves over a range of frequencies, whereas other frequency techniques ([3–5, 9–11]) are based on the responses at different frequencies, which do not interact with each other.
¹ n is the number of all individual parts modelled as sources in our method.
3 Properties of Interference Description
In this section we introduce and discuss some of the key properties of ID which are relevant for pattern description. Modelling Contextual Relations: In ID, contextual relations emerge automatically due to the relation of individual circular waves. This has two important benefits: Firstly, it eliminates the need for explicit modelling of any order relations or for defining a set of rules. Secondly, all contextual relations of a pattern are considered without increasing the computational cost. The interference is computed by a simple addition of complex wave functions, where the computation of individual wave functions are independent. Therefore, including a new part to the pattern is simply done by adding its wave function to the sum, whereas in other contextual methods, all descriptors need to be re-computed in order to consider their relations to the newly added part. Furthermore, as the contextual relations are already included in the ID, comparison of descriptors for retrieval or recognition can be performed by simple nearest neighbor matching. This avoids the requirement of more sophisticated methods such as graph matching or non-linear optimization. Specificity and Precision: In pattern description there exists a natural tradeoff between the specificity and precision of the method. ID allows to regulate this trade-off by the choice of the number of frequencies. A structure is described by a set of interference patterns computed over a range of frequencies (explained in Section 4.3), where the higher the frequency, the more specific the description becomes. Therefore the number of frequencies, which is the only parameter of the proposed method, allows adjusting the specificity of the description depending on the application. Verification of Gestalt Properties: Gestalt psychology has played a crucial role in understanding human visual perception. It also has been an inspiring model for several computer vision applications such as analyzing shapes [12], learning object boundaries [13], defining mid-level image features [14] and modeling illusionary contour formation [15]. An extensive study of the Gestalt principles and their application in computer vision is presented in [1, 16]. In this work, we point out the analogy between the interference phenomenon and three important Gestalt principles, namely emergence, reification and invariance. In Section 5.1, we demonstrate that due to the particular nature of interference, our method intrinsically satisfies the mentioned Gestalt properties and eliminates the need for a set of defined rules or learning using a training dataset.
4 Methods
In this section, we first introduce the background on waves and the interference phenomenon. Then we explain our approach to pattern description, where an input pattern is first transformed into a field of sources; each creating a circular
Fig. 2. a) Demonstration of the emergent nature of interference. Each source in the stimulus field creates an (attenuating) circular wave on the medium, resulting in the final interference pattern. b) Sum of the individual wave patterns. c) Interference of individual wave patterns. The interference phenomenon intrinsically exhibits the emergence property, as the whole (the interference pattern) shown in c) differs from the sum of its constituent wave profiles shown in b).
wave; and resulting jointly in an interference pattern. Finally, we propose a method to compare the created interference patterns.
4.1 Background
In its most basic form, a wave can be defined as a pattern (or distribution) of disturbance that propagates through a medium as time evolves. Harmonic waves are waves with the simplest wave profile, namely sine or cosine functions. Due to their special waveform they create wave patterns which are not only periodic in time but also in space. Thus, on a 2D medium, harmonic waves give rise to circular wave patterns such as the ones observed on a liquid surface when a particle is dropped into the liquid. Using complex representation, a circular wave is described by the following wave function of any point x on the medium and a time instance t:
$\psi_{\phi}(A_0, \omega; x, t) = A_0 \cdot e^{i(\omega \|x - \phi\|)} \cdot e^{i\omega t},$   (1)
where i is the imaginary unit, ω is the frequency of the wave and φ is the location of the source (stimulus). A0 denotes the intensity of the source and ||x− φ|| gives the distance of each point x on the medium to the location of the source. As the time dependent term eiωt is not a function of space (x), it does not affect the interference of the waves. Therefore, in the rest of the paper it will be taken as t = 0. So, the instantaneous amplitude of the disturbance at each point on the medium is given by the real part of the wave function Real [ψφ (A0 , ω; x)]. For simplicity, we will denote waves with a fixed frequency ω and a fixed amplitude A0 as ψφ (x) := ψφ (A0 , ω; x). As waves propagate outwards from the source, the amplitude of the wave will gradually decrease with the increased distance from the source due to friction. This effect, called attenuation, can be described by multiplying the amplitude of the wave with the attenuation profile σ. When two or more waves ψφi are created on the same medium at the same time from several source locations φi , the resultant disturbance Ψ at any point x on the medium is the algebraic sum of the amplitudes of all constituents. Using
complex representation, the interference pattern Ψ(x) is given by the amplitude of the sum of the individual complex wave functions ψ_{φ_i}(x):
$\Psi(x) = \Big|\sum_{i=1}^{n} \psi_{\phi_i}(x)\Big|,$   (2)
where | · | denotes the absolute value of the complex number. Note that the absolute value of the sum of a set of complex numbers is different from the sum of their real parts (see Figure 2). This is known as the superposition principle of waves. The superposition of waves with fixed frequencies gives rise to a special phenomenon called interference. In this special case, if the waves reaching a point on the medium are in phase (aligned in their ascents and descents), they amplify each other's amplitudes (constructive interference); conversely, if they are out of phase, they diminish each other at that point (destructive interference). This results in a new wave profile called the interference pattern. Figure 2a illustrates the interference pattern created by the superposition of 3 waves with the same frequency. The interference pattern is computed as the absolute value of the sum of the complex wave functions. The complex addition results in a waveform which is not present in any of the constituent waves nor in the simple sum of the constituent wave amplitudes (Figure 2b) but occurs due to their combination (Figure 2c). Therefore, the interference property of waves provides an emergent mechanism for describing the relations between the sources and for propagating local information.
4.2 Interference Description
Using the introduced definitions in Section 4.1, our method formulates the contextual relations of the parts composing the image in an emergent manner. Let Ω ⊂ R2 be the medium on which the waves are created, where x ∈ Ω denotes a point on Ω. In practice the medium is described with a grid discretizing the space. The function I : Ω → R, called stimulus field, assigns a source intensity A0 = I(φ) to the positions φ on the medium Ω. This function contains the local information of the pattern which is propagated over the whole medium. Depending on the application, it can be chosen to be a particular cue defined on the image domain such as gradient magnitude, extracted edges or features. Each value I(φ) of the stimulus field I induces a circular wave ψφ(I(φ), ω; x) on the medium. We use the frequency of the wave to describe the relation between the distribution of sources and the scale at which one wants to describe the pattern. This is discussed in detail in Section 4.3. The profile of the created circular wave is described as:
$\psi_{\phi}(I(\phi), \omega; x) = I(\phi) \cdot e^{i(\omega \|x - \phi\|)} \cdot \sigma,$   (3)
where σ is the attenuation profile. We define the attenuation profile of the wave induced by the source φ with frequency ω as:
$\sigma(\phi, \omega; x) = \omega \cdot e^{-\omega \cdot \|x - \phi\|},$   (4)
Fig. 3. Wave patterns for different source locations with the 3rd frequency (k = 3) shown in a) and 8th frequency (k = 8) shown in b)
where ||x − φ|| denotes the Euclidean distance between a point x and the source location φ. This attenuation profile allows us to conserve the total energy of the waves while changing their frequency. Note that the attenuation profile is a function of the distance to the source (||x − φ||) as well as of the frequency ω. The higher the frequency, the sharper the decay of the attenuation profile. This yields a localized effect for high frequencies and a more spread-out effect for low frequencies. Once the locations and intensities of the sources on the stimulus field are determined, the interference pattern Ψ(I, ω; x) on the medium for a fixed frequency ω is computed simply by the superposition of all wave patterns:
$\Psi(I, \omega; x) = \Big|\sum_{\phi_i \in \Omega} \psi_{\phi_i}(I(\phi_i), \omega; x)\Big|.$   (5)
4.3 Multi-frequency Analysis
Computing the interference patterns for several frequencies allows for the analysis of the stimulus field at different scales. As discussed in Section 4.2, high frequency waves are more localized and propagate the content in a smaller neighbourhood, whereas low frequency waves are spread over a larger area. Figure 3 illustrates the circular wave patterns for several frequencies and source locations. Formally, we define the interference description (ID) Θ(I) of an input I as the set of interference patterns Ψ(I, ωi) for a range of frequencies {ω1, ω2, ..., ωn}:
$\Theta(I) = \{\Psi(I, \omega_1), \Psi(I, \omega_2), ..., \Psi(I, \omega_n)\},$   (6)
with $\omega_i = \frac{2\pi \cdot k_i}{P}$, where $k_i \in \mathbb{N}$ is the wave-number, i.e. the number of periods the wave has on the medium, and P is the length of the grid.
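The construction of Eqs. (3)-(6) can be sketched as follows; this is a minimal Python/NumPy illustration (not the authors' implementation), which assumes the stimulus field is given as a 2D array of source intensities on the discretised medium.

```python
import numpy as np

def interference_description(stimulus, wave_numbers, grid_length=2 * np.pi):
    """Interference patterns Psi(I, omega; x) for a set of wave numbers (Eqs. 3-6).

    stimulus     : (H, W) array I(phi); non-zero entries act as wave sources.
    wave_numbers : iterable of integers k_i; omega_i = 2*pi*k_i / grid_length.
    Returns a list of (H, W) interference patterns (the interference description).
    """
    h, w = stimulus.shape
    ys, xs = np.meshgrid(np.linspace(-grid_length / 2, grid_length / 2, h),
                         np.linspace(-grid_length / 2, grid_length / 2, w),
                         indexing='ij')
    src_r, src_c = np.nonzero(stimulus)

    description = []
    for k in wave_numbers:
        omega = 2 * np.pi * k / grid_length
        field = np.zeros((h, w), dtype=complex)
        for r, c in zip(src_r, src_c):
            dist = np.hypot(ys - ys[r, c], xs - xs[r, c])         # ||x - phi||
            attenuation = omega * np.exp(-omega * dist)           # Eq. (4)
            field += stimulus[r, c] * np.exp(1j * omega * dist) * attenuation  # Eq. (3)
        description.append(np.abs(field))                         # Eq. (5)
    return description
```

The grid default of 2π matches the [−π, π] × [−π, π] medium used later in the experiments; the stimulus array could, for instance, hold the gradient magnitude of the input image.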
4.4 Descriptor Comparison
In this section, we present a method to compare two IDs, Θ(I1 ) and Θ(I2 ) in a rotation invariant manner. To this end, we first define a coherency measure c of an interference pattern Ψ (x):
$c(\Psi(x)) = \frac{|\Psi(x)|}{\sum_{\phi_i \in \Omega} |\psi_{\phi_i}(x)|}.$   (7)
The coherency measures the power of the interference pattern compared to the sum of the powers of the constituent waves. As $\sum_{\phi_i \in \Omega} |\psi_{\phi_i}(x)|$ provides the upper limit to the power of the interference amplitude |Ψ(x)| for each location x on the medium, the coherency measure c normalizes the values of the interference profile to the interval [0, 1] (Figure 4c-d). This enables the comparison of different IDs independent of their absolute values and therefore makes the comparison independent of the number of sources. We subdivide this coherency image c(Ψ(x)) into m + 1 levelsets {γ0, γ1, · · · , γj, · · · , γm},
$\gamma_j = \{x \mid c(\Psi(x)) = j/m\},$   (8)
and take the sum of the coherency values for each interval between two consecutive levelsets. This yields an m-dimensional measure $h = [h_1, \cdots, h_m]^{T}$, with $h_j = \sum_{\gamma_{j-1} \le x < \gamma_j} c(\Psi(x))$, computed from an interference pattern Ψ(x) (Figure 4e). Performing this levelset histogram calculation for the whole range of n frequencies of an ID Θ(I) results in an m × n dimensional measurement $h(\Theta(I)) = [h(\Psi(I, \omega_1)), \cdots, h(\Psi(I, \omega_n))]$ (Figure 4f). Quantifying the amount of coherency in each levelset enables a rotation-invariant description of the created interference patterns. The distance between two IDs Θ(I1) and Θ(I2) is then simply computed as the cosine measure between the levelset histograms:
$\mathrm{dist}(I_1, I_2) = \arccos\!\left(\frac{h(\Theta(I_1))^{T} \cdot h(\Theta(I_2))}{\|h(\Theta(I_1))\| \cdot \|h(\Theta(I_2))\|}\right).$   (9)
This dissimilarity measures the distance between the two patterns in a rotation invariant manner (due to the levelset histograms) while taking into account the contextual relations between the constituent parts of each pattern (due to the interference pattern).
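A compact sketch of the comparison pipeline of Eqs. (7)-(9) is given below (again an illustrative Python/NumPy example, not the original code); it assumes that, for each frequency, both the interference pattern and the sum of the constituent wave amplitudes have been kept.

```python
import numpy as np

def coherency(interference, amplitude_sum):
    """Eq. (7): |Psi(x)| divided by the sum of constituent wave amplitudes."""
    return np.abs(interference) / np.maximum(amplitude_sum, 1e-12)

def levelset_histogram(coh, m):
    """Eq. (8): accumulate coherency values between consecutive level sets."""
    bins = np.linspace(0.0, 1.0, m + 1)
    idx = np.clip(np.digitize(coh.ravel(), bins) - 1, 0, m - 1)
    return np.bincount(idx, weights=coh.ravel(), minlength=m)

def id_distance(patterns1, amps1, patterns2, amps2, m=5):
    """Eq. (9): cosine-based distance between two interference descriptions."""
    h1 = np.concatenate([levelset_histogram(coherency(p, a), m)
                         for p, a in zip(patterns1, amps1)])
    h2 = np.concatenate([levelset_histogram(coherency(p, a), m)
                         for p, a in zip(patterns2, amps2)])
    cos = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

The default of 5 level-set intervals mirrors the setting mentioned in the caption of Fig. 9; because the histograms pool coherency values spatially, the resulting distance is rotation invariant, as stated above.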
Fig. 4. Rotation invariant multi-frequency descriptor. a) Input image, b) sources located on the medium Ω as marked in green, c) interference patterns computed for a range of frequencies Θ(I), d) coherence measure computed on the interference patterns c(Θ(I)), e) Levelsets {γ1 , · · · , γm } and f) the levelset histogram, where the columns correspond to descriptors at different frequencies.
5 Experiments and Results
Firstly, we demonstrate that the ID exhibits the Gestalt properties discussed in Section 2. Then, we show its application to shape matching and retrieval and to pattern recognition. For our experiments, we first create a grid (medium Ω : [−π, π] × [−π, π]) and map the locations of the stimulus field to the sources on the medium. Each source creates a circular wave pattern propagating outwards from its location. We define the stimulus field to be the gradient magnitude of the input image. Note that the use of more sophisticated feature extraction methods such as phase congruency [17] can further improve the performance; the choice of gradient magnitude is only due to its computational simplicity. The strength of the disturbance (A0) at a location x on the medium is set to be the gradient magnitude at that location.
5.1 Verification of Gestalt Properties
In this section we demonstrate how the proposed description exhibits the Gestalt properties of emergence, reification and invariance. Emergence: The emergence property states that the whole (structure) is more than the sum of its parts. This is due to the interrelations of the constituents. In ID, the whole (i.e. the interference pattern shown in Figure 2c) is given by the amplitude of the complex interference field (the sum of the complex wave functions). This interference pattern is different from the sum of the amplitudes of the constituent waves (Figure 2b). This is due to the constructive and destructive interference effects (as explained in Section 4.1) and can be seen mathematically as:
$\Psi(x) = \Big|\sum_{i=1}^{n} \psi_{\phi_i}(x)\Big| \;\neq\; \sum_{i=1}^{n} \mathrm{Real}\big[\psi_{\phi_i}(x)\big],$   (10)
where the left side of the inequality defines the interference pattern and the right side gives the sum of the amplitudes of its constituent waves. The difference between the whole and the sum of its parts leads to the emergence of a new profile which is not present in any of the constituent circular waves nor in their simple sum but occurs due to their interrelations. Reification: Reification is the constructive property seen in human perception [1, 18]. In Figures 5b and f, a triangle and a square are perceived, although they are not explicitly delineated. This is due to the particular spatial arrangement of the pacman shapes. ID results in the emergence of the same interference patterns in the locations where a triangle is perceived in Figures 5a and b and where a square is perceived in Figures 5c and d. Therefore, ID also allows the recognition of patterns with missing contours, such as the triangle image in Figure 1d.
Fig. 5. Reification property seen in wave interference. Similar interference patterns emerge for a complete triangle a) and a Kanizsa triangle image b) or a complete square e) and a Kanizsa square image f). c) and d) demonstrate the interference patterns for the 4th frequency (k = 4) for the triangle and Kanizsa triangle, respectively. The source locations are illustrated in black crosses. g) and h) show the interference patterns of images in e) and f) for the 8th frequency (k = 8).
Invariance: The invariance of ID to scale change is achieved by mapping the stimulus field onto the same medium (a discretized grid of fixed size). This creates sources along contours of the same size; however, the image with a smaller scale results in a smaller number of sources sampled along the contour. The structure of the created interference patterns is the same, as shown in Figures 6a,b. As the comparison method proposed in Section 4.4 is independent of the number of sources, the two IDs result in very similar levelset histograms, as shown in Figures 6a,b. We further demonstrate the invariance of the ID to rotation, affine and perspective transformations, as well as to deformation. To this end, we simulate a binary triangle image and apply the corresponding transformations. Figure 6c-g displays, for each transformation, the input images, the interference patterns for the same example frequency and the corresponding levelset histograms computed as described in Section 4.4. Note the similarity of the IDs and the levelset histograms despite the large geometric transformations between the images.
Fig. 6. Invariance of ID for scale change (a,b), rotation (c,d), affine transformation (c,e), perspective transformation (c,f) and deformation (c,g). First row shows the input patterns, second row illustrates interference pattern for the 6th frequency k = 6 and last row demonstrates the levelset histograms computed from the IDs h(Θ(I)).
Fig. 7. The confusion matrix of ID matching as applied on the MPEG7 database
5.2 Application to Shape Matching and Retrieval
In this experiment, we demonstrate the use of IDs for shape matching. We apply the presented description approach to match objects from the contour images of the MPEG7 CE shape database, which contains 1400 images in 70 different categories. For each image, we first compute the interference patterns for 10 frequencies. Then we compute the similarity of each shape image against all other images in the database by comparing the IDs as described in Section 4.4. Figure 7 shows the performance of ID matching as a confusion matrix computed from the complete MPEG7 database. After matching the shape images, we assign the best-matched category to each image for retrieval, considering the 20 minimum distances. Figure 8 displays the recall and precision over the 70 categories. Figure 9a shows 3 example shapes: chopper, guitar and spoon, where chopper leads to high recall and precision in the retrieval because of its characteristic structure. Due to their high similarity, guitar is confused with spoon in the final category retrieval. However, as demonstrated in Figure 9a, the interference patterns of the two shapes still show some differences despite the significant similarity between them. The confusion is caused by the discretized levelset histogramming, as illustrated in the last row of Figure 9a. In order to evaluate the inter-category distances, we embed the levelset histogram vectors (Section 4.4) linearly into 2D space using classical multi-dimensional scaling (MDS). Figure 9b shows the 2D MDS plot of the representatives of each category. The MDS
Fig. 8. a) Recall and b) precision values for shape category retrieval of IDs when applied to all categories in the MPEG7 shape database
Fig. 9. a) First row: input images for 3 example shapes. Second row: the corresponding interference patterns for the 5th frequency (k = 5). Third row: Levelset histograms computed with 5 levelset intervals for 10 frequencies. b) 2D MDS plot showing a representative for each category shape. Even in 2D representation similar categories are closely located whereas distinct categories are more separated.
plot demonstrates how similar shape categories are located more closely together, whereas more distinct structures are more separated.
5.3 Application to Pattern Recognition
In this experiment we demonstrate that ID can be used to recognize patterns created by different texture elements. Therefore, we use images with large inter-class variability in the representation where in each image the pattern is generated with different texture elements. Although the details of the patterns are different, they all share the same global layout. This global structure is enhanced in the low frequency interference patterns. Therefore we create IDs Θ(I) for k = {1, · · · , 4}. For a quantitative evaluation we compare the levelset histograms created from the IDs to the histograms of gradients and of the intensities. Figure 10b illustrates IDs created from the images shown in Figure 10a. Figure 10c demonstrates their matching levelset histograms. IDs lead to similar but distinctive patterns,
Fig. 10. Application of ID for pattern recognition. a) Input images containing the same global layout created with different texture elements. b) IDs at the frequency (k = 3) of the images a-d). c) Levelset histograms as compared to histogram of intensities d) and gradients e). For visualization the histograms are reshaped to 5 × 4 matrices.
whereas the gradients or intensities lead to less similar histograms (Figure 10d,e) that also lack discriminative power. The noisy background in the 4th image in Figure 10a results in high gradient values in the background. This is also reflected in its effect on the histogram of gradients (4th image in Figure 10e). Naturally, this introduces several sources due to the background, as the stimulus field I(φ) is defined to be the gradient magnitude. However, as shown in Figure 10b, the ID tolerates a certain degree of background noise.
6 Discussion and Conclusions
In this work, we have introduced a method for pattern description that is based on wave interference. Due to the characteristics of the interference phenomenon, our method intrinsically accounts for the contextual relations between the parts of a pattern. This eliminates the need for defining a set of rules or for any prior learning. Furthermore, the particular mathematical formulation of the presented method allows for the computation of higher order relations in an efficient manner, i.e. by a simple complex addition. In a series of experiments, we have shown that the proposed description is in agreement with three key Gestalt properties relevant for pattern description. Future work will be directed towards extending this approach to describing a whole scene containing different objects and patterns. This requires creating an interference description for each object and then describing the scene hierarchically at different scales. In this work, we have particularly pointed out the analogies between the wave interference phenomenon and some desired properties of pattern description, and demonstrated its efficient application to pattern recognition, shape matching and retrieval.
Acknowledgements. This research was supported by the Graduate School of Information Science in Health (GSISH) and the TUM Graduate School. The authors would like to thank Wellcome Trust and EPSRC for funding The Centre of Excellence in Medical Engineering and Tobias Wood and Ozan Erdogan for valuable discussions.
References 1. Koffka, K.: Principles of Gestalt psychology. Routledge, New York (1999) 2. Lowe, D.: Object recognition from local scale-invariant features. In: International Conference on Computer Vision, ICCV, vol. 2, pp. 1150–1157 (1999) 3. Levy, B.: Laplace-beltrami eigenfunctions: Towards an algorithm that understands geometry. In: IEEE International Conference on Shape Modeling and Applications, pp. 13–25 (2006) 4. Peinecke, N., Wolter, F.E., Reuter, M.: Laplace spectra as fingerprints for image recognition. Computer-Aided Design 39, 460–476 (2007) 5. Sclaroff, S., Pentland, A.: Modal matching for correspondence and recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 17, 545–561 (1995) 6. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 511–518 (2007) 7. Zass, R., Shashua, A.: Probabilistic graph and hypergraph matching. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1–8 (2008) 8. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 509–522 (2002) 9. Zhang, D., Lu, G.: A comparative study of Fourier descriptors for shape representation and retrieval. In: Asian Conference on Computer Vision, ACCV (2002) 10. Kuthirummal, S., Jawahar, C., Narayanan, P.: Planar shape recognition across multiple views. In: International Conference on Pattern Recognition, vol. 16, pp. 456–459 (2002) 11. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical harmonic representation of 3d shape descriptors. In: Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, pp. 156–164 (2003) 12. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31, 983–1001 (1998) 13. Dollar, P., Tu, Z., Belongie, S.: Supervised learning of edges and object boundaries. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 2 (2006) 14. Bileschi, S., Wolf, L.: Image representations beyond histograms of gradients: The role of Gestalt descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1–8 (2007) 15. Williams, L., Jacobs, D.: Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation 9, 837–858 (1997) 16. Desolneux, A., Moisan, L., Morel, J.: From gestalt theory to image analysis: a probabilistic approach. Springer, Heidelberg (2007) 17. Kovesi, P.: Image features from phase congruency. Videre: Journal of Computer Vision Research 1, 1–26 (1999) 18. Carman, G., Welch, L.: Three-dimensional illusory contours and surfaces. Nature (1992)
Colour Dynamic Photometric Stereo for Textured Surfaces
Zsolt Jankó 1,2,⋆, Amaël Delaunoy 1, and Emmanuel Prados 1
1 INRIA Rhône-Alpes, Montbonnot, France
2 Computer and Automation Research Institute, Budapest, Hungary
[email protected]
Abstract. In this paper we present a novel method to apply photometric stereo on textured dynamic surfaces. We aim to exploit the high accuracy of photometric stereo and to reconstruct local surface orientation from illumination changes. The main difficulty derives from the fact that photometric stereo requires varying illumination while the object remains still, which makes it quite impractical to use for dynamic surfaces. Using coloured lights gives a clear solution to this problem; however, the system of equations is still ill-posed and it is ambiguous whether the change of an observed surface colour is due to the change of the surface gradient or of the surface reflectance. In order to separate surface orientation from reflectance, our method tracks texture changes over time and exploits the temporal constancy of the surface reflectance. This additional constraint allows us to reformulate the problem as an energy functional minimisation, solved by a standard quasi-Newton method. Our method is tested both on real and synthetic data, quantitatively evaluated and compared to a state-of-the-art method.
1 Introduction and Related Works
1.1 Motivation
Reconstructing dynamic surfaces from images has become a popular research topic of computer vision during the last few years. Extending the time-honoured methods of static surface reconstruction to the dynamic case is reasonable. However, active techniques (laser scanner, structured light projection system, etc.) that project time-varying textures on the surface are rather difficult to apply on deformable surfaces. Passive techniques, such as conventional stereo or multicamera systems, are not biased by temporal variation, and they are more frequently applied for dynamic surfaces [1–7]. However, their accuracy is not sufficient to reconstruct small details, such as wrinkles and creases of garments. Since the seminal work of Woodham [8], photometric stereo has become a powerful technique to recover local surface orientation from brightness variations generated by the shading. In contrast to multi-camera systems, photometric
⋆ Corresponding author.
stereo uses an image sequence acquired from the same viewpoint, under differing known directional illumination conditions, and it is capable of reconstructing small details, such as surface bumpiness, as well. However, when one wishes to apply photometric stereo to dynamic surfaces, one meets a technical difficulty: how to alter the lights while the object remains still. It is clear that for intensively varying surfaces switching the lights is rather inconvenient and impractical. The work of Smith and Smith [9] distinguishes between temporal and spectral multiplexing approaches based on how they isolate the illumination conditions from one another. Temporal multiplexing means adapting static to dynamic photometric stereo, that is, alternating the lights. In [10, 11] the light sources are rapidly pulsed in synchronisation with the camera using a timing controller. This solution is technically difficult to attain and requires slow object motion. Additionally, for a human actor the vibration of the light can be very annoying. Spectral multiplexing approaches can avoid light pulsation by using light frequency to separate the illumination conditions. Here all the light sources are in play simultaneously. This category is also termed colour photometric stereo, as coloured lights are used to recover surface shape [9, 12–14]. Our method falls within the latter category: we use three light sources with colours red, green and blue, respectively. These three colours are distributed almost uniformly across the visible light spectrum (400-700 nm), hence they are easy to separate from one another. Furthermore, RGB is a standard colour model for sensing, representation and display of images and is applied in many conventional cameras. The application of coloured lights to photometric stereo is described in more detail below.
1.2 Problem Formulation
A general model of image formation is
$I(u) = \int_{\lambda=400}^{\lambda=700} \Big( \int_{\Omega} \mathbf{i}^{T}\mathbf{n}(u)\, E(\mathbf{i}, \lambda)\, R(\mathbf{i}, u, \lambda)\, d\mathbf{i} \Big) S(\lambda)\, d\lambda = \int_{\Omega} \mathbf{i}^{T}\mathbf{n}(u) \Big( \int_{\lambda=400}^{\lambda=700} E(\mathbf{i}, \lambda)\, R(\mathbf{i}, u, \lambda)\, S(\lambda)\, d\lambda \Big) d\mathbf{i},$   (1)
where I(u) is the response of a camera sensor at image point u, n(u) the corresponding surface normal, and Ω ⊂ S2 is the set of all possible directions of incoming lights i (S2 being the unit sphere of R3). Denoting the wavelength by λ, E(i, λ) is the light's spectral power distribution, R(i, u, λ) is the reflectance of the surface point projected to the image point u, and S(λ) is the sensor's spectral sensitivity. In this work we assume that the scene is illuminated by three light sources (a red one, λr = 650 nm, a green one, λg = 510 nm, and a blue one, λb = 475 nm) and that it is viewed by a single camera with three sensors which mostly respond to the same wavelengths as the lights (λr, λg and λb for the red, green and blue sensors, respectively); see Fig. 1. The sensors' sensitivity can then be
Fig. 1. The spectral sensitivities of a typical camera. From [15].
well-approximated by Dirac delta functions as Sk(λ) = sk δ(λ − λk), k = r, g, b, with suitable constants sk. Since the sensors are sensitive to different, well-separable wavelengths, using three light sources with three different colours (according to the above: red, green and blue, respectively) means that one image is enough to separate the three lights that are necessary for photometric stereo. Based on these assumptions and approximations, the response of the three sensors, i.e. eq. (1), simplifies to
$I_k(u) = \mathbf{l}_k^{T}\mathbf{n}(u)\, \alpha_k(u), \qquad k = r, g, b.$   (2)
Here lk = ik E(ik, λk) is the unit vector ik towards light source k multiplied by its intensity¹. αk(u) = R(ik, u, λk) sk is proportional to the albedo, i.e. the fraction of incident energy reflected by the surface. As in previous works that consider the same problem, we assume here that the scene is Lambertian. We will also call the triplet α = (αr, αg, αb) the albedo from now on. Carefully examining eq. (2), one can see that the reconstruction problem (i.e. recovering normals and albedos from the observed colours) is ill-posed. Beyond the image colour values Ik(u), the vectors lk are also given, assuming calibrated light sources. The rest of the unknowns are to be recovered, namely the normal n (2 unknowns) and the albedo α (3 unknowns). In contrast, we have only 3 equations, for k = r, g, b. (Note that in conventional photometric stereo, when the surface is illuminated by three light sources separately at three different time instants, we have three colour images instead of one, and 6 additional equations. Although the normals can then easily be calculated, altering the lights is not practical for dynamic surfaces.) In order to solve eq. (2) and to recover surface normals and albedos, one should add further constraints. Such a constraint can be that αr = αg = αb, i.e. the surface is white or gray; however, this assumption is rather restrictive. Another possibility is to capture an additional image under a broad-spectrum white light and to measure the reflectance properties initially. The light source should be located
¹ For simplicity we assume directional light sources at infinite distance.
close to the camera, and the acquisition must be performed at a separate time instant to avoid coupling with the RGB lights; the latter condition makes this approach impractical for time-varying surfaces. The method of Hernandez et al. [13] works on uniformly coloured surfaces, and a calibration tool is employed to estimate this colour. In contrast, we want neither to restrict our method to uniform albedos nor to rely on an initial estimation of the reflectance properties. The work of Smith and Smith [9] meets these requirements, as they use narrow-band colour photometric stereo in the infra-red region of the spectrum, where the reflected light is less sensitive to surface colour changes, i.e. different colours become metameric to one another. Due to this, surface orientation and colour can be separated. However, their system is rather complicated and requires special and expensive tools. In this work we present a novel method that extends the approach of Hernandez et al. [13] to textured surfaces. The principal idea is to track the surface motion and exploit the fact that the surface colour, i.e. the albedo, is constant over time. Based on this assumption we formulate an energy functional according to eq. (2) that penalises temporal changes of the albedo and enforces spatial and temporal smoothness. Contribution. The principal contribution of this work is a novel photometric stereo method for textured dynamic surfaces. To the best of our knowledge, our method is the first in the category of dynamic photometric stereo that also works successfully on textured surfaces, using only a single standard camera with a simple and practical light configuration. The problem is formulated as the optimisation of an energy functional. The method is able to reconstruct local surface orientations and to separate them from the reflectance properties.
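To make the counting of equations and unknowns in eq. (2) concrete, the sketch below simulates the per-channel response of a single Lambertian point under three coloured lights and shows that the normal is recoverable only once the albedo triplet is fixed. It is an illustration only; the light vectors, variable names and example values are our own assumptions, not taken from the paper.

```python
import numpy as np

# Calibrated light vectors l_k = i_k * E(i_k, lambda_k): direction times intensity
# (example values, chosen only for illustration).
L = np.array([[0.5, 0.3, 0.81],    # red light
              [-0.4, 0.2, 0.89],   # green light
              [0.1, -0.5, 0.86]])  # blue light

def render_pixel(n, alpha, L):
    """Per-channel response of eq. (2): I_k = max(l_k . n, 0) * alpha_k."""
    return np.maximum(L @ n, 0.0) * alpha

def normal_from_known_albedo(I, alpha, L):
    """If the albedo triplet is known, eq. (2) becomes linear in n:
    (l_k . n) = I_k / alpha_k, solvable from the three channels."""
    b = I / alpha
    n = np.linalg.solve(L, b)          # 3 equations, 3 unknowns
    return n / np.linalg.norm(n)       # re-impose unit length

# Example: a tilted normal with a non-uniform (textured) albedo.
n_true = np.array([0.2, 0.1, 1.0]); n_true /= np.linalg.norm(n_true)
alpha_true = np.array([0.9, 0.4, 0.7])        # reddish surface point
I = render_pixel(n_true, alpha_true, L)

# With the correct albedo the normal is recovered; with a wrong (e.g. uniform)
# albedo guess it is not: the problem is under-determined when both n and
# alpha are unknown.
print(normal_from_known_albedo(I, alpha_true, L))
print(normal_from_known_albedo(I, np.ones(3), L))
```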
2 Proposed Solution
In this paper, we focus on a representation of dynamic surfaces as animated normal maps. An animated normal map is a sequence of normal maps in which the temporal correspondence is given. As discussed above, we consider a setup of one colour camera and three light sources filtered with red, green and blue filters, respectively. We denote the image sequence by It, the image domain by Dt and the image points by ut ∈ Dt, where t ∈ T denotes a discrete time instant, T = {0, 1, . . . }. Here u0, u1, u2, . . . denote corresponding image points, i.e. points that belong to the same surface patch but generally have different co-ordinates. That is, ut+1 = ut + d(ut), where d(ut) is a 2D displacement vector. The dynamic surface S to be reconstructed can then be represented by the normals, the albedos and the displacement vectors:
\[
S = \{\, n(u_t),\ \alpha(u_t),\ d(u_t) : u_t \in D_t,\ t \in T \,\}. \qquad (3)
\]
Although the normal map representation does not reflect the complete structure of the surface, it can easily be integrated to obtain a depth map and a surface
mesh. Note that recovering a depth map from normals gives poor low-frequency detail; however, if a rough surface mesh is already given, a normal map can be used to represent fine details, i.e. small variations of the surface (such as creases on garments), as a bump map.
2.1 Energy Functional
The optimality of surface S is quantified by an energy functional composed of two terms. The first term forces the model to satisfy the photometric constraint while enforcing temporal constancy of the albedo and unit-length normals (data term); the second term guarantees spatial and temporal smoothness (regularisation term):
\[
E(S) = E_D(S) + E_R(S). \qquad (4)
\]
The first part of the data term is local; it does not consider correspondences between consecutive frames. In accordance with eq. (2) it measures the photometric error:
\[
E_{Dph}(S) = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{u_t \in D_t} \sum_{k=r,g,b} \Bigl( I_{t,k}(u_t) - \alpha_k(u_t)\,\max\bigl(l_k^{T} n(u_t),\, 0\bigr) \Bigr)^{2}. \qquad (5)
\]
To guarantee unit length of the normals and temporal constancy of the albedo we define the following terms:
\[
E_{Dn}(S) = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{u_t \in D_t} \bigl( n(u_t)^{T} n(u_t) - 1 \bigr)^{2}, \qquad (6)
\]
\[
E_{Da}(S) = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{u_t \in D_t} \bigl\| \alpha(u_{t+1}) - \alpha(u_t) \bigr\|^{2}. \qquad (7)
\]
The data term is then the weighted sum of the three terms above:
\[
E_D(S) = \mu_{ph} E_{Dph}(S) + \mu_n E_{Dn}(S) + \mu_a E_{Da}(S). \qquad (8)
\]
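A direct reading of eqs. (5)-(8) on dense arrays could look as follows; the array layout, the weights and the assumption that corresponding pixels share array indices (i.e. the images are pre-warped with d(u_t)) are our simplifications, not the authors' implementation.

```python
import numpy as np

def data_term(I, N, A, L, mu_ph=1.0, mu_n=1.0, mu_a=1.0):
    """Data term of eq. (8) for a sequence of T frames.

    I : (T, H, W, 3) observed colour images
    N : (T, H, W, 3) normal field n(u_t)
    A : (T, H, W, 3) albedo field alpha(u_t)
    L : (3, 3)       rows are the calibrated light vectors l_r, l_g, l_b
    Corresponding pixels are assumed to be stored at the same (row, col)
    index in every frame, i.e. the images are already warped by d(u_t).
    """
    # Photometric error, eq. (5): I_k ~ alpha_k * max(l_k . n, 0)
    shading = np.maximum(np.einsum('thwc,kc->thwk', N, L), 0.0)
    E_ph = np.mean((I - A * shading) ** 2) * 3          # sum over k, mean over t, u

    # Unit-length constraint on the normals, eq. (6)
    E_n = np.mean((np.sum(N * N, axis=-1) - 1.0) ** 2)

    # Temporal constancy of the albedo, eq. (7)
    E_a = np.mean(np.sum((A[1:] - A[:-1]) ** 2, axis=-1))

    return mu_ph * E_ph + mu_n * E_n + mu_a * E_a
```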
The regularisation term measures the smoothness of the model. It is a weighted sum of three terms:
\[
E_R(S) = \mu_s E_{Rs}(S) + \mu_r E_{Rr}(S) + \mu_t E_{Rt}(S). \qquad (9)
\]
The first term quantifies the spatial smoothness, i.e. the deviation of neighbouring normals:
\[
E_{Rs}(S) = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{u_t \in D_t} \frac{1}{|N|} \sum_{v_t \in N(u_t)} \bigl\| n(u_t) - n(v_t) \bigr\|^{2}, \qquad (10)
\]
where N(u_t) denotes the set of neighbouring image points. The second term measures the rigidity of the model: we require the displacements of nearby points to be as similar as possible:
\[
E_{Rr}(S) = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{u_t \in D_t} \frac{1}{|N|} \sum_{v_t \in N(u_t)} \bigl\| d(u_t) - d(v_t) \bigr\|^{2}. \qquad (11)
\]
Third, normals and displacement vectors are encouraged to change smoothly in time (temporal smoothness):
\[
E_{Rt}(S) = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{u_t \in D_t} \Bigl( \bigl\| n(u_{t+2}) - 2\,n(u_{t+1}) + n(u_t) \bigr\|^{2} + \bigl\| d(u_{t+2}) - 2\,d(u_{t+1}) + d(u_t) \bigr\|^{2} \Bigr), \qquad (12)
\]
where u_{t+1} = u_t + d(u_t) and u_{t+2} = u_t + d(u_t) + d(u_t + d(u_t)).
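The regularisation terms (10)-(12) admit an equally direct, if naive, evaluation; the 4-connected neighbourhood handling and the array layout below are our own simplifications.

```python
import numpy as np

def regularisation_term(N, D, mu_s=1.0, mu_r=1.0, mu_t=1.0):
    """Regularisation term of eq. (9).

    N : (T, H, W, 3) normal field,  D : (T, H, W, 2) displacement field.
    """
    def neighbour_diff(X):
        # Squared differences to the right and bottom neighbours, averaged
        # as a stand-in for the 1/|N| sum over the neighbourhood N(u_t).
        dx = np.sum((X[:, :, 1:] - X[:, :, :-1]) ** 2, axis=-1)
        dy = np.sum((X[:, 1:, :] - X[:, :-1, :]) ** 2, axis=-1)
        return 0.5 * (np.mean(dx) + np.mean(dy))

    E_s = neighbour_diff(N)                       # spatial smoothness, eq. (10)
    E_r = neighbour_diff(D)                       # rigidity, eq. (11)

    # Temporal smoothness, eq. (12): second differences along time.
    second = lambda X: np.mean(np.sum((X[2:] - 2 * X[1:-1] + X[:-2]) ** 2, axis=-1))
    E_t = second(N) + second(D)

    return mu_s * E_s + mu_r * E_r + mu_t * E_t
```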
2.2 Optimisation
In optimisation, quasi-Newton methods are widely used to minimise energy functionals. We have selected one of the most popular members of this class, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, to optimise our energy functional (4). In our experiments we used the limited-memory variant (L-BFGS) of the algorithm, included in the ALGLIB library implemented by Sergey Bochkanov [16]. For an iterative optimisation method such as L-BFGS, it is essential to choose a good starting point. Our initialisation consists of three phases: first, motion is tracked to estimate the displacement of the points; second, a rough guess of the albedos is obtained from the variation of the colour values; and finally, the normals are computed using the albedos. Because the scene is illuminated by three coloured light sources, image colour alone is not sufficient to follow the surface motion, as surface points change their colour significantly over time. Consider fig. 2a, where the same surface patches change their colour from red to green and blue depending on whether they face the red, the green or the blue light source. Although the colours vary globally, the local variations of the texture remain the same and are therefore traceable. Accordingly, we apply a feature detector, namely the Sobel edge detector, to emphasise local variations in the images. (See fig. 2b.) To track the motion and estimate the displacement vectors d(ut) we applied the high-accuracy optical flow method of Brox et al. [17] and its variant [18], implemented by Chari [19], on the edge maps. Note that the related work of Hernandez et al. [13] also uses optical flow to track the deformation of the surface. However, they use optical flow after the normal and depth maps have already been computed, and their goal is to put them in correspondence. In contrast, we use optical flow before obtaining the normals, in order to constrain the ill-posed problem by assuming temporally constant albedos. Some other methods [11, 20, 21] also use optical flow to improve photometric stereo for dynamic objects, but since they do not use coloured light sources (no spectral multiplexing), we do not discuss them in detail. With the estimated displacement vectors in hand, we can give a rough approximation of the albedos. A closer inspection of eq. (2) shows that Ik(ut) reaches its maximum at a time instant t if n(ut) points towards light source k, and at that time lk^T n(ut) = ||lk||, i.e. αk(ut) = Ik(ut)/||lk||.
Fig. 2. a) Due to coloured illumination, deforming the surface can change its colour from red to green and blue; b) in contrast, edge detection emphasises local colour changes
Consequently, if we assume that the surface is dynamic enough and the sequence long enough that all normals turn towards each light source at least once, the following estimate is a good approximation to the albedos:
\[
\alpha^{*}_{k}(u) = \max_{t \in T} \frac{I_k(u_t)}{\| l_k \|}, \qquad k = r, g, b. \qquad (13)
\]
In the final step of the initialisation, the normals are computed by solving a system of linear equations based on eq. (2), using the already estimated albedos.
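The initialisation just described could be sketched as follows; treating warped array indices as the tracked correspondences and solving an unclamped 3x3 system per pixel are our simplifications of the authors' procedure.

```python
import numpy as np

def initialise_albedo_and_normals(I_warped, L):
    """Rough initialisation of albedos (eq. 13) and normals (from eq. 2).

    I_warped : (T, H, W, 3) images warped so that a surface point keeps the
               same (row, col) index over time (using the tracked d(u_t)).
    L        : (3, 3) rows are the calibrated light vectors l_r, l_g, l_b.
    """
    light_norms = np.linalg.norm(L, axis=1)                 # ||l_k||

    # Eq. (13): per-channel temporal maximum of the normalised intensities.
    alpha0 = I_warped.max(axis=0) / light_norms             # (H, W, 3)

    # Normals from eq. (2): for each pixel solve L n = I / alpha (3x3 system).
    rhs = I_warped[0] / np.maximum(alpha0, 1e-6)            # use the first frame
    n0 = np.linalg.solve(L[None, None], rhs[..., None])[..., 0]
    n0 /= np.maximum(np.linalg.norm(n0, axis=-1, keepdims=True), 1e-6)
    return alpha0, n0
```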
3 Experiments and Discussion
3.1 Datasets
Three real test sequences have been created in our studio [22] using three coloured light sources and one camera. The datasets contain sequences of about 150 images acquired from the same viewpoint; the image resolution is 1624 × 1224. Since our method requires calibrated light sources, we applied a simple calibration procedure. For this we used some additional calibrated cameras and captured a short sequence of a white balance card, which provides a precise uniform surface that is spectrally neutral under all lighting conditions. (See fig. 3.) By detecting the card's corners, their 3D co-ordinates and hence the fitted plane can be determined using the multi-camera setup. The light sources' properties (direction and intensity) are then computed from the plane's normal and the observed sensor response.
Fig. 3. Calibration sequence: from the corners of the white balance card the plane’s varying normal can be computed and used to calibrate light sources
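One way to realise this calibration step is a per-channel least-squares fit, as sketched below; absorbing the (white) card albedo into l_k and using plain least squares over all calibration frames are our own assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np

def calibrate_lights(card_normals, card_intensities):
    """Estimate the light vectors l_k (direction * intensity) from a white
    balance card sequence.

    card_normals     : (F, 3) plane normal of the card in each frame
    card_intensities : (F, 3) mean sensor response of the card per channel
    For a white card, eq. (2) reads I_k = l_k . n (the constant card albedo
    is absorbed into l_k), so each channel is a linear least-squares problem.
    """
    L = np.zeros((3, 3))
    for k in range(3):
        L[k], *_ = np.linalg.lstsq(card_normals, card_intensities[:, k], rcond=None)
    return L   # rows are the estimated l_r, l_g, l_b
```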
Fig. 4. Results on real datasets: input images (left), albedos (centre), normals (right)
In order to quantitatively assess the results, we have also created a synthetic dataset (fig. 6, first column). The dataset contains a 10-frame sequence of 800 × 800 images of textured waves. Ground-truth normals allow us to measure the reconstruction error and to compare our result to another state-of-the-art method.
3.2 Results
Fig. 4 shows the results on the real datasets: the original input images, the albedo maps and the colour-coded normal images. Our method is capable of separating surface reflectance (albedo) from surface orientation (normal). This is clearly demonstrated by the results: the albedo map is constant over time and the red-green-blue colour changes (visible in the input images) disappear; at the same time, the texture, for instance the stripes of the shirt, is not present in the normal image. If we assumed uniform albedo, as in [13], the stripes would appear in the normal image instead of the albedo map. To visualise the quality of the results we have integrated the normal maps to obtain 3D surfaces. For this we have used the method of Kovesi [23] and his implementation available on his web-site [24]. Fig. 5 shows examples of the surfaces obtained for each real dataset. Our supplemental material contains the complete videos of the reconstructed normals and surfaces.
Fig. 5. Examples of reconstructed surfaces for each real dataset
We have tested our method on synthetic data as well and compared the result to the state-of-the-art method of Hernandez et al. [13]. Since their method assumes uniform albedo, texture changes incorrectly appear in their normal maps. (See figs. 6 and 7.) In contrast, our method clearly separates the albedo and normal maps. To quantitatively assess the results, we have compared the obtained normals to the ground-truth data. Considering the angle (in degrees) between the reconstructed and the ground-truth normals, the mean and the confidence interval were calculated: 3.45° ± 0.0035 for our results and 15.32° ± 0.0106 for the method of [13]. Fig. 8 shows the histogram of the error image, which illustrates the error distribution. Note that the reconstruction accuracy decreases towards the edge of the image, where tracking is less precise (fig. 8, left).
Fig. 6. Results on a synthetic dataset: input images (left), albedos (centre), normals (right). First two rows: our method. Last two rows: assuming uniform albedo [13].
Fig. 7. Examples of reconstructed surfaces for the synthetic dataset. Left: our method. Right: assuming uniform albedo [13].
Fig. 8. Error images (left) and their histogram (right). First row: our method. Second row: assuming uniform albedo [13].
4 Conclusion
In this paper we have discussed the problem of applying photometric stereo to dynamic surfaces. We have presented a novel colour photometric stereo method that successfully recovers normals for textured surfaces as well. Our method is an extension of [13], which works only for surfaces with uniform albedo. Assuming temporal constancy of the albedos and initially tracking the surface's local texture changes, we have formulated the problem as an optimisation that forces the model to fit the photometric constraint and encourages spatial and temporal smoothness. Tests on real and synthetic datasets demonstrate the efficiency of our method. Acknowledgments. This work was supported by the French National Agency for Research (ANR) under grant ANR-06-MDCA-007 and partially by the NKTH-OTKA grant CK78409 and by the HUNOROB project (HU0045, 0045/NA/2006-2/P-9), a grant from Iceland, Liechtenstein and Norway through the EEA Financial Mechanism and the Hungarian National Development Agency. The authors thank Visesh Chari for his optical flow code and Thomas Dupeux for his help with data acquisition.
References 1. Aganj, E., Pons, J.P., S´egonne, F., Keriven, R.: Spatio-temporal shape from silhouette using four-dimensional Delaunay meshing. In: Proc. ICCV 2007 (2007)
2. Courchay, J., Pons, J.-P., Monasse, P., Keriven, R.: Dense and accurate spatiotemporal multi-view stereovision. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5995, pp. 11–22. Springer, Heidelberg (2010) 3. Furukawa, Y., Ponce, J.: Dense 3D motion capture from synchronized video streams. In: Proc. CVPR 2008 (2008) 4. Pons, J.P., Boissonnat, J.D.: Delaunay deformable models: Topology-adaptive meshes based on the restricted Delaunay triangulation. In: Proc. CVPR 2007 (2007) 5. Starck, J., Hilton, A.: Surface capture for performance based animation. IEEE Computer Graphics and Applications 27, 21–31 (2007) 6. Vedula, S., Baker, S., Kanade, T.: Image-based spatio-temporal modeling and view interpolation of dynamic events. ACM Trans. on Graphics 24, 240–261 (2005) 7. Vlasic, D., Baran, I., Matusik, W., Popovi´c, J.: Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics 27 (2008) 8. Woodham, R.J.: Photometric stereo: A reflectance map technique for determining surface orientation from image intensity. In: Image Understanding Systems and Industrial Applications, Proc. SPIE, vol. 155, pp. 136–143 (1978) 9. Smith, M.L., Smith, L.N.: Dynamic photometric stereo–a new technique for moving surface analysis. Image and Vision Computing 23, 841–852 (2005) 10. Timo, P., Detlef, P., Pertti, K. (inventors): Arrangement and method for inspection of surface quality. European Patent Application, EP 1 030 173 A1 (2000) 11. Vlasic, D., Peers, P., Baran, I., Debevec, P., Popovi´c, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. In: Proc. SIGGRAPH Asia 2009, pp. 1–11 (2009) 12. Drew, M.S.: Shape from color. Technical report, School of Computing Science, Simon Fraser University, Vancouver, B.C. (1992) 13. Hernandez, C., Vogiatzis, G., Brostow, G., Stenger, B., Cipolla, R.: Non-rigid photometric stereo with colored lights. In: Proc. ICCV 2007, pp. 1–8 (2007) 14. Kontsevich, L., Petrov, A., Vergelskaya, I.: Reconstruction of shape from shading in color images. J. Opt. Soc. Am. A 11, 1047–1052 (1994) 15. Finlayson, G.D., Hordley, S.D., Lu, C., Drew, M.S.: On the removal of shadows from images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 59–68 (2006) 16. Bochkanov, S.: ALGLIB, http://www.alglib.net 17. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004) 18. Sand, P., Teller, S.: Particle video: Long-range motion estimation using point trajectories. In: Proc. CVPR 2006, pp. 2195–2202 (2006) 19. Chari, V.: High accuracy optical flow, http://www.mathworks.com/matlabcentral/ fileexchange/17500-high-accuracy-optical-flow 20. Ma, W.C., Hawkins, T., Peers, P., Chabert, C.F., Weiss, M., Debevec, P.: Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In: Proc. EGSR 2007 (2007) 21. Wenger, A., Gardner, A., Tchou, C., Unger, J., Hawkins, T., Debevec, P.: Performance relighting and reflectance transformation with time-multiplexed illumination. ACM Trans. Graph. 24, 756–764 (2005) 22. INRIA: GrImage Platform, http://grimage.inrialpes.fr/index.php 23. Kovesi, P.: Shapelets correlated with surface normals produce surface. In: Proc. ICCV 2005, pp. 994–1001 (2005) 24. 
Kovesi, P.: MATLAB and octave functions for computer vision and image processing, http://www.csse.uwa.edu.au/~ pk/Research/MatlabFns/
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues Min Li, Wei Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences {mli,wchen,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. This paper proposes a novel particle filtering framework for multi-target tracking by using online learned class-specific and instance-specific cues, called Data-Driven Particle Filtering (DDPF). The learned cues include an online learned geometrical model for excluding detection outliers that violate geometrical constraints, global pose estimators shared by all targets for particle refinement, and online Boosting based appearance models which select discriminative features to distinguish different individuals. Targets are clustered into two categories. A separated-target is tracked by an ISPF (incremental self-tuning particle filtering) tracker, in which particles are incrementally drawn and tuned to their best states by a learned global pose estimator; a target-group is tracked by a joint-state particle filtering method in which occlusion reasoning is conducted. Experimental results on challenging datasets show the effectiveness and efficiency of the proposed method.
1 Introduction
Multi-target tracking (MTT) plays an important role in applications like visual surveillance, intelligent transportation and behavior analysis. It is still a challenging problem in the computer vision community. Challenges include, but are not limited to, target auto-initialization, inter-object occlusions and the intensive computational cost caused by joint-state optimization. Extensive work has been done on multi-target tracking over the last decades. Some of it formulates MTT as a data-association problem between candidate regions and target trajectories. The multiple hypothesis tracker (MHT) [1] and the joint probabilistic data association filter (JPDAF) [2] are two classical data-association methods widely used in the computer vision context, e.g. the MHT in [3] and JPDAF in [4]. However, MHT/JPDAF has to exhaust all possible associations and is thus computationally intensive. Recently, sampling-based techniques have been proposed for data association. In [5], an MCMC (Markov Chain Monte Carlo) based variant of the auxiliary variable particle filter is proposed, in which the MCMC sampling simulates the probability of a data association. Since data-association based methods depend heavily on reliable low-level detection results and often do not consider the interactions between targets, they are sensitive to poor detections or occlusions.
There is also much work that focuses on multi-target tracking by Bayesian inference in state space. Data association may be involved, but it is not necessary. Isard et al. [6] propose a Bayesian multi-blob tracker which combines a multi-blob likelihood function with a particle filter. In [7], Adaboost and mixture particle filters [8], which model the filtering distribution as an M-component mixture model, are combined to detect and track multiple targets fully automatically. In [9], the joint posterior is approximated by a variational density, based on which a set of autonomous yet collaborative trackers are used to overcome the coalescence problem (several trajectories sticking to the same target). Qu et al. [10] propose an interactively distributed multi-target tracking algorithm using a magnetic-inertia potential model to solve the multiple object labeling problem in the presence of occlusions. Khan et al. [11] propose a novel MCMC sampling step to obtain an efficient multi-target filter in joint-state space. However, this model cannot deal with occlusion. Yang et al. [12] propose a game-theoretic multiple target tracking algorithm, in which the tracking problem is solved by finding the Nash Equilibrium of a game. Zhang et al. [13] propose a species-based PSO (particle swarm optimization) algorithm for multiple object tracking, which brings a new view to the tracking problem from a swarm intelligence perspective. Hess et al. [14] present a discriminative training method for particle filters and attempt to directly optimize the filter parameters in response to observed errors. Breitenstein et al. [15] use the continuous confidence of pedestrian detectors and online trained, instance-specific classifiers as a graded observation model to track multiple persons in the particle filtering framework. In this paper, a data-driven particle filtering (DDPF) method for multi-target tracking is proposed, with multi-human tracking used as a case study. Object detection is not a necessary part, but considering target auto-initialization, detection is still introduced into the tracking process. The proposed DDPF framework is not a pure stochastic state inference method; instead, a set of class-specific and instance-specific cues, i.e. models learned online from low-level data, is used to make the state inference process more efficient and reliable. Targets are clustered into two categories. A separated-target is tracked by an ISPF (incremental self-tuning particle filtering) [16] method, in which particles are incrementally drawn and tuned to their best states by a learned global pose estimator; a target-group is tracked by a joint-state-space particle filtering method in which occlusion reasoning between group members is conducted. The remainder of the paper is organized as follows. Section 2 gives an overview of the proposed framework. Section 3 introduces our detection and data association method. The Bayesian inference process for multi-target tracking is described in Section 4. Experimental results and analysis are presented in Section 5. Finally, we draw our conclusions in Section 6.
2 Overview of the Proposed Framework
In the particle filtering framework, tracking can be cast as an inference task in a Markov model with hidden state variables [17]:
\[
p(X_t \mid O_t) \propto p(o_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid O_{t-1})\, dX_{t-1} \qquad (1)
\]
where Xt describes the state variable at time t and Ot = {o1, o2, ..., ot} is the set of observations. The tracking process is governed by the observation model p(ot|Xt) and the dynamic model p(Xt|Xt−1) between two states. In our framework, a separated-target is tracked by a single particle filter, and Xt contains only a single state St (St = {x, y, s}, i.e. x, y translation and scale); for a group of n targets, Xt contains the joint states of the targets St = {Sit}, i = 1, ..., n, as well as the occlusion matrix πt = {πt^{ij}}, i < j ≤ n, with n(n − 1)/2 independent elements. The occlusion variable πt^{ij} takes three values: −1 means that target i is occluded by target j, 1 means the opposite, and 0 means that they have no occlusion relationship.
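For reference, a standard bootstrap approximation of the recursion (1) for a single target is sketched below; the Gaussian random-walk motion model (with standard deviations echoing the Σ_S used later in the experiments) and the placeholder likelihood are illustrative assumptions, not the observation model defined in Section 4.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_std=(7.0, 7.0, 0.02)):
    """One bootstrap approximation of eq. (1).

    particles : (N, 3) states (x, y, s),  weights : (N,) normalised weights,
    likelihood: callable particle -> p(o_t | X_t).
    """
    N = len(particles)
    # Resample according to the previous posterior p(X_{t-1} | O_{t-1}).
    idx = np.random.choice(N, size=N, p=weights)
    particles = particles[idx]
    # Propagate with the dynamic model p(X_t | X_{t-1}) (Gaussian random walk).
    particles = particles + np.random.randn(N, 3) * np.array(motion_std)
    # Re-weight with the observation model p(o_t | X_t).
    weights = np.array([likelihood(p) for p in particles])
    weights = weights / weights.sum()
    return particles, weights
```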
Fig. 1. The flowchart of the proposed Data-Driven Particle Filtering (DDPF)
Fig. 1 shows the flowchart of the proposed Data-Driven Particle Filtering framework for multi-target tracking. In the target auto-initialization module, a detector is used to detect targets in video frames, and detection outliers that violate perspective projection are excluded by a geometrical model learned with Locally Weighted Projection Regression (LWPR) [18]. The remaining detections are then associated with existing targets by a data-association method in which online Boosting classifiers are utilized. New targets are initialised from unassociated detections, and together with the existing targets they form a Target Pool. We then classify the targets in the Target Pool into two categories, separated-target and target-group, using a simple clustering method. A separated-target is tracked by an ISPF tracker in single-state space, and a target-group is tracked by an MCMC-PF method in joint-state space with occlusion reasoning. Global pose estimators and instance-specific Boosting classifiers are utilized in the state optimization process to make it more efficient and reliable. After deleting vanished targets, the obtained optimal states are used to update the global pose estimators, the online Boosting classifiers and the objects' appearance models.
3 Detection and Association
Since the PSOF (Pyramidal Statistics of Oriented Filtering) [19] method exhibits good performance under different imaging conditions, we use it in our human detection module. Geometrical Model. No detector is perfect; any detector may produce false positives. To eliminate outliers that obviously violate perspective projection, like object 2 shown in Fig. 2 (a), a geometrical model that describes the relationship between the object's height (denoted by H) and the object's foot point position (denoted by (xf, yf), see Fig. 2 (b)) is learned by an online regression method called Locally Weighted Projection Regression (LWPR) [18], which uses piece-wise linear models to approximate non-linear functions. Details of LWPR can be found in [18]. In every frame, all current detections are used as training samples and are fed to the LWPR regressor to incrementally learn the geometrical model H(xf, yf). A detection with height h and foot point (xf, yf) is regarded as an outlier if:
\[
\bigl| H(x_f, y_f) - h \bigr| > T_h\, H(x_f, y_f) \qquad (2)
\]
where Th is a threshold (usually set to 0.3). An example of the learned geometrical model is shown in Fig. 2 (c); it is nearly a plane. The surface shape of the geometrical model depends on the number and locations of the vanishing points of the scene structures, and a planar surface usually indicates one-point perspective. One advantage of the proposed geometrical model is that it can incrementally learn scene geometry without camera calibration.
Fig. 2. An example of online learned geometrical model for excluding detection outliers that violate perspective projection. (a) Detection results. Object 2 (in red) is an outlier. (b) Definitions of foot point and object height. (c) Learned surface of object height vs object’s foot point. Blue points are online collected training samples.
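The outlier test could be wired around any online height regressor as in the sketch below; the HeightModel-style interface (predict/update) is a placeholder of ours standing in for LWPR, and only the rejection rule of eq. (2) is taken from the text.

```python
def filter_detections(detections, height_model, th=0.3):
    """Keep detections whose height is consistent with the learned scene model.

    detections   : list of dicts with 'foot' = (x_f, y_f) and 'height' = h
    height_model : placeholder regressor with predict((x_f, y_f)) -> H and
                   update(samples) for incremental learning (LWPR in the paper)
    """
    inliers = []
    for det in detections:
        H = height_model.predict(det['foot'])
        # Eq. (2): reject if the height deviates too much from the prediction.
        if abs(H - det['height']) <= th * H:
            inliers.append(det)
    # All current detections are used as training samples for later frames.
    height_model.update([(d['foot'], d['height']) for d in detections])
    return inliers
```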
Data Association. After object detection and filtering the outliers, detection results should be associated with trajectories of old targets. First, a similarity matrix between trajectory-detection pairs is computed. The similarity score Simda between a trajectory tr and a detection d is computed as a product of appearance similarity and distance similarity:
\[
Sim_{da}(tr, d) = Sim_c(tr, d) \cdot p_N(|d - tr|) \qquad (3)
\]
where p_N(|d − tr|) ∼ N(|d − tr|; 0, σ²_det) is a Gaussian used to measure the distance similarity between tr and d, and Sim_c(tr, d) is the output (scaled to (0, 1) by a sigmoid function) of an online Boosting classifier, used to measure the appearance similarity. Then a simple matching algorithm is applied: the pair (tr∗, d∗) with the maximum score is iteratively selected, and the rows and columns belonging to trajectory tr∗ and detection d∗ are deleted from the similarity matrix. Note that only an association with a matching score above a threshold is considered a valid match.
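The greedy matching step could be implemented as follows; the similarity matrix is assumed to be precomputed with eq. (3), and the threshold value is a placeholder.

```python
import numpy as np

def greedy_association(sim, min_score=0.3):
    """Iteratively pick the best (trajectory, detection) pair from a
    similarity matrix and remove its row and column.

    sim : (num_trajectories, num_detections) matrix of Sim_da scores (eq. 3)
    Returns a list of (trajectory_index, detection_index) matches.
    """
    sim = sim.copy().astype(float)
    matches = []
    while sim.size and sim.max() > min_score:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((i, j))
        sim[i, :] = -np.inf      # delete the row of trajectory i
        sim[:, j] = -np.inf      # delete the column of detection j
    return matches
```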
Fig. 3. Comparison of appearance similarity matrices. (a) trajectories and detections. (b) obtained by feature template matching. (c) obtained by Boosting classifiers. In (b) and (c), yellow block means the maximum of a column and red means the second maximum.
Online Boosting Classifiers. Instance-specific Boosting classifiers are used in data association, as well as in the observation model for tracking. Our Boosting classifier is similar to [20] and is trained online on one target against all others. Patches used as positive training examples are sampled from the bounding box of the optimal state given by the tracking process. The negative training set is sampled from all other targets, augmented by background patches. Fig. 3 shows a comparison of appearance similarity matrices obtained by feature template matching (see Section 4.2) and by Boosting classifiers. As shown in Fig. 3 (b) and (c), the discriminability of online Boosting is much more powerful than that of feature template matching: the average ratio of the column maximum to the second maximum is 14.5 for online Boosting, while it is only 1.3 for feature template matching.
4 Tracking by Particle Filtering
To decompose the full joint-state optimization problem in multi-target tracking into a set of "easy-to-handle" low-dimensional optimization problems, we adopt a "divide-and-conquer" strategy. Targets are clustered into two categories: separated-target and target-group. A separated-target is tracked by a single-state particle filter and a target-group is tracked by a joint-state particle filter. A set of targets G = {tri}, i = 1, ..., n, with n ≥ 2 is considered a target-group if:

Clustering Rule: ∀ tri ∈ G, ∃ trj ∈ G (j ≠ i), s.t. |Si ∩ Sj| / min(|Si|, |Sj|) > Tc
Fig. 4. An example of training and using of the full-body pose estimator. (a) the training process. (b) pose-tuning process. S0 (red) is a random particle, S3 (yellow) is the tuned particle with the maximum similarity, others are intermediate states (black) (c) the curve of similarity vs iteration# for (b).
where Si is the optimal state of target tri at frame #t-1 or its predicted state at frame #t using only the previous velocity, | · | is the area of a state, and Tc is a threshold (usually set to 0.4). This rule places target pairs that have, or are about to have, significant overlapping areas in the same group (a grouping sketch is given below). Targets that do not belong to any group are separated targets.
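The clustering rule amounts to a pairwise overlap test followed by grouping into connected components, e.g. as sketched below; the bounding-box representation of a state and the union-find grouping are our own choices.

```python
def overlap_ratio(a, b):
    """|S_i intersect S_j| / min(|S_i|, |S_j|) for boxes (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return (ix * iy) / min(a[2] * a[3], b[2] * b[3])

def cluster_targets(boxes, tc=0.4):
    """Group targets whose predicted boxes overlap by more than T_c.
    Returns (groups, separated): lists of index lists and single indices."""
    n = len(boxes)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if overlap_ratio(boxes[i], boxes[j]) > tc:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    groups = [c for c in clusters.values() if len(c) > 1]
    separated = [c[0] for c in clusters.values() if len(c) == 1]
    return groups, separated
```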
4.1 Global Pose Estimators
Sampling-based stochastic optimization methods usually need to sample a large number of particles, a number that may grow exponentially with the state dimension. However, most particles contribute almost nothing to the state estimation because of their extremely small weights, yet they consume most of the computational resources. To drastically reduce this unnecessary computational cost, ISPF (Incremental Self-tuning Particle Filtering) [16] uses an online learned pose estimator to guide random particles towards the correct directions, so that every random particle becomes "intelligent" and dense sampling is unnecessary. Inspired by [16], we also use pose estimators in multi-target tracking. Unlike [16], which trains one pose estimator per target, we train two global pose estimators online for all targets. One, ffb, is trained with full-body human samples; the other, fhs, is trained with only the head-shoulder parts of human samples. fhs is used when most of a target is occluded by other targets and only the head-shoulder part is visible. Fig. 4 shows an example of how to train and use a pose estimator. As shown in Fig. 4 (a), we apply state perturbations dS around the optimal state S of each target to obtain training samples; HOG (Histogram of Oriented Gradients) [21]-like features F are then extracted and LWPR is used to learn the regression function dS = f(F). By using the regression function, as shown in
Fig. 4 (b) and (c), given an initial random state around a target, we can guide it to move to its best state, the one with maximum similarity, in several iterations (each iteration consists of two main steps: ΔSi = f(F(Si)) and Si+1 = Si − ΔSi; see [16] for details).
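The pose-tuning loop could look like the sketch below; the feature extractor, the learned regressor and the similarity function are placeholders for the HOG-like features, the LWPR model and the tracker's appearance similarity, and the stopping rule is our own simplification.

```python
import numpy as np

def tune_particle(state, extract_features, pose_regressor, similarity,
                  max_iter=5, eps=1e-3):
    """Guide a random particle (a numpy state vector) towards its best state.

    Each iteration applies the two steps from the text:
        dS      = f(F(S_i))        (predicted state offset)
        S_{i+1} = S_i - dS
    extract_features, pose_regressor and similarity are placeholders for the
    HOG-like feature extraction, the learned LWPR regressor and the tracker's
    appearance similarity.
    """
    best_state, best_sim = state, similarity(state)
    for _ in range(max_iter):
        dS = pose_regressor(extract_features(state))
        state = state - dS
        sim = similarity(state)
        if sim > best_sim:
            best_state, best_sim = state, sim
        if np.linalg.norm(dS) < eps:      # converged: predicted offset is tiny
            break
    return best_state, best_sim
```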
4.2 Observation Model
To make the appearance model robust to partial occlusions, we use a part-based representation for each target (parts p1 ∼ p7 in Fig. 5 (a)). As shown in Fig. 5 (a), each part is represented by an L2-normalized HOG histogram and an L2-normalized IRG (Intensity/Red/Green) color histogram: Fj = {Fj^hog, Fj^irg}. The final representation is the concatenation of all part features, F = {Fj}, j = 1, ..., m, where m is the number of parts. Note that this feature F is also the input of the online Boosting classifiers, and it is not the same F as used in the pose estimator.
Fig. 5. Part based object representation and calculation of the visibility vector. (a) Target is divided into m parts (denoted by {pj }m j=1 , here m=7), and each part is represented by a HOG histogram and an IRG (Intensity/Red/Green) histogram denoted as Fj ={Fjhog , Fjirg }. (b) An example of calculation of visibility vector. Elliptical regions are the valid areas of targets. Knowing the configuration of states {Si } and their occlusion relations {π ij }, the visibility vector of S1 can be easily obtained: (1,1,1,1,0,1,1).
Part-based Feature Template Matching. Using only the online-trained Boosting classifiers for the appearance model may suffer from occlusions, so a part-based template matching method is also introduced into our appearance model. Given the state configuration X = {S, π} of a target-group (subscript t is dropped), where S = {Si}, i = 1, ..., n, is the joint state and π = {π^ij}, 1 ≤ i < j ≤ n, is the occlusion matrix, an m-dimensional visibility vector V = (V1, ..., Vm) can be calculated for each target (1: visible, 0: invisible). In our work, a part is invisible if the ratio of its area covered by other targets exceeds some threshold (usually set to 0.5). Fig. 5 (b) shows an example of how to calculate the visibility vector. For a separated target, all parts are visible. The template similarity score of a state S (subscript dropped) to a target tr is defined as:
\[
Sim_T(tr, S) = 1 - \frac{\sum_{j=1}^{m} \| F_j - F'_j \| \cdot V_j}{2 \cdot \sum_{j=1}^{m} V_j} \qquad (4)
\]
where F = {Fj}, j = 1, ..., m, is the feature vector specified by S, F' = {F'_j} is the reference template of target tr, and V = (V1, ..., Vm) is the visibility vector. If Σj Vj = 0, which means that target tr is totally occluded by other targets, SimT(tr, S) is set to the average template matching score of background patches.

Multi-Source Similarity Fusion. Given a random state S, its appearance similarity Sima to a target tr is defined as a weighted combination of multiple sources:
\[
Sim_a(tr, S) = \alpha_1 \cdot p_N(|d^{*} - S|) + \alpha_2 \cdot Sim_T(tr, S) + \alpha_3 \cdot Sim_c(tr, S) \qquad (5)
\]
where d∗ is the associated detection of target tr and {αi}, i = 1, 2, 3, are the weights of the different sources, with α1 + α2 + α3 = 1 (α1 is set to 0 if there is no associated detection). The first term describes how close S is to the associated detection d∗, the second term is the similarity obtained by template matching, and the third term is the similarity obtained by the Boosting classifier of target tr. We usually set the weights α1, α2, α3 to 0.2, 0.4 and 0.4, respectively. The final observation model of a target group G = {tri}, i = 1, ..., n, with state configuration X (subscript t is dropped) can then be defined as:
\[
P(o \mid X) = \prod_{j=1}^{n} P\bigl(o_j \mid \{S_i\}_{i=1}^{n}, \pi\bigr) = \prod_{j=1}^{n} Sim_a(tr_j, S_j) \qquad (6)
\]
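Putting eqs. (4)-(6) together, the group observation model could be evaluated as sketched here; the per-target detection, template and Boosting similarities are assumed to be computed elsewhere, and the background score used when a target is fully occluded is a placeholder value.

```python
import numpy as np

def template_similarity(F, F_ref, V):
    """Eq. (4): visibility-weighted part-based template matching.
    F, F_ref : (m, d) part features and reference template, V : (m,) 0/1."""
    if V.sum() == 0:
        return 0.5                       # placeholder for the background score
    dists = np.linalg.norm(F - F_ref, axis=1)
    return 1.0 - np.sum(dists * V) / (2.0 * V.sum())

def appearance_similarity(sim_det, sim_template, sim_boost,
                          alphas=(0.2, 0.4, 0.4)):
    """Eq. (5): weighted fusion of detection, template and Boosting cues."""
    a1, a2, a3 = alphas
    return a1 * sim_det + a2 * sim_template + a3 * sim_boost

def group_observation(per_target_sims):
    """Eq. (6): the group likelihood is the product of the per-target
    appearance similarities Sim_a(tr_j, S_j)."""
    return float(np.prod(per_target_sims))
```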
4.3 Dynamical Model
In the particle filtering framework, besides the observation model P(ot|Xt), the other important component is the dynamical model P(Xt|Xt−1). As in [9], we also introduce interaction potentials into the dynamical model, which penalise the case in which two targets occupy the same position. Suppose G = {tri}, i = 1, ..., n, is a target group with joint state {Si} and occlusion matrix π; the dynamical model is then defined as follows:
\[
P(X_t \mid X_{t-1}) = \frac{1}{Z_c}\, U(\pi_t) \prod_{i=1}^{n} \psi_i\bigl(S_{it} \mid S_{i(t-1)}\bigr) \prod_{1 \le i < j \le n} \psi_{ij}(S_{it}, S_{jt}) \qquad (7)
\]
where Zc is a normalising factor, U(πt) is a uniform distribution taking only the two values 0 and 1, ψi is a local motion prior modelled by a Gaussian N(Sit − Si(t−1); 0, Σs), and ψij is the interaction potential between a target pair, defined as:
\[
\psi_{ij} = \exp\!\left( - \beta\, \frac{\| \hat{S}_{i(t-1)} - \hat{S}_{j(t-1)} \|^{2}}{\| S_{it} - S_{jt} \|^{2}} \right) \qquad (8)
\]
where $\hat{S}_{i(t-1)}$ and $\hat{S}_{j(t-1)}$ are the optimal states of targets tri and trj at frame #t-1, and β is a constant factor (usually set to 0.1).
4.4 Tracking Separated-Target
A separated target is tracked by the ISPF (Incremental Self-tuning Particle Filtering) [16] tracker. Its observation model and dynamical model can be easily
derived, because they are special cases of those of the target-group. The main steps of ISPF are: 1) particles are incrementally drawn; 2) each random particle is tuned to the best state in its neighborhood according to similarity feedback; 3) state optimization terminates if the maximum similarity score over all tuned particles satisfies a TSD (target similarity distribution) modeled as an online Gaussian, or if the maximum number of particles Ns is reached. However, there are some differences between the ISPF tracker in this paper and that in [16]. First, we train two global pose estimators for all targets, not one for each. Second, the observation models are very different: we introduce an online Boosting classifier into the appearance model, and instead of IPCA (incremental principal component analysis) we represent target appearance with a time-varying feature template updated every frame with a constant learning rate, because it is computationally expensive to maintain and update one subspace for each part of each target. Details of how ISPF works can be found in [16].
4.5 Tracking Target-Group
With the defined observation model (6) and dynamical model (7), the Bayesian posterior density (1) can be approximated by a particle-based density:
\[
P(X_t \mid O_t) \approx c \cdot U(\pi_t)\, P(o_t \mid X_t) \prod_{1 \le i < j \le n} \psi_{ij}(S_{it}, S_{jt}) \sum_{k=1}^{N_g} \prod_{i=1}^{n} \psi_i\bigl(S_{it} \mid S^{(k)}_{i(t-1)}\bigr) \qquad (9)
\]
where c is a normalising factor, Ng is the number of particles, and n is the number of targets in the target-group. The next problem is how to efficiently sample particles from this posterior, which lies in a high-dimensional state space. Markov chain Monte Carlo (MCMC) sampling [11] is an effective solution to this problem. MCMC methods work by defining a Markov chain over the space of configurations X such that the stationary distribution of the chain is equal to a target distribution P(X|O). The Metropolis-Hastings (MH) algorithm [22] is one way to simulate from such a chain. It works as follows:

Algorithm 1. Metropolis-Hastings Algorithm
Start with a random particle X^(1), then iterate for k = 1, ..., N:
1. Sample a new particle X′ from the proposal density Q(X′; X^(k));
2. Calculate the acceptance ratio:
\[
a = \min\!\left\{ \frac{P(X' \mid O)\, Q(X^{(k)}; X')}{P(X^{(k)} \mid O)\, Q(X'; X^{(k)})},\; 1 \right\} \qquad (10)
\]
3. Accept X′ and set X^(k+1) = X′ with probability a, or otherwise set X^(k+1) = X^(k).

The proposal density used here is a product of a uniform distribution over the occlusion variables and the joint local motion prior centred on the previous optimal states:
\[
Q\bigl(X'_t; X^{(k)}_t\bigr) = Q(X'_t) \sim U(\pi_t) \prod_{i} \psi_i\bigl(S_{it} \mid \hat{S}_{i(t-1)}\bigr). \qquad (11)
\]
76
M. Li et al.
particles for each target if possible (at least the head-should part is visible), so as to make the joint-state particle “smart”. The main steps of our data-driven MCMC-based particle filtering are given in Algorithm 2 . Algorithm 2. Data-Driven MCMC-based Particle Filtering ˆ t−1 , propagate it with the 1. Start with previous optimal state configuration X local joint motion prior density, then iterate for k = 1, ..., Ng – (a) Sample a new particle X from the proposal density (11) – (b) Calculate the visibility vector V i for each target tri according to the joint states {Si } and current occlusion matrix π . : for each target tri , tune its pose {S } to {S } by a global – (c) Refine X to X i i pose estimator (the full-body pose estimator takes priority) if at least the head-should part is visible. – (d) Recalculate the visibility vector for each target and compute particle ) according to the observation model (6); weight w = P (o|X – (e) Calculate posterior density (9) and the acceptance ratio (10) with tuned state X and set {X(k+1) , w(k+1) }={X , w } with probability a, or reject – (f) Accept X t t (k+1) (k+1) (k) (k) , wt }={Xt , wt }. it and set {Xt ∗
∗
t = X(k ) with w(k ) = max{w(k) }. 2. state estimation: X t t t (k) (k) 3. Resample Ng particles from {Xt , wt } and set all weights to 1/Ng
5
Experimental Results
Three sequences are used to test the robustness and efficiency of our tracking algorithm. They are the NLPR-Campus (Fig. 6, our own dataset), PETS09 [23] (Fig. 9) and PETS04 [24] (Fig. 11) sequence. Some important parameters are set as follows: the maximum number of particles Ns for separated-target is set to 100, and the number of particle Ng for target-group is set to 200. Gaussian noise covariance matrix Σ S =diag{72, 72 , 0.022}. Without taking the detection time into consideration, the MATLAB implementation of our method can run at about 1 ∼ 0.4 frames/sec on a PC with a P4 2.4 GHz CPU. Robustness and Efficiency Evaluation. Two sequences (the NLPR-Campus and PETS09 sequence) are used to test the robustness of our algorithm. The PSOF detector [19] is used to detect objects in the two sequences with the detection rate of about 0.5 ∼ 0.7, and the online learned geometrical models decrease false alarms by about 10% in each sequence. Targets are only initialized on the border areas of an image except in the first frame, and their trajectories are terminated if they move out of image borders. In the NLPR-Campus sequence, as shown in Fig 6, six objects with three of them having interactions appear in the scene. At about frame #118, obj1 (means the object with ID 1) is occluded by obj2 slightly, but occlusion reasoning is not conducted because their valid overlapping area (the intersection area of two
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues
77
Fig. 6. Tracking results of our method in the NLPR-Campus Sequence
Fig. 7. Occlusion reasoning between object 1 and 2 in the NLPR-Campus Sequence. (a) Occlusion relationship. (b) Appearance similarity scores of object 1 and 2.
Fig. 8. Particles used in the NLPR-Campus seqence. (a) and (b): random particles (blue rectangles) and estimated optimal states (yellow rectangles). (c) The curve of average number of particles per object vs frame #.
ellipses) has not exceeded a threshold (0.4). At about frame #131, obj1 and obj2 form a target-group, and thus the MCMC-based particle filtering process with occlusion reasoning is activated. As shown in Fig. 7 (b), after occlusion reasoning, the appearance similarity score of obj1 increases, this is because our part-based template matching score becomes more accurate when occlusion relationship is correctly recovered (shown in Fig. 7 (a)). At about frame #138, obj1 is totally occluded by obj2 and this occlusion last for 50 frames. At about frame #189, obj1 passes by obj2 and obj3 and reappears, our algorithm successfully tracks
78
M. Li et al.
Fig. 9. Tracking results of our method in the PETS09 sequence
Fig. 10. Particles used in the PETS09 seqence. (a) and (b): random particles (blue rectangles) and estimated optimal states (yellow rectangles). (c) The curve of average number of particles per object vs frame #.
him. Although obj1 and obj3 wear similar clothes (white T-Shirt and black trousers), ID-switching does not happen, because the online-Boosting classifiers can choose discriminative features to distinguish them. Our algorithm can not only robustly track every object, but also work efficiently. As shown in Fig. 8 (c), except in the frames in which significant inter-occlusion happens (#131 ∼ #189), our algorithm uses an average number of less than 10 particles per object. The average number of particles per object per frame throughout the sequence is 21.4. In Fig. 8 (a) and (b), some random particles are shown. It can be seen that separated-targets (obj1/obj2/obj3 at #90; obj3/obj4/obj5/obj6 at #164) often consume a very small number of particles (due to the usage of global pose estimators), while large number of particles are used by target-group (obj1-obj2 at #164) in which joint-state optimization is conducted. The PETS09 sequence is more challenging, as shown in Fig. 9. There are not only frequent inter-object occlusions (obj3-obj1 at #25; obj2-obj1 at #104; obj8-obj2 at #104; obj8-obj3 and obj6-obj4 at #163; obj6-obj1 and obj11-obj10 at #347 ), but also frequent background-object occlusions caused by a signboard (obj3 at #104/#144, and obj10 at #270). Besides, obj1, obj2 and obj3 wear
Fig. 11. Tracking results of our method in the PETS04 sequence
very similar clothes and have frequent interactions. In this complex scene, our algorithm consistently robustly tracks all objects and maintains their IDs correctly. However, the average computational cost is very low. As shown in Fig. 10 (a) and (b), only those objects in a target-group (obj2-obj7 at #131) or occluded by the signboard (obj3 at #131) consume large number of particles, while others consume several to ten particles per object. The curve of average number of particles per object vs frame No. is shown in Fig. 10 (c), and the average number of particles per object per frame is 26.4. The robustness of our algorithm shown on this sequence relies on two main factors. The first factor is the online learned Boosting classifiers. They make the detection association process robust, as well as provide part of the similarity score in the observation model. The second is the MCMC-based joint-state optimization strategy utilized when object-pairs have significant overlapping areas. Because of using part-based appearance model and occlusion reasoning, the MCMC-based particle filtering process is very robust to partial occlusions. The efficiency of our method obviously benefits from the target clustering strategy and the online trained global pose estimators. Accuracy Evaluation. To evaluate the accuracy of our tracking algorithm, as shown in Fig. 11 the P ET S04 sequence is used, in which three objects are labeled and their groundtruth trajectory can be downloaded from [24]. We use the groundtruth object bounding boxes in the first frame to initialize targets, and no detector is used during tracking. The RMSE (root mean square error) between estimated position and groundtruth of our method is shown in Table. 1. It is about 3.9 pixels per object per frame. As comparison, the tracking accuracies of three other state-of-the-art multi-target tracking algorithms: Yang’s work [12], Qu’s work [10] and Zhang’s work [13] are also given in Table 1 (their results are reported in [13]). It can be seen that the tracking accuracy of our algorithm outperforms [12] (RMSE: 11.2 pixels) and [10] (RMSE: 6.3 pixels), and a little inferior to [13] (RMSE: 3.2 pixels). This is because in Zhang’s work [13], object’s appearance is modeled by an online learned subspace which is a little better than our template based representation. Table 1. Tracking accuracy comparison on the PETS04 sequence Algorithm Our method Yang’s work Qu’s work Zhang’s work RMSE per object per frame (pixels) 3.9 11.2 6.3 3.2
Discussion. From the above experimental results, we can obtain some useful findings: 1) Online learned Boosting classifiers plays an important role to distinguish targets with similar appearance. 2) Without camera calibration, the online learned geometrical model can effectively reduce false detections that violate perspective projection. 3) The online learned global pose estimators tremendously increase the tracking efficiency of our algorithm. 4) The MCMC-based joint-state optimization process can effectively deal with partial occlusions.
6 Conclusions
This paper proposes a Data-Driven Particle Filtering (DDPF) framework for multi-target tracking, in which online learned class-specific and instance-specific cues are used to make the tracking process robust as well as efficient. The learned cues include an online learned geometrical model for excluding detection outliers that violate geometrical constraints, global pose estimators shared by all targets for particle refinement, and online Boosting based appearance models which select discriminative features to distinguish different individuals. Targets are clustered into two categories: separated-target and target-group. A separated-target is tracked by an ISPF (Incremental Self-tuning Particle Filtering) method in a low-dimensional state space, and a target-group is tracked by MCMC-based particle filtering in joint-state space with occlusion reasoning. Experimental results demonstrate that our algorithm can automatically detect and track objects, effectively deal with partial occlusions, and achieve high efficiency and state-of-the-art tracking accuracy. Acknowledgement. This work is supported by National Natural Science Foundation of China (Grant No. 60736018, 60723005), National Hi-Tech Research and Development Program of China (2009AA01Z318), National Science Founding (60605014, 60875021).
References 1. Reid, D.: An algorithm for tracking multiple targets. IEEE Trans. Automatic Control 24, 84–90 (1979) 2. Bar-Shalom, Y., Fortmann, T., Scheffe, M.: Joint probabilistic data association for multiple targets in clutter. In: Proc. Conf. Information Sciences and Systems (1980) 3. Cox, I., Leonard, J.: Modeling a dynamic environment using a bayesian multiple hypothesis approach. Artificial Intelligence 66, 311–344 (1994) 4. Rasmussen, C., Hager, G.: Probabilistic data association methods for tracking complex visual objects. IEEE Trans. on PAMI 23, 560–576 (2001) 5. Khan, Z., Balch, T., Dellaert, F.: Multitarget tracking with split and merged measurements. In: Proc. of CVPR 2005 (2005) 6. Isard, M., MacCormick, J.: Bramble: A bayesian multipleblob tracker. In: Proc. of ICCV 2001 (2001) 7. Okuma, K., Taleghani, A., de Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: Multitarget detection and tracking. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 28–39. Springer, Heidelberg (2004)
8. Vermaak, J., Doucet, A., Perez, P.: Maintaining multi-modality through mixture tracking. In: Proc. of ICCV 2003 (2003) 9. Yu, T., Wu, Y.: Decentralized multiple target tracking using netted collaborative autonomous trackers. In: Proc. of CVPR 2005 (2005) 10. Qu, W., Schonfeld, D., Mohamed, M.: Real-time interactively distributed multiobject tracking using a magnetic- inertia potential model. In: Proc. of ICCV 2005 (2005) 11. Khan, Z., Balch, T., Dellaert, F.: Mcmc-based particle filtering for tracking a variable number of interacting targets. PAMI 27, 1805–1819 (2005) 12. Yang, M., Yu, T., Wu, Y.: Game-theoretic multiple target tracking. In: Proc. of ICCV 2007 (2007) 13. Zhang, X., Hu, W., Li, W., Qu, W., Maybank, S.: Multi-object tracking via species based particle swarm optimization. In: ICCV 2009 WS on Visual Surveillance (2009) 14. Hess, R., Fern, A.: Discriminatively trained particle filters for complex multi-object tracking. In: Proc. of CVPR 2009 (2009) 15. Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier, E., Gool, L.: Robust tracking-by-detection using a detector confidence particle filter. In: ICCV 2009 (2009) 16. Li, M., Chen, W., Huang, K., Tan, T.: Visual tracking via incremental self-tuning particle filtering on the affine group. In: Proc. of CVPR 2010 (2010) 17. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. Journal of IEEE Trans. on Signal Processing 50, 174–188 (2002) 18. Vijayakumar, S., Souza, A.D., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17, 2602–2634 (2005) 19. Li, M., Zhang, Z., Huang, K., Tan, T.: Pyramidal statistics of oriented filtering for robust pedestrian detection. In: Proc. of ICCV 2009 WS on Visual Surveillance (2009) 20. Grabner, H., Bischof, H.: On-line boosting and vision. In: Proc. of CVPR 2006 (2006) 21. Dalal, N., Triggls, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 22. Hastings, W.: Monte carlo sampling methods using markov chains and their applications. Biometrika 57, 97–109 (1970) 23. PETS09dataset, http://www.cvg.rdg.ac.uk/PETS2009/a.html 24. PETS04dataset, http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
Modeling Complex Scenes for Accurate Moving Objects Segmentation Jianwei Ding, Min Li, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China {jwding,mli,kqhuang,tnt}@nlpr.ia.ac.cn Abstract. In video surveillance it is still a difficult task to segment moving objects accurately in complex scenes, since the most widely used algorithms are based on background subtraction. We propose an online and unsupervised technique to find the optimal segmentation in a Markov Random Field (MRF) framework. To improve the accuracy, color, locality, temporal coherence and spatial consistency are fused together in the framework. The models of color, locality and temporal coherence are learned online from complex scenes. A novel mixture of a nonparametric regional model and a parametric pixel-wise model is proposed to approximate the background color distribution. The foreground color distribution for every pixel is learned from neighboring pixels of the previous frame. The locality distributions of background and foreground are approximated with the nonparametric model. The temporal coherence is modeled with a Markov chain. Experiments on challenging videos demonstrate the effectiveness of our algorithm.
1 Introduction
Moving object segmentation is fundamental and important for intelligent video surveillance. The accuracy of segmentation is vital to object tracking, recognition, activity analysis, etc. Traditional video surveillance systems use stationary cameras to monitor scenes of interest, and the usual approach for this kind of system is background subtraction. In many video surveillance applications, however, the background of the scene is not stationary: it may contain waving trees, fountains, ocean ripples, illumination changes, etc. It is a difficult task to design a robust background subtraction algorithm that can extract moving objects accurately from such complex scenes. Many background subtraction methods have been proposed to address this problem. Stauffer and Grimson [1] used a Gaussian mixture model (GMM) to account for the multimodality of the color distribution of every pixel. Because of its high efficiency and good accuracy under most conditions, the GMM has been widely used, and many researchers have proposed improvements and extensions of the basic GMM method, such as [2–4]. An implicit assumption of GMM-based methods, however, is that the temporal variation of a pixel can be modeled by one or several Gaussian distributions. Elgammal et al. proposed kernel density estimation (KDE) [5] to model the color distribution, which does not have any
Fig. 1. Moving object segmentation in complex scenes using background subtraction. The first column shows the original images, the second column gives the corresponding segmentation using background subtraction.
underlying distribution. But the kernel bandwidth is difficult to choose and the computational cost is high. Sheikh and Shah [6] incorporated spatial cues together with temporal information into a single model and obtained accurate segmentation results in a Maximum a posteriori-Markov Random Field (MAP-MRF) decision framework. Some other sophisticated techniques can be found in [7–11]. However, background subtraction has a drawback: thresholding is often used to segment foreground from background. Fig. 1 gives an example of moving object segmentation in complex scenes using background subtraction. Finding an optimal segmentation is an energy minimization problem in the MAP-MRF framework. Graph cuts [12] is one of the most popular energy minimization algorithms. Boykov et al. [13] proposed an interactive image and video segmentation technique using graph cuts and obtained good results. This technique was then extended to solve other foreground/background segmentation problems. In [14], Layered Graph Cuts (LGC) were proposed to extract foreground from background layers in stereo video sequences. In [15], Liu Feng et al. proposed a method which fuses color, locality and motion information; their method can segment moving objects with limited motion, but works off-line. Sun et al. [16] proposed background cut, which combines background subtraction, color and contrast cues; their method works in real time, but a static background is assumed in the initialization phase. In [17], multiple cues were probabilistically fused for bilayer segmentation of live video. In [18], the complex background was learned by a multi-scale discriminative model. However, training samples are needed in [17, 18]. We propose an online and unsupervised technique which learns complex scenes to find the optimal foreground/background segmentation in an MRF framework. Our method combines the advantages of foreground/background modeling and the Graph cuts algorithm, but it works online and does not need prior information or training data. To improve the segmentation accuracy, color, locality, temporal coherence and spatial consistency are fused together in the framework. All these features are easy to obtain and compute. We do not use optical flow as in [15, 17], because reliable computation of optical flow is expensive and it suffers from the aperture problem. The models of color, locality and temporal coherence are learned online from complex scenes. The color feature is the dominant cue for discriminating foreground from background in our algorithm. To model the color distribution of the background image precisely and adaptively, a novel mixture
of a nonparametric regional model and a parametric pixel-wise model is presented to improve the robustness of the algorithm under different background changes. The foreground color model for every pixel in the current frame is learned from the neighboring foreground pixels in the previous frame. The locality distributions of background and foreground are also considered to reduce errors; each distribution is approximated using a nonparametric model, and an additional spatially uniform distribution is included in the model. The label of each pixel at the current moment is not independent of that at the previous time. This temporal coherence of labels is modeled with a first-order Markov chain to increase accuracy, and a simple strategy to calculate the transition probabilities is provided. The remainder of this paper is organized as follows. Section 2 describes the general model for moving object segmentation. Section 3 gives the modeling of complex scenes. Section 4 presents the experiments and results. Section 5 concludes our work.
2 General Model for Moving Object Segmentation
In this section, we will describe the general model for moving object segmentation in a MAP-MRF framework.
Fig. 2. A simple flowchart of the proposed algorithm
Moving object segmentation can be considered as a binary labeling problem. Let I be a new input image and P be the set of all pixels in I. The labeling problem is to assign a unique label l_p ∈ {foreground (= 1), background (= 0)} to each pixel p ∈ P. The maximum a posteriori estimate of L = {l_1, l_2, ..., l_{|P|}} for I can be obtained by minimizing the Gibbs energy E(L) [13]:

L^* = \arg\min_L E(L)   (1)
The basic energy model E(L) used in [13] only considered color and spatial continuity cues, which are not sufficient [14]. The locality cue of a pixel is an
important spatial constraint that can help to judge the label value. Besides, the label of a pixel at the current time is not independent of that at the previous time, so the cue of temporal coherence is also needed to increase the accuracy of the label decision. We extend the basic energy model and define E(L) as:

E(L) = \lambda_c \sum_{p \in P} E_c(l_p) + \lambda_s \sum_{p \in P} E_s(l_p) + \lambda_t \sum_{p \in P} E_t(l_p) + \sum_{p \in P} \sum_{q \in N_p} E_N(l_p, l_q)   (2)
where \lambda_c, \lambda_s, and \lambda_t are factors that measure the weights of the color term, the locality term, and the temporal coherence term respectively, and N_p is the 4-connected or 8-connected neighborhood of pixel p. E_c(l_p), E_s(l_p), E_t(l_p), and E_N(l_p, l_q) are described below.

Color term. E_c(l_p) is the color term of pixel p, denoting the cost of the color vector c_p when the label is l_p. It is defined as the negative logarithm of the color likelihood:

E_c(l_p) = -\log Pr(c_p \mid l_p)   (3)

Locality term. E_s(l_p) is the locality term of pixel p, denoting the cost of the locality vector s_p when the label is l_p. It is defined as the negative logarithm of the locality likelihood:

E_s(l_p) = -\log Pr(s_p \mid l_p)   (4)

Temporal coherence term. E_t(l_p) is the temporal coherence term of pixel p, denoting the cost of label l_p at the current time T when the label was l_p^{T-1} at the previous time T-1. We use a first-order Markov chain to model the temporal coherence and define E_t(l_p) as:

E_t(l_p) = -\log Pr(l_p^T \mid l_p^{T-1})   (5)
Spatial consistence term. Neighboring pixels exhibit spatial consistence: they are more likely to have similar colors and the same labels. E_N(l_p, l_q) conveys the spatial consistence between neighboring pixels p and q when their labels are l_p and l_q respectively. Similar to [13, 17], E_N(l_p, l_q) is expressed as an Ising term:

E_N(l_p, l_q) = \delta_{l_p \neq l_q} \cdot \frac{\epsilon + \exp(-\beta d_{pq}^2)}{1 + \epsilon}   (6)

where \delta_{l_p \neq l_q} is a Kronecker delta function and \epsilon is a positive constant. \beta is a parameter set to \beta = (2 \langle \| c_p - c_q \|^2 \rangle)^{-1}, where \langle \cdot \rangle denotes the expectation over color samples of neighboring pixels in an image, and d_{pq} denotes the color difference between pixels p and q. To reduce the computation cost, \epsilon can be set to zero and \beta taken as a constant, as in [13]. A simpler version of the Ising term, which ignores the color difference, can be seen in [6]. Among the four energy terms, only the spatial consistence term can be calculated directly from the new input image. The other three terms require the color likelihood, the locality likelihood, and the label transition probabilities, which are obtained by constructing a color model, a locality model, and a temporal coherence model. The three models are learned from the complex scenes and updated online.
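As a concrete illustration, the following minimal Python/NumPy sketch (not the authors' C++ implementation) evaluates the contrast-sensitive cost of Eq. (6) for horizontally neighboring pixel pairs; the function name and the default value of eps are illustrative assumptions, and vertical pairs would be handled analogously.

import numpy as np

def pairwise_cost(img, eps=0.0, beta=None):
    # Contrast-sensitive Ising cost of Eq. (6) for horizontal neighbor pairs.
    # img: H x W x 3 float array of colors; returns an H x (W-1) array giving
    # E_N(l_p, l_q) for the case l_p != l_q (the cost is zero when labels agree).
    d2 = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)   # d_pq^2
    if beta is None:
        beta = 1.0 / (2.0 * d2.mean() + 1e-12)              # beta = (2<||c_p - c_q||^2>)^-1
    return (eps + np.exp(-beta * d2)) / (1.0 + eps)

Setting eps = 0 and taking beta as a fixed constant reproduces the simplified form used in [13].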
At the beginning of the algorithm, an initial background model is constructed using the color information of the input images. The background model is updated quickly with high learning parameters until it is stable. After initialization, the flowchart of the algorithm is shown in Fig. 2. At each time T, the first step is to model the scene and calculate the energy E(L). The second step is to minimize the energy E(L) and output the segmentation; to minimize E(L), we use the implementation of [19]. The third step is to update the scene model with slow learning parameters.
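The per-frame loop of Fig. 2 can be summarized as below. This is only a schematic Python sketch: scene_model and min_cut_solver are placeholders standing in for the scene models of Section 3 and the max-flow code of [19], and the slow learning rate value is an assumed example, not a value from the paper.

def segment_stream(frames, scene_model, min_cut_solver, slow_rate=0.005):
    # Schematic per-frame loop: model the scene, minimize E(L), update the model.
    masks = []
    for frame in frames:
        energy = scene_model.build_energy(frame)            # step 1: compute E(L)
        labels = min_cut_solver(energy)                      # step 2: minimize E(L) -> segmentation
        scene_model.update(frame, labels, rate=slow_rate)    # step 3: slow on-line model update
        masks.append(labels)
    return masks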
3 Modeling the Complex Scenes
The first and most important step is modeling the scene. As complex scenes may contain dynamic textures or various illumination changes, color, locality, temporal coherence, and spatial consistence are fused together to increase the accuracy of segmentation. The spatial consistence model defined by Eq. (6) can be calculated directly on the input image, so we only have to construct the color model, the locality model, and the temporal coherence model, which are described below. For simplicity, the update of the scene model is also included in this section.

3.1 Color Model
The color term is the dominant one among the four terms in Eq. (2). Different color features can be used in our algorithm, such as RGB, HSV, or Lab. To discriminate foreground from background effectively, we choose the RGB color features without color transformation. Let c_p = (R_p, G_p, B_p)^T be the color vector of each pixel p ∈ P. The modeling of the color distribution includes three parts: modeling the background color distribution, modeling the foreground color distribution, and updating the color model.

Modeling the background color distribution. In [6], background modeling methods are classified into two categories: methods that use a pixel-wise model and methods that use a regional model. The pixel-wise model is more precise than the regional model for measuring the probability of pixel p belonging to the background, but it is sensitive to spatial noise, illumination change, and small movements of the background. To model the color likelihood of each pixel p belonging to the background, we define it as a mixture of a regional model and a pixel-wise model:

Pr(c_p \mid l_p = 0) = \alpha_c Pr_{region}(c_p \mid l_p = 0) + (1 - \alpha_c) Pr_{pixel}(c_p \mid l_p = 0)   (7)

where \alpha_c ∈ [0, 1] is a mixing factor, Pr_{region}(c_p \mid l_p = 0) is the probability of observing c_p in the neighboring region of pixel p, and Pr_{pixel}(c_p \mid l_p = 0) is the probability of observing c_p at pixel p. The local spatial and temporal color information is used coherently in Eq. (7). Fig. 3 gives a comparison of the three models. Our model can approximate the background color distribution precisely and is also robust to various kinds of spatial noise.
Fig. 3. Comparison of the three models. The three columns show the color distributions of the pixel marked in red in the original image, using the pixel-wise model, the regional model, and the proposed mixture model, respectively.
The regional background color model can be learned from the local neighboring spatio-temporal volume V_p centered at pixel p with a nonparametric method:

Pr_{region}(c_p \mid l_p = 0) = \frac{1}{|V_p|} \sum_{c_m \in V_p} K_H(c_p - c_m)   (8)

where K_H(x) = |H|^{-1/2} K(H^{-1/2} x), K(x) is a d-variate kernel function, and H is the bandwidth matrix. We select the commonly used Gaussian kernel, so

K_H(x) = |H|^{-1/2} (2\pi)^{-d/2} \exp\big(-\tfrac{1}{2} x^T H^{-1} x\big)   (9)

To reduce the computation cost, the bandwidth matrix H can be assumed to have the form H = h^2 I. The volume V_p is defined as V_p = \{c_t(u, v) \mid |u - i| \le l, |v - j| \le l, T - T_d < t < T\}, where c_t(u, v) is the color vector of the pixel at the u-th row and v-th column of the image taken at time t, and T_d is the period. Pixel p is at the i-th row and j-th column of the current input image at time T, and l is a constant that measures the size of the local spatial region. The computation cost of evaluating Eq. (8) directly is very high because of the large sample size. Besides, it is hard to choose a proper kernel bandwidth H in high-dimensional spaces even when it has been simplified. We therefore use the binned kernel density estimator to approximate Eq. (8). The accuracy and effectiveness of the binned kernel density estimator have been proved in [20]; a similar strategy has also been used in [6, 7].
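A direct (unbinned) evaluation of Eqs. (8)–(9) with H = h²I can be sketched as follows; the paper itself uses a binned kernel density estimator [20] for speed, and the bandwidth value here is only an assumed example.

import numpy as np

def pr_region(c_p, volume, h=10.0):
    # Regional nonparametric likelihood, Eqs. (8)-(9), with H = h^2 * I and d = 3.
    # c_p: length-3 color vector; volume: (n, 3) array of color samples from V_p.
    diff = volume - c_p
    norm = (2.0 * np.pi) ** (-1.5) * h ** (-3)               # |H|^{-1/2} (2*pi)^{-d/2}
    kernels = norm * np.exp(-0.5 * np.sum(diff * diff, axis=1) / h ** 2)
    return kernels.mean()                                    # (1/|V_p|) * sum over c_m in V_p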
We use a Gaussian mixture model as the pixel-wise model for every pixel p:

Pr_{pixel}(c_p \mid l_p = 0) = \sum_{k=1}^{K} \omega_k N(c_p \mid \mu_k, \Sigma_k)   (10)
where \omega_k, \mu_k, and \Sigma_k are the weight, the mean color vector, and the covariance matrix of the k-th component respectively, and K is the number of components. To simplify computation, the covariance matrix is assumed to be of the form \Sigma_k = \sigma_k^2 I.

Modeling the foreground color distribution. The color likelihood Pr(c_p \mid l_p = 1) of each pixel p belonging to the foreground at time T can be learned adaptively from the neighboring foreground pixels in the previous frame at time T-1. Here there is an implicit assumption that a foreground pixel will remain in the neighboring area. Besides, at any time the probability of observing a foreground pixel of any color at pixel p is uniform. Thus, the foreground color likelihood can be expressed as a mixture of a uniform function and the kernel density function:

Pr(c_p \mid l_p = 1) = \beta_c \eta_c + (1 - \beta_c) \frac{1}{|U_p|} \sum_{c_m \in U_p} K_H(c_p - c_m)   (11)
where \beta_c ∈ [0, 1] is the mixing factor, and \eta_c is a uniform density with \eta_c(c_p) = 1/256^3. The set of neighboring foreground pixels in the previous frame is defined as U_p = \{c_t(u, v) \mid |u - i| \le w, |v - j| \le w, t = T - 1\}, where w is a constant that measures the size of the neighboring region at pixel p, and pixel p is at the i-th row and j-th column of the current image. The kernel density function is the same as in Eq. (9).

Updating the color model. The background color model is updated adaptively and online with the new input image after segmentation. For the nonparametric regional background color model, a sliding window of length l_b is maintained. For the parametric pixel-wise model, a maintenance strategy similar to [1] is used. If the new observation c_p belongs to the k-th distribution, the parameters of the k-th distribution at time T are updated as follows:

\mu_{k,T} = (1 - \rho)\mu_{k,T-1} + \rho c_p   (12)

\sigma_{k,T}^2 = (1 - \rho)\sigma_{k,T-1}^2 + \rho (c_p - \mu_{k,T})^T (c_p - \mu_{k,T})   (13)

\omega_{k,T} = (1 - \rho)\omega_{k,T-1} + \rho M_{k,T}   (14)
where \rho is the learning rate, and M_{k,T} is 1 when c_p matches the k-th distribution and 0 otherwise.
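A compact sketch of the pixel-wise update rules (12)–(14) is given below; the learning rate value is illustrative, and the test that decides whether c_p matches the k-th component is omitted.

import numpy as np

def update_gmm(c_p, weights, means, variances, k, rho=0.005):
    # On-line update of the matched k-th component, Eqs. (12)-(14), with Sigma_k = sigma_k^2 I.
    # weights: (K,), means: (K, 3), variances: (K,) float arrays; k: index of the matched component.
    means[k] = (1.0 - rho) * means[k] + rho * c_p                          # Eq. (12)
    diff = c_p - means[k]
    variances[k] = (1.0 - rho) * variances[k] + rho * float(diff @ diff)   # Eq. (13)
    match = np.zeros_like(weights); match[k] = 1.0                         # M_{k,T}
    weights[:] = (1.0 - rho) * weights + rho * match                       # Eq. (14)
    return weights, means, variances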
3.2 Locality Model
It is not sufficient to use color alone when the color of the foreground is similar to that of the background. Taking the location of moving objects into account can reduce errors. The spatial location of pixel p is denoted as s_p = (x_p, y_p)^T. As the shape of objects can be arbitrary, we use a nonparametric model similar to
[15] to approximate the spatial distributions of the background and foreground. The difference is that we take into account a spatially uniform distribution for any pixel at any location. The locality distributions of the background and foreground are defined as:

Pr(s_p \mid l_p = 0) = \alpha_s \eta_s + (1 - \alpha_s) \max_{m \in V_s} \frac{1}{2\pi\sigma^2} \exp\Big(-\frac{(s_p - s_m)^T (s_p - s_m)}{2\sigma^2}\Big)   (15)

Pr(s_p \mid l_p = 1) = \alpha_s \eta_s + (1 - \alpha_s) \max_{m \in U_s} \frac{1}{2\pi\sigma^2} \exp\Big(-\frac{(s_p - s_m)^T (s_p - s_m)}{2\sigma^2}\Big)   (16)

where \alpha_s ∈ [0, 1] is the mixing factor, and \eta_s is a uniform density with \eta_s(s_p) = 1/(M \times N), where M \times N is the image size. V_s and U_s are the sets of all background pixels and all foreground pixels in the previous frame respectively, and the parameter \sigma is the kernel bandwidth. An example of the locality distributions of background and foreground is shown in Fig. 4. From this figure we can see that pixels near the moving objects have a higher probability of belonging to the foreground, and pixels far away from the moving objects have a higher probability of belonging to the background.
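A per-pixel evaluation of Eq. (15) (Eq. (16) is identical with U_s in place of V_s) might look as follows; the mixing factor, bandwidth, and image size used as defaults are illustrative assumptions.

import numpy as np

def pr_locality(s_p, prev_locations, sigma=5.0, alpha_s=0.1, M=240, N=360):
    # Locality likelihood of Eq. (15)/(16). s_p: (x, y) of the current pixel;
    # prev_locations: (n, 2) array of background (V_s) or foreground (U_s) pixel
    # locations from the previous frame.
    uniform = 1.0 / (M * N)                                   # eta_s(s_p)
    d2 = np.sum((prev_locations - np.asarray(s_p)) ** 2, axis=1)
    peak = np.exp(-d2 / (2.0 * sigma ** 2)).max() / (2.0 * np.pi * sigma ** 2)
    return alpha_s * uniform + (1.0 - alpha_s) * peak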
3.3 Temporal Coherence Model
For any pixel p ∈ P, if the label is foreground at time T-1, it is more likely for p to remain foreground than to become background at time T. So the label of pixel p at time T is not independent of its labels at previous times. This is called the label temporal coherency.
Fig. 4. Locality distributions of background and foreground
Fig. 5. A simple example to show the temporal coherence of labels. The first image depicts the scene, the second image shows the foreground mask at time T-1, and the last image shows the foreground mask at time T. A small portion of pixels whose labels have changed are marked as red in the last image.
Fig. 5 shows an example of this coherency, which can be modeled using a Markov chain. Here, a first-order Markov chain is adopted to convey the nature of the temporal coherence. There are four (2^2) possible transitions for the first-order Markov chain model, since there are only two candidate labels for pixel p. Only two degrees of freedom remain, however, because Pr(l_p^T = 1 \mid l_p^{T-1}) = 1 - Pr(l_p^T = 0 \mid l_p^{T-1}). We denote the two free quantities as Pr_{FF} and Pr_{BB}, where Pr_{FF} is the probability of remaining foreground and Pr_{BB} is the probability of remaining background. A simple strategy to calculate Pr_{FF} and Pr_{BB} is described here; both are initialized to 0.5. At time T-2, the numbers of foreground pixels and background pixels are denoted as n_F^{T-2} and n_B^{T-2} respectively. When there are detected moving objects, n_F^{T-2} > 0 and n_B^{T-2} > 0. At time T-1, the number of pixels which remain foreground is n_F^{T-1} and the number of pixels which remain background is n_B^{T-1}. We can then approximate Pr_{FF} and Pr_{BB} at time T by:

Pr_{FF}^T = \alpha_T \cdot \frac{1}{2} + (1 - \alpha_T) \frac{n_F^{T-1}}{n_F^{T-2}}   (17)

Pr_{BB}^T = \beta_T \cdot \frac{1}{2} + (1 - \beta_T) \frac{n_B^{T-1}}{n_B^{T-2}}   (18)

where \alpha_T ∈ [0, 1] and \beta_T ∈ [0, 1] are mixing factors. If n_F^{T-2} = 0 or n_B^{T-2} = 0, Pr_{FF} and Pr_{BB} keep their previous values.
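In code, the update of the two transition probabilities amounts to the following; the default mixing factors are placeholders.

def update_transition_probs(n_f_prev2, n_f_prev1, n_b_prev2, n_b_prev1,
                            pr_ff, pr_bb, alpha_t=0.5, beta_t=0.5):
    # Eqs. (17)-(18): blend a 0.5 prior with the observed fraction of pixels
    # that kept their label; keep the previous values when a count at T-2 is zero.
    if n_f_prev2 > 0:
        pr_ff = alpha_t * 0.5 + (1.0 - alpha_t) * n_f_prev1 / n_f_prev2
    if n_b_prev2 > 0:
        pr_bb = beta_t * 0.5 + (1.0 - beta_t) * n_b_prev1 / n_b_prev2
    return pr_ff, pr_bb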
4 Experiments and Results
We tested our algorithm on a variety of videos, which include fountains, swaying trees, camera motion, etc. The algorithm was implemented in C++ on a computer with a 2.83 GHz Intel Core2 CPU and 2GB RAM, and achieved a processing speed of about 6 fps at a resolution of 240 × 360. Since GMM [1] and KDE [5] are two widely used background subtraction algorithms, we compare the performance of our method with them. To evaluate the effectiveness of the three algorithms, no morphological operations such as erosion and dilation are used in the experiments. Both qualitative and quantitative analyses are used to demonstrate the effectiveness of our approach. Fig. 6 shows a qualitative comparison on outdoor scenes with a fountain. The first column shows frames 470, 580, 680, and 850 of the video; there are moving objects in the first, third, and fourth images. The second column shows the manual segmentation, the third column the result of our method, and the fourth and fifth columns the results of GMM and KDE respectively. The difficulty of moving object extraction in this video is that the background changes continuously and violently, and it is hard for the GMM and KDE algorithms to adapt the background model to the dynamic texture. Our method works well in the presence of the fountain and gives accurate segmentation.
Fig. 6. Segmentation results in the presence of fountain. The first column shows the original images, the second column gives the manual segmentation, the third column presents the result of our method, the last two columns show the results of GMM and KDE respectively.
Fig. 7. Segmentation results in the presence of waving trees. The first column shows the original images, the second column gives the manual segmentation, the third column presents the result of our method, the last two columns show the results of GMM and KDE respectively.
Fig. 7 gives a qualitative comparison on outdoor scenes with waving trees. The video comes from [8]. Our method can detect most of the foreground pixels accurately with few false detections. Fig. 8 gives segmentations of another two challenging videos by our method. The videos come from [10]; the two original sequences contain ocean ripples and waving trees respectively.
Fig. 8. Segmentation results in complex scenes. The first column shows the original images which contain ocean ripples, the second column presents the corresponding result of our method. The third column shows the original images which contain waving trees, the fourth column presents the corresponding result of our method.
Fig. 9. Segmentation results in the presence of camera motion. The first column shows the original images, the second column gives the manual segmentation, the third column presents the result of our method, the last two columns show the results of GMM and KDE respectively.
In order to better demonstrate the performance of our algorithm, a quantitative analysis is provided here. The test video and corresponding ground-truth data come from [6]. This video contains average nominal motion of about 14 pixels.
Fig. 10. Precision and Recall comparisons of the three algorithms. The GMM is tested with different learning rates: 0.005, 0.05, 0.1. (Both panels plot the respective measure against frame number for our method, GMM with the three learning rates, and KDE.)
Fig. 9 shows a qualitative illustration of the results as compared to the ground truth, GMM, and KDE. From this figure, we can see that our method can segment moving objects accurately and has almost no false detections in the presence of camera motion. Fig. 10 shows the corresponding quantitative comparison in terms of precision and recall, which are defined as:

Precision = \frac{\text{Number of true positives detected}}{\text{Total number of positives detected}}, \qquad Recall = \frac{\text{Number of true positives detected}}{\text{Total number of true positives}}
From Fig. 10 we can see that the detection accuracy of our method, both in terms of precision and recall, is consistently higher than that of GMM and KDE; several learning rates were tested for GMM. Although the precision and recall of our method are both very high, there are still some false positives and false negatives, mainly at the edges or contours of the true objects. That is because spatial continuity information is used in the proposed method.
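Computing the two measures from a predicted foreground mask and the ground truth is straightforward; a small NumPy helper illustrates the definitions above.

import numpy as np

def precision_recall(pred, gt):
    # pred, gt: boolean H x W foreground masks.
    tp = np.logical_and(pred, gt).sum()          # true positives detected
    detected = pred.sum()                        # total positives detected
    positives = gt.sum()                         # total true positives
    precision = tp / detected if detected > 0 else 0.0
    recall = tp / positives if positives > 0 else 0.0
    return precision, recall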
5 Conclusions
In this paper, we have proposed an online and unsupervised technique which learns multiple cues to find the optimal foreground/background segmentation in an MRF framework. Our method adopts the advantages of foreground/background modeling and the graph cuts algorithm. To increase the accuracy of foreground segmentation, color, locality, temporal coherency, and spatial continuity are fused together in the framework. The distributions of the different features for background and foreground are modeled and learned separately. Our method can segment moving objects accurately in complex scenes, as demonstrated by experiments on several challenging videos. Acknowledgement. This work is supported by the National Natural Science Foundation of China (Grant No. 60736018, 60723005), the National Hi-Tech Research and Development Program of China (2009AA01Z318), and the National Science Foundation (60605014, 60875021).
References
1. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22, 747–757 (2000)
2. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model for real-time tracking with shadow detection. In: Proc. European Workshop on Advanced Video Based Surveillance Systems (2001)
3. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proc. Int. Conf. Pattern Recognition, pp. 28–31 (2004)
4. Lee, D.S.: Effective Gaussian mixture learning for video background subtraction. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 827–832 (2005)
5. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000)
6. Sheikh, Y., Shah, M.: Bayesian modeling of dynamic scenes for object detection. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 1778–1792 (2005)
7. Ko, T., Soatto, S., Estrin, D.: Background subtraction on distributions. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 276–289. Springer, Heidelberg (2008)
8. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. In: Proc. ICCV, pp. 255–261 (1999)
9. Monnet, A., Mittal, A., Paragios, N., Ramesh, V.: Background modeling and subtraction of dynamic scenes. In: Proc. ICCV, pp. 1305–1312 (2003)
10. Li, L., Huang, W., Gu, I.Y.H., Tian, Q.: Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing 13, 1459–1472 (2004)
11. Pless, R., Larson, J., Siebers, S., Westover, B.: Evaluation of local models of dynamic backgrounds. In: Proc. CVPR, pp. 73–78 (2003)
12. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 16–29. Springer, Heidelberg (2006)
13. Boykov, Y., Jolly, M.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. ICCV, pp. 105–112 (2001)
14. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Bi-layer segmentation of binocular stereo video. In: Proc. CVPR, pp. 1186–1193 (2005)
15. Liu, F., Gleicher, M.: Learning color and locality cues for moving object detection and segmentation. In: Proc. CVPR, pp. 320–327 (2009)
16. Sun, J., Zhang, W., Tang, X., Shum, H.-Y.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006)
17. Criminisi, A., Cross, G., Blake, A., Kolmogorov, V.: Bilayer segmentation of live video. In: Proc. CVPR, vol. 1, pp. 53–60 (2006)
18. Zha, Y., Bi, D., Yang, Y.: Learning complex background by multi-scale discriminative model. Pattern Recognition Letters 30, 1003–1014 (2009)
19. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
20. Hall, P., Wand, M.: On the accuracy of binned kernel estimators. J. Multivariate Analysis (1995)
Online Learning for PLSA-Based Visual Recognition
Jie Xu¹,², Getian Ye³, Yang Wang¹,², Wei Wang¹,², and Jun Yang¹,²
¹ National ICT Australia
² University of New South Wales
³ Canon Information Systems Research Australia
{jie.xu,yang.wang,jun.yang}@nicta.com.au, [email protected], [email protected]
Abstract. Probabilistic Latent Semantic Analysis (PLSA) is one of the latent topic models and it has been successfully applied to visual recognition tasks. However, PLSA models have mainly been learned in batch mode, which cannot handle data that arrives sequentially. In this paper, we propose a novel on-line learning algorithm for learning the parameters of PLSA. Our contributions are two-fold: (i) an on-line learning algorithm that learns the parameters of a PLSA model from incoming data; (ii) a codebook adaptation algorithm that can capture the full characteristics of all the features during the learning. Experimental results demonstrate that the proposed algorithm can handle sequentially arriving data that batch PLSA learning cannot cope with, and its performance is comparable with that of the batch PLSA learning on visual recognition.
1 Introduction
Visual recognition involves the task of automatically assigning visual data to semantic categories. Recently, probabilistic latent topic models such as probabilistic Latent Semantic Analysis (PLSA) [1], Latent Dirichlet Allocation (LDA) [2], and their extensions have been applied to visual recognition tasks. PLSA has been successfully employed for object categorization [3], scene recognition [4], and action recognition [5], [6]. However, PLSA has the following disadvantages. Firstly, the conventional learning for PLSA is batch learning, and hence it cannot handle data that arrives sequentially. Secondly, PLSA relies on a codebook of words for the "bag-of-words" representation. For visual recognition, the k-means clustering algorithm is usually employed to produce a codebook of visual words [3],[5],[4]. An optimal codebook can be constructed by taking into account all the training features. However, it is infeasible to apply the same algorithm to large datasets with an excessive number of features; in practice, only a subset of features is chosen for the clustering, which results in a suboptimal codebook. In order to address the aforementioned problems, we propose an on-line learning algorithm for learning the parameters of a PLSA model from the incoming data. The contributions of this paper are two-fold: (i)
an on-line learning algorithm that learns the parameters of a PLSA model from incoming data; (ii) a codebook adaptation algorithm that can capture the full characteristics of all the features during the on-line learning. Related Work. Using the proposed on-line algorithm, the trained model can adapt to new data without re-using the previous training data. Incremental EM is employed to speed up the learning of PLSA in [7]; however, that algorithm is not an on-line one, since the previous training data has to be re-used, together with the new data, for model adaptation. Previous work on on-line learning of PLSA and other latent models can be found in the text mining area [8], [9], [10]. However, all of these algorithms are proposed for text mining only, and no visual codebook or codebook adaptation is involved. The work most related to ours is the QB-PLSA proposed by Chien and Wu [10]. QB-PLSA employs QB estimation [11] to enable the on-line learning of PLSA, representing the priors of the PLSA parameters using Dirichlet densities. Our algorithm, on the other hand, employs on-line EM for stochastic approximation of PLSA, and makes no assumptions about the parameter distributions. To the best of our knowledge, our work proposes the first on-line PLSA algorithm for vision problems. We apply the proposed algorithm to two visual recognition tasks, i.e., scene recognition and action recognition. Scene recognition refers to the classification of the semantic categories of places using visual information collected from scene images. The work in the literature can be classified into one of the two following approaches. The first approach considers global or local feature-based image representations [12],[13],[14], whereas the second relies on intermediate concept-based representations [15],[16],[4],[17],[18]. Among the work from the second category, some models learn the concepts in a supervised manner [15],[17], while the others learn without supervision [16],[4],[18]. PLSA-based scene recognition belongs to the second category, and it has been shown by Bosch et al. [4] to outperform previous models [15],[16]. Action recognition is an important task in video surveillance, and numerous techniques have been proposed for it in the literature. A popular approach is based on the analysis of motion cues in human actions [19],[20]. Despite its popularity, the robustness of this type of approach depends heavily on the tracking systems, especially under complex situations in videos. To handle complex situations, another type of approach attempts to build dynamic human action models [21],[22],[23]; however, the large number of model parameters results in high model complexity. To avoid this, local feature-based video representations are employed: salient features are extracted from the videos, and the videos are then converted into intermediate representations based on these features before being fed into different classifiers [24],[25],[26],[27]. PLSA has been employed for action recognition based on the "bag-of-words" representation using spatio-temporal interest points [24], and it has been shown to perform well on this task [5]. Our experimental results show that the proposed algorithm can handle large datasets that batch PLSA learning cannot cope with, and its performance is comparable with that of the batch PLSA learning on visual recognition.
The rest of the paper is organized as follows: Section 2 reviews the batch learning algorithm for PLSA-based visual recognition. Our proposed algorithm is described in Section 3, followed by the experimental results in Section 4. Our final conclusions are summarized in Section 5.
2 PLSA-Based Visual Recognition
The PLSA was originally proposed for text modeling, in which words are considered as the elementary building units of documents. Rather than modeling documents as sets of words, PLSA models each document as a mixture of latent topics, and each topic is characterized by a distribution over words. In the context of visual recognition, the visual features are considered as words. Denote z_k ∈ Z = {z_1, z_2, ..., z_K} as the latent topic variable that associates the words and documents, and X as the training co-occurrence matrix consisting of co-occurrence counts of (d_i, w_j), collected from N visual documents D = {d_1, d_2, ..., d_N}, based on a codebook of V visual words W = {w_1, w_2, ..., w_V}. The joint probability of the visual document d_i, the hidden topic z_k, and the visual word w_j is assumed to have the following form [1]:

p(d_i, z_k, w_j) = p(w_j|z_k) p(z_k|d_i) p(d_i),   (1)

where p(w_j|z_k) represents the probability of the word w_j occurring in the hidden topic z_k, p(z_k|d_i) is the probability of the topic z_k occurring in the document d_i, and p(d_i) can be considered as the prior probability of d_i. The joint probability of the observation pair (d_i, w_j) can be generated by marginalizing over all the topic variables z_k:

p(d_i, w_j) = p(d_i) \sum_k p(z_k|d_i) p(w_j|z_k).   (2)
Let n(d_i, w_j) be the number of occurrences of the word w_j in the document d_i; these counts constitute the co-occurrence matrix X. The prior probability p(d_i) can be computed as p(d_i) \propto \sum_j n(d_i, w_j). The parameters of PLSA then contain multinomial distributions over the K latent variables, such that \sum_{j=1}^{V} p(w_j|z_k) = 1 and \sum_{k=1}^{K} p(z_k|d_i) = 1. We represent these KV + KN probabilities using \theta = \{\{p(w_j|z_k)\}_{j,k}, \{p(z_k|d_i)\}_{k,i}\}. The log likelihood of X given \theta can be written as

L(X|\theta) = \sum_{i=1}^{N} \sum_{j=1}^{V} n(d_i, w_j) \log p(d_i, w_j).   (3)

A maximum likelihood estimate of \theta is obtained by maximizing the log likelihood using an Expectation Maximization (EM) algorithm [28]. Starting from an initial estimate of \theta, the expectation (E) step of the algorithm computes the following expectation function:

Q(\hat{\theta}|\theta) = E_Z[\log P(X, Z|\hat{\theta}) \mid X, \theta] \propto \sum_{i=1}^{N} \sum_{j=1}^{V} n(d_i, w_j) \sum_{k=1}^{K} p(z_k|d_i, w_j) \log[p(z_k|d_i) p(w_j|z_k)].
The maximization (M) step then calculates the new estimate \hat{\theta} by maximizing the expectation function Q(\hat{\theta}|\theta) computed in the E-step. The EM algorithm alternates between both steps until the log likelihood converges. It is shown in [1] that the E-step is equivalent to computing the posterior probability p(z_k|d_i, w_j), given the estimated p(w_j|z_k) and p(z_k|d_i):

p(z_k|d_i, w_j) = \frac{p(w_j|z_k) p(z_k|d_i)}{\sum_{l=1}^{K} p(w_j|z_l) p(z_l|d_i)},   (4)

while the M-step corresponds to the following updates, given the computed p(z_k|d_i, w_j) from the E-step:

p(w_j|z_k)^{new} = \frac{\sum_{i=1}^{N} n(d_i, w_j) p(z_k|d_i, w_j)}{\sum_{m=1}^{V} \sum_{i=1}^{N} n(d_i, w_m) p(z_k|d_i, w_m)},   (5)

p(z_k|d_i)^{new} = \frac{\sum_{j=1}^{V} n(d_i, w_j) p(z_k|d_i, w_j)}{\sum_{l=1}^{K} \sum_{j=1}^{V} n(d_i, w_j) p(z_l|d_i, w_j)}.   (6)
In (6), the denominator can also be written as n(d_i), i.e., the total number of words occurring in d_i. During the inference stage, given a testing visual document d_test, the topic-based intermediate representation p(z_k|d_test) is computed using the "fold-in" heuristic described in [1]. The heuristic employs the EM algorithm in the same way as in the learning, while fixing the values of the coefficients p(w_j|z_k) obtained from the training stage.
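For reference, a dense NumPy sketch of the batch E- and M-steps, Eqs. (4)–(6), is given below. This is an illustrative re-implementation under our own conventions, not the Matlab code of [31]; X is the N × V co-occurrence matrix and K the number of topics.

import numpy as np

def plsa_em(X, K, iters=50, seed=0):
    # Batch PLSA EM. Returns p(w|z) with shape (K, V) and p(z|d) with shape (N, K).
    rng = np.random.default_rng(seed)
    N, V = X.shape
    p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step, Eq. (4): posterior p(z|d,w), shape (N, V, K)
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        weighted = X[:, :, None] * post                      # n(d_i, w_j) * p(z_k | d_i, w_j)
        # M-step, Eq. (5): p(w|z), normalized over words
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eq. (6): p(z|d), normalized over topics (denominator equals n(d_i))
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d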
3 Proposed Algorithm
We now present our algorithm for PLSA under the setting of on-line learning, where training data is assumed to arrive at successive time slices. Denote D^{(t)} = \{d_1^{(t)}, d_2^{(t)}, ..., d_{N_t}^{(t)}\} as the data stream arriving at time t, and F^{(t)} = \{f_1^{(t)}, f_2^{(t)}, ..., f_{N_{f_t}}^{(t)}\} as the feature descriptors extracted from D^{(t)}. Let W^{(t)} = \{w_1, w_2, ..., w_{V_t}\} be the codebook of visual words at time slice t, and X^{(t)} be the co-occurrence matrix at time t. The goal of our algorithm is to learn the parameters of the PLSA model from data streams.
3.1 On-line Codebook Adaptation for PLSA
To train a PLSA model for visual recognition, a codebook of visual words is required. The k-means clustering algorithm is usually employed to produce the codebook using all the features in the training set. Niebles et al. show that a larger codebook often results in higher recognition accuracy [5]. The size of the codebook is usually proportional to the number of features. In practice, however, due to memory limitations, only a limited number of features is considered for the codebook construction when dealing with a large dataset.
Algorithm 1. On-line Codebook Adaptation
1: INPUTS:
2:   F^{(t)} = {f_1^{(t)}, f_2^{(t)}, ..., f_{N_{f_t}}^{(t)}} - feature vectors in the data stream at time slice t
3:   W^{(t-1)} = {w_1, w_2, ..., w_{V_{t-1}}} - the codebook from time slice (t-1)
4:   Σ^{(t-1)} = {(μ_1, σ_1), (μ_2, σ_2), ..., (μ_{V_{t-1}}, σ_{V_{t-1}})} - the distance statistics for W^{(t-1)}
5:   Σ̄ = {(μ̄, σ̄²)} - the mean distance statistics computed from the initial set
6:   β - the learning factor
7: OUTPUTS:
8:   W^{(t)} = {w_1, w_2, ..., w_{V_t}} - updated codebook for time slice t
9:   Σ^{(t)} = {(μ_1, σ_1), (μ_2, σ_2), ..., (μ_{V_t}, σ_{V_t})} - the distance statistics for W^{(t)}
10: for each time slice t ≥ 1 do
11:   V_t ← V_{t-1}
12:   for each feature f_n^{(t)} do
13:     choose a codeword w_k, where
14:       k = arg min_i ( ||f_n^{(t)} - w_i||_2^2 / σ_i^2 )
15:     d ← ||f_n^{(t)} - w_k||_2^2
16:     if d ≤ μ_k + 2.5 σ_k then
17:       μ_k ← μ_k + β(d - μ_k)
18:       σ_k^2 ← σ_k^2 + β(d - μ_k)(d - μ_k)
19:       w_k ← w_k + β(f_n^{(t)} - w_k)
20:     else
21:       V_t ← V_t + 1, w_{V_t} ← f_n^{(t)}, μ_{V_t} ← μ̄, σ_{V_t}^2 ← σ̄²
22:     end if
23:   end for
24: end for
The produced codebook is not necessarily optimal as it only considers a subset of the features. In order to construct a codebook that captures the full characteristics of all the features in the data, we present a codebook adaptation algorithm for the codebook construction. Given a feature vector f_n^{(t)} and the corresponding codeword w_k, denote \delta as the difference vector between f_n^{(t)} and w_k, i.e., \delta = f_n^{(t)} - w_k. We simply assume that each dimension of \delta follows an i.i.d. zero-mean Gaussian distribution, and hence the squared Euclidean distance between f_n^{(t)} and w_k can be approximately modeled by a Gaussian distribution [29]. The statistics \mu_k and \sigma_k are the mean and standard deviation of the squared distances between w_k and all the features assigned to w_k. The codebook adaptation algorithm is presented in Algorithm 1. As shown in Algorithm 1, the inputs include the current feature set F^{(t)}, the previous codebook W^{(t-1)}, and the distance statistics \Sigma^{(t-1)} for W^{(t-1)}, which consist of the means and variances of the squared distances between the features and the corresponding codewords. During the adaptation process, each feature chooses one codeword to update. For each feature vector f_n^{(t)}, we select the codeword w_k that minimizes \|f_n^{(t)} - w_i\|_2^2 / \sigma_i^2 over all codewords w_i. Let d be the squared
distance between f_n^{(t)} and w_k. If the difference between d and \mu_k is within 2.5 \sigma_k, both w_k and (\mu_k, \sigma_k) are updated using f_n^{(t)}; the updating factor \beta is set to a flat rate (e.g., \beta = 0.05). If no match is found between f_n^{(t)} and any of the codewords, a new codeword is initiated using f_n^{(t)}, and the related distance statistics are initiated using \bar{\mu} and \bar{\sigma}^2, which are the corresponding mean values computed from the initial training set.
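The adaptation loop can be sketched in a few lines of Python (an illustrative transcription of Algorithm 1, not the authors' code). Here mu and var hold the running mean and variance of the squared distances; the variance update mirrors line 18 of the listing as written, although an exponentially decayed update is an equally plausible reading.

import numpy as np

def adapt_codebook(features, codewords, mu, var, mu_bar, var_bar, beta=0.05):
    # features: iterable of D-dim vectors from F^(t); codewords/mu/var are lists
    # describing W^(t-1) and Sigma^(t-1); mu_bar/var_bar come from the initial set.
    for f in features:
        d2 = np.array([np.sum((f - w) ** 2) for w in codewords])
        k = int(np.argmin(d2 / np.array(var)))                # arg min ||f - w_i||^2 / sigma_i^2
        d = d2[k]
        if d <= mu[k] + 2.5 * np.sqrt(var[k]):                # within 2.5 sigma_k of mu_k
            mu[k] += beta * (d - mu[k])
            var[k] += beta * (d - mu[k]) ** 2                 # line 18 of Algorithm 1 as written
            codewords[k] = codewords[k] + beta * (f - codewords[k])
        else:                                                 # no match: spawn a new codeword
            codewords.append(np.array(f, dtype=float))
            mu.append(mu_bar); var.append(var_bar)
    return codewords, mu, var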
3.2 On-Line EM for PLSA Learning
The learning of the PLSA model in Section 2 employs batch EM for parameter estimation. However, batch EM is not applicable to data that arrives sequentially. To enable the learning of PLSA in this situation, we propose an on-line learning algorithm for PLSA. The idea of the proposed algorithm is to update the model parameter \theta using D^{(t)} received at each time slice. For a better understanding of the algorithm, we begin the on-line learning using the notation for batch learning in Section 2. To enable on-line learning for PLSA, we first define two statistics n(d, w_j)_{z_k} and n(d, w)_{z_k} as follows:

n(d, w_j)_{z_k} = \sum_{i=1}^{N} n(d_i, w_j) p(z_k|d_i, w_j),  j \in \{1, 2, ..., V\}, k \in \{1, 2, ..., K\},   (7)

n(d, w)_{z_k} = \sum_{j=1}^{V} \sum_{i=1}^{N} n(d_i, w_j) p(z_k|d_i, w_j),  k \in \{1, 2, ..., K\}.   (8)
Based on the above definitions, (5) in the M-step can be re-formulated as:

p(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) p(z_k|d_i, w_j)}{\sum_{m=1}^{V} \sum_{i=1}^{N} n(d_i, w_m) p(z_k|d_i, w_m)} = \frac{n(d, w_j)_{z_k}}{n(d, w)_{z_k}}.   (9)
The main idea of our proposed on-line EM is to replace the values of n(d, w_j)_{z_k} and n(d, w)_{z_k} with weighted mean values. We define the weighted mean for n(d, w)_{z_k} at time t as follows:

n(d, w)^{(t)}_{z_k} = \eta(t) \sum_{\tau=1}^{t} \Big( \big( \prod_{s=\tau+1}^{t} \lambda(s) \big) \sum_{i=1}^{N_\tau} \sum_{j=1}^{V} n(d_i^{(\tau)}, w_j) p^{(\tau)}(z_k|d_i^{(\tau)}, w_j) \Big)   (10)

where \eta(t) = \big( \sum_{\tau=1}^{t} \prod_{s=\tau+1}^{t} \lambda(s) \big)^{-1}, and p^{(t)}(z_k|d_i^{(t)}, w_j) represents the posterior probability computed using the visual document stream D^{(t)} and the codebook W^{(t)}. Here, the parameter \lambda(s) (0 \le \lambda(s) \le 1, s = 1, 2, 3, ...) is a time-dependent decaying factor that determines the contribution of the data stream at each time slice. Equation (10) can be simplified as (please refer to the appendix for details of the derivation):

n(d, w)^{(t)}_{z_k} = (1 - \eta(t)) n(d, w)^{(t-1)}_{z_k} + \eta(t) \Big[ \sum_{i=1}^{N_t} \sum_{j=1}^{V} n(d_i^{(t)}, w_j) p^{(t)}(z_k|d_i^{(t)}, w_j) \Big].   (11)
Algorithm 2. On-line EM Algorithm for PLSA Learning
1: INITIAL STAGE
2:   estimate n(d, w)^{(0)}_{z_k}, n(d, w_j)^{(0)}_{z_k} from the initial training set
3:   construct the initial codebook W^{(0)} using F^{(0)} from the initial training set
4: ON-LINE LEARNING STAGE
5: for each image stream at time slice t do
6:   Codebook Adaptation:
7:     compute W^{(t)} using F^{(t)} and W^{(t-1)} based on Algorithm 1
8:     compute X^{(t)} = {n(d_i^{(t)}, w_j)} with W^{(t)} and F^{(t)}
9:   On-line EM Parameter Estimation:
10:    initialize p^{(t)}(w_j|z_k), p^{(t)}(z_k|d_i^{(t)}) based on W^{(t)}
11:    for each EM iteration n do
12:      E-step:
13:        compute p^{(t)}(z_k|d_i^{(t)}, w_j) using p^{(t)}(w_j|z_k) and p^{(t)}(z_k|d_i^{(t)}) with (4)
14:      M-step:
15:        compute n(d, w)^{(t)}_{z_k} using X^{(t)}, n(d, w)^{(t-1)}_{z_k}, and p^{(t)}(z_k|d_i, w_j) with (11)
16:        compute n(d, w_j)^{(t)}_{z_k} using X^{(t)}, n(d, w_j)^{(t-1)}_{z_k}, and p^{(t)}(z_k|d_i, w_j) with (12)
17:        compute p^{(t)}(w_j|z_k) using n(d, w)^{(t)}_{z_k} and n(d, w_j)^{(t)}_{z_k} with (13)
18:        compute p^{(t)}(z_k|d_i^{(t)}) using X^{(t)} and p^{(t)}(z_k|d_i, w_j) with (6)
19:      Check convergence:
20:        compute L^{(n)}(X^{(t)}|θ) using p^{(t)}(w_j|z_k) and p^{(t)}(z_k|d_i) with (3)
21:        if the new likelihood L^{(n)} satisfies the convergence condition |(L^{(n)} - L^{(n-1)})/L^{(n-1)}| < ε then
22:          terminate EM iteration
23:        end if
24:    end for
25: end for
Similarly, n(d, w_j)^{(t)}_{z_k} can also be derived as:

n(d, w_j)^{(t)}_{z_k} = (1 - \eta(t)) n(d, w_j)^{(t-1)}_{z_k} + \eta(t) \Big[ \sum_{i=1}^{N_t} n(d_i^{(t)}, w_j) p^{(t)}(z_k|d_i^{(t)}, w_j) \Big].   (12)

Using (11) and (12), the on-line estimate of p(w_j|z_k) at time t can be written as follows:

p^{(t)}(w_j|z_k) = \frac{n(d, w_j)^{(t)}_{z_k}}{n(d, w)^{(t)}_{z_k}}.   (13)
We can see from (11) and (12) that the values of n(d, w)^{(t)}_{z_k} and n(d, w_j)^{(t)}_{z_k} are obtained by combining the previous values n(d, w)^{(t-1)}_{z_k} and n(d, w_j)^{(t-1)}_{z_k} with the current data D^{(t)} using different weights. The weights are controlled by the normalization coefficient \eta(t) (1/t \le \eta(t) \le 1), which is used as a learning rate. Sato et al. pointed out in [30] that the on-line EM is a stochastic
approximation. To guarantee that the estimate converges to a local maximum of the likelihood function of the data, \eta(t) must satisfy the following conditions:

\lim_{t \to \infty} \eta(t) = 0, \qquad \sum_{t=1}^{\infty} \eta(t) = \infty, \qquad \sum_{t=1}^{\infty} \eta^2(t) < \infty.   (14)
We employ the (1/t)-schedule as it is designed to match these conditions. Under this schedule, \eta(t) is set to 1/t, which means each stream is weighted equally, i.e., \lambda(s) = 1. To combine this on-line algorithm with the codebook adaptation in Section 3.1, \eta(t) is set to 1 for any newly created codeword w_j, since there is no past experience for the new codeword to learn from. An overview of the on-line learning algorithm with codebook adaptation is presented in Algorithm 2. The initial parameter estimates n(d, w_j)^{(0)}_{z_k} and n(d, w)^{(0)}_{z_k} are estimated from an initial training set D^{(0)} using the standard EM of PLSA. During the on-line learning stage, our algorithm performs the codebook adaptation using F^{(t)} from the data stream D^{(t)} at each time slice t, and then conducts the on-line learning based on the updated codebook W^{(t)}. As a result, the proposed algorithm can automatically and adaptively learn from data streams over the entire learning life.
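A sketch of one on-line update of the sufficient statistics and of p(w|z), following Eqs. (11)–(13) with the (1/t)-schedule, is shown below; the handling of newly created codewords (where η is set to 1) and the exact indexing of t relative to the initial batch are glossed over, and all names are our own.

import numpy as np

def online_m_step(X_t, post_t, n_dw_z, n_dwj_z, t):
    # X_t: (N_t, V) counts of the current stream; post_t: (N_t, V, K) posteriors
    # p^(t)(z_k | d_i, w_j) from the E-step, Eq. (4).
    # n_dw_z: (K,) and n_dwj_z: (V, K) running statistics from time t-1.
    eta = 1.0 / t                                             # (1/t)-schedule, lambda(s) = 1
    weighted = X_t[:, :, None] * post_t                       # n(d_i, w_j) p(z_k | d_i, w_j)
    n_dwj_z = (1.0 - eta) * n_dwj_z + eta * weighted.sum(axis=0)        # Eq. (12)
    n_dw_z = (1.0 - eta) * n_dw_z + eta * weighted.sum(axis=(0, 1))     # Eq. (11)
    p_w_z = (n_dwj_z / (n_dw_z[None, :] + 1e-12)).T           # Eq. (13), shape (K, V)
    return n_dw_z, n_dwj_z, p_w_z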
4 Experiments
We evaluate the performance of the proposed algorithm for both scene recognition and human action recognition. All the experiments are conducted on a PC with an Intel 2.09 GHz Core Duo CPU and 1GB of RAM. We use the Matlab implementation of the PLSA code provided by Fergus et al. [31]. We compare the on-line learning and batch learning in the experiments. The Picasa Scene Data. To demonstrate the advantages of the proposed algorithm, we conduct our first experiment on web images. We issue queries to the Google Picasa Web Album using the provided API [32] and receive the images sequentially. All the collected images are resized to 320-by-256 pixels. Sample images can be found in the supplementary material. We repeat the experiments for 5 runs, and the average recognition performance is used for comparison. At each run, we first collect 238 images as the initial training set and 238 images as the testing set. We then collect different numbers of image streams for the on-line learning; we use 2, 3, and 4 streams. We collected 4743 images from 7 different scene categories in total at each run. For feature extraction, we use the SIFT descriptor [33] as the local feature descriptor. Feature points are densely sampled on regular grids from each image, and 80 SIFT descriptors are extracted per image. We adopt the scene recognition approach using PLSA and a k-NN classifier proposed in [4] in the experiments. For the PLSA parameters, the codebook size is set to 1300, and the number of topics is fixed to 25. The parameter k for the k-NN classification is set to 17.
Fig. 1. The performance of the on-line learning algorithm based on the Google Picasa Web Album: (a) 2 image streams, (b) 3 image streams, (c) 4 image streams. Each panel plots the recognition rate (%) over the streams, with and without codebook adaptation.
For the batch learning, we intend to learn a batch model using all the images from both the initial set and the on-line training set, which is 4505 images in total. Let |D|, |W|, and |Z| be the number of images, the codebook size, and the number of topics. The memory requirement for our Matlab-based implementation is 64 · |D||W||Z| bytes (in this experiment, |D|, |W|, and |Z| are 4505, 1300, and 25 respectively). This memory requirement exceeds the Matlab memory capacity on our computer, and therefore the conventional batch PLSA learning cannot be performed on our system. Compared with the batch learning algorithm, the proposed algorithm has the advantage of lower memory consumption since only a subset of the training data is processed at a time. At each run of the on-line learning, an initial model is learned using the batch learning algorithm on the initial training set. An initial codebook is also obtained by clustering all the features from the initial set. We then update the initial model using the image streams. After each update, the topic vectors for the images from both the initial set and the testing set are re-computed using the updated model. The category of each testing image is determined by k-NN search in the initial set. To see the effect of codebook adaptation, the experiments are conducted using the on-line learning algorithms with and without codebook adaptation respectively. Figure 1 reveals the overall trend of recognition rates throughout the on-line learning. It can be seen from Figure 1 that codebook adaptation helps to improve the recognition performance under different stream size settings. Table 1 summarizes the experimental results, and the highest accuracy of 67.20% is obtained by the algorithm with codebook adaptation.

Table 1. Summary of performance on the Picasa scene dataset. Results are reported as the average accuracies of 5 runs.
#streams                      2        3        4
Without codebook adaptation   64.37%   64.00%   64.02%
With codebook adaptation      65.54%   65.42%   67.20%
Batch Learning                N/A
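Taking the memory formula quoted above at face value, a quick back-of-the-envelope check confirms why batch learning is infeasible in this experiment:

# Memory estimate for the batch PLSA model in this experiment (formula as stated above)
mem_bytes = 64 * 4505 * 1300 * 25          # 64 * |D| * |W| * |Z|
print(mem_bytes / 2 ** 30)                 # roughly 8.7 GiB, far beyond the 1 GB of RAM available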
Fig. 2. The recognition performance on the OT scene dataset. (a) The average recognition rates for different k's of the k-NN algorithm on the testing set, based on the batch-trained PLSA model. (b) The performance of both the proposed algorithm and QB-PLSA (recognition rate (%) over the streams).
The OT Scene Dataset. In our second experiment, we compare the proposed algorithm with the conventional batch PLSA and the QB-PLSA proposed in [10], using the OT scene dataset [34]. The OT dataset contains 2688 images from 8 different scene categories: 360 coast, 328 forest, 260 highway, 308 inside city, 374 mountain, 410 open country, 292 street, and 356 tall building images. The average size of each image is 256-by-256 pixels. For feature extraction, the SIFT descriptor [33] is employed as the local feature descriptor. We again use the scene recognition approach based on PLSA and a k-NN classifier in this experiment, following the settings detailed in [35]. The number of topics for PLSA is set to 25, and the codebook size is set to 650. We optimize k for the k-NN classifier in the experiments. The experiments are repeated for 5 runs, and the average recognition performance is used for comparison. At each run, the dataset is randomly divided into 538 initial training images, 1612 on-line training images, and 538 testing images. No labeling information is used in the random division. The on-line training set is further divided into 3 streams to test the on-line learning. At each run of the batch learning, a PLSA model is learned using all the images from both the initial set and the on-line training set, and tested on the testing set. The average performance is used for comparison. Figure 2a depicts the average recognition rate for different k's of the k-NN algorithm on the testing set. It can be seen from the figure that the best recognition rate of 69.52% for the batch-learned PLSA is obtained when k = 19. This figure is consistent with the testing accuracy reported in [35], based on similar settings. As a result, k = 19 for the k-NN algorithm serves in the following sections as a baseline for the evaluation of the proposed algorithm. At each run of the on-line learning, we compare the proposed algorithm with the QB-PLSA. QB-PLSA was proposed for text mining, and no codebook adaptation is involved. To enable QB-PLSA for scene recognition, the initial codebook generated from the initial training set is used as the visual codebook
for QB-PLSA. Compared with the QB-PLSA, the proposed algorithm employs the codebook adaptation to capture the feature characteristics in the new data; it is also more flexible, as no assumption is made on the distributions of the PLSA parameters. The average recognition performances of both on-line algorithms are shown in Figure 2b, from which it can be observed that the proposed algorithm achieves better recognition performance than QB-PLSA throughout the entire learning. Table 2 summarizes the recognition accuracies of all learning algorithms, and it shows that the performance of the proposed algorithm is comparable with that of the batch-learned PLSA model.

Table 2. Summary of performance on the OT scene dataset. Results are reported as the average accuracies of 5 runs.
Algorithms        Accuracies
QB-PLSA           65.95%
Proposed          68.00%
Batch Learning    69.53%
The KTH Action Dataset. In our last experiment, we use the KTH human action dataset [36] for evaluation. The KTH dataset is a single-view video dataset of human actions, which consists of 598 action videos. Six categories of human actions can be found in this dataset: boxing, jogging, running, walking, hand waving, and hand clapping. Each action is performed several times by 25 subjects in different environments with scale changes. There is only a single action in each video. For feature extraction, we choose the separable linear filter in [24] for space-time interest point extraction in videos. Each extracted space-time patch is described as a concatenated vector of its brightness gradients. Then all the descriptors are projected to 100 dimensions using PCA. The codebook size is set to 1000 for the KTH dataset. The experiments are conducted based on the leave-one-out testing paradigm (LOO). Since the KTH dataset contains a large number of features, only a subset of features is chosen to construct the batch codebook [5], and we construct the codebook by using the videos of only three subjects.
Fig. 3. The performance of the on-line learning algorithm on the KTH action dataset: (a) 2 video streams, (b) 3 video streams, (c) 4 video streams. Each panel plots the recognition rate (%) over the streams, with and without codebook adaptation.
Table 3. Summary of performance on the KTH action dataset. Results are reported as the averages of 25 runs.
#streams                      2        3        4
Without codebook adaptation   77.70%   77.36%   77.20%
With codebook adaptation      81.00%   79.06%   79.33%
Batch Learning                80.17%
After the codebook construction, a PLSA model is learned from the videos of 24 subjects using the batch learning algorithm, and is tested against the videos of the remaining subject. We repeat the LOO batch learning for 25 runs. At each LOO run of the on-line learning, all the videos of one subject are used as the initial training set. The initial training set is used to construct the initial codebook and to train the initial model, using the batch PLSA learning algorithm. The remaining training videos form the on-line learning set for adapting the model. The on-line training set is randomly split into different numbers of video streams without using the labeling information, and we use 2, 3, and 4 video streams in the experiments. After learning from each video stream, the adapted model is tested against the testing videos. We evaluate both on-line learning algorithms, with and without codebook adaptation. A confusion matrix is computed for each test, and the recognition accuracies for the trained model are reported as the mean values of the diagonal elements of the average confusion table over 25 runs. The performance of the proposed algorithm under different stream settings is depicted in Fig. 3, which also shows that the codebook adaptation helps to improve the recognition performance for the on-line learning. Table 3 summarizes the performance of batch learning and on-line learning. It can be observed from the table that our batch-trained PLSA achieves an average accuracy of 80.17%, which is consistent with that reported in [5]. The table reveals that the performance of the proposed algorithm is comparable with that of the conventional batch algorithm on this dataset. Finally, we compare the proposed algorithm with the QB-PLSA, using 7 video streams. This time we use the adapted codebook for QB-PLSA at each stream.
Fig. 4. The performance of both the proposed algorithm and QB-PLSA (recognition rate (%) over the 7 video streams).
The accuracies of both algorithms on each of the video streams are depicted in Fig. 4. The figure shows that the proposed algorithm also outperforms the QB-PLSA on the KTH dataset.
5 Conclusions
In this paper, we have proposed an on-line learning algorithm for PLSA-based visual recognition. The proposed algorithm can automatically and adaptively learn the parameters of a PLSA model during the on-line learning. We evaluate the proposed algorithm using datasets for scene recognition and human action recognition. Experimental results demonstrate that the proposed algorithm can handle data that batch PLSA learning cannot deal with, and its performance is comparable with that of the batch PLSA learning on visual recognition. Acknowledgement. This work was done when G. Ye was affiliated with National ICT Australia. National ICT Australia (NICTA) is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Center of Excellence program. Wei Wang is supported by ARC Discovery Grant DP0987273.
References
1. Hofmann, T.: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42, 177–196 (2001)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
3. Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering objects and their location in images. In: ICCV (2005)
4. Bosch, A., Zisserman, A., Munoz, X.: Scene classification using a hybrid generative/discriminative approach. TPAMI 30, 712–727 (2008)
5. Carlos Niebles, J., Wang, H., Li, F.F.: Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. IJCV 42, 993–1022 (2008)
6. Savarese, S., DelPozo, A., Niebles, J., Fei-Fei, L.: Spatial-temporal correlations for unsupervised action classification. In: WMCV (2008)
7. Xu, J., Ye, G., Wang, Y., Herman, G., Zhang, B., Yang, J.: Incremental EM for Probabilistic Latent Semantic Analysis on Human Action Recognition. In: AVSS (2009)
8. Chou, T., Chen, M.: Using incremental PLSI for threshold-resilient online event analysis. IEEE Transactions on Knowledge and Data Engineering 20, 289 (2008)
9. AlSumait, L., Barbará, D., Domeniconi, C.: Online LDA: Adaptive Topic Model for Mining Text Streams with Application on Topic Detection and Tracking. In: ICDM (2008)
10. Chien, J., Wu, M.: Adaptive Bayesian latent semantic analysis. IEEE Transactions on Audio, Speech, and Language Processing 16, 198–207 (2008)
11. Hamilton, J.: A quasi-Bayesian approach to estimating parameters for mixtures of normal distributions. Journal of Business & Economic Statistics 9, 27–39 (1991)
12. Vailaya, A., Figueiredo, M., Jain, A., Zhang, H.: Image classification for content-based indexing. TIP 10, 117–130 (2001) 13. Wu, J., Rehg, J.: Where am I: Place instance and category recognition using spatial PACT. In: CVPR (2008) 14. Gupta, P., Arrabolu, S., Brown, M., Savarese, S.: Video Scene Categorization by 3D Hierarchical Histogram Matching. In: ICCV (2009) 15. Vogel, J., Schiele, B.: Natural scene retrieval based on a semantic modeling step. LNCS, pp. 207–215. Springer, Heidelberg (2004) 16. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: CVPR (2005) 17. Quattoni, A., Torralba, A.: Recognizing Indoor Scenes. In: CVPR (2009) 18. Li, L., Socher, R., Fei-Fei, L.: Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework. In: CVPR (2009) 19. Yilmaz, A., Shah, M.: Recognizing Human Actions in Videos Acquired by Uncalibrated Moving Cameras. In: ICCV (2005) 20. Little, J., Boyd, J.: Recognizing people by their gait: the shape of motion. Journal of Computer Vision Research 1, 1–32 (1998) 21. Ikizler, N., Forsyth, D.: Searching Video for Complex Activities with Finite State Models. In: ICCV, vol. 3 (2007) 22. Pruteanu-Malinici, I., Carin, L.: Infinite hidden Markov models for unusual-event detection in video. TIP 17, 811–822 (2008) 23. Wang, Y., Mori, G.: Max-margin hidden conditional random fields for human action recognition. In: CVPR (2009) 24. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS (2005) 25. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV, vol. 1 (2005) 26. Yeffet, L., Wolf, L.: Local Trinary Patterns for Human Action Recognition. In: ICCV (2009) 27. Gilbert, A., Illingworth, J., Bowden, R.: Fast Realistic Multi-Action Recognition using Mined Dense Spatio-temporal Features. In: ICCV (2009) 28. Dempster, A., Laird, N., Rubin, D., et al.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39, 1–38 (1977) 29. Papoulis, A., Pillai, S.U.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York (2002) 30. Sato, M.: Convergence of on-line EM algorithm. ICONIP 1, 476–481 (2000) 31. Fergus, R., Fei-Fei, L., Torralba, A.: ICCV 2005 short course on Object Recognition (2005), http://people.csail.mit.edu/fergus/iccv2005/bagwords.html 32. Google: Developer’s Guide to Picasa Web Albums Data API (2009), http://code.google.com/apis/picasaweb/overview.html 33. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004) 34. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42, 145–175 (2001) 35. Horster, E., Lienhart, R., Slaney, M.: Continuous visual vocabulary models for PLSA-based scene recognition. In: CIVR (2008) 36. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: CVPR (2004)
Emphasizing 3D Structure Visually Using Coded Projection from Multiple Projectors Ryo Nakamura, Fumihiko Sakaue, and Jun Sato Department of Computer Science and Engineering Nagoya Institute of Technology Nagoya, 466-8555, Japan
Abstract. In this paper, we propose a method for emphasizing the 3D structure of a scene visually by blending patterned lights projected from multiple projectors. The proposed method enables us to emphasize specific 3D structures of the scene without capturing images of the scene and without estimating its 3D structure. As a result, the 3D structure can be emphasized visually without any computational delay. We also propose a method for generating the patterned lights of the projectors that allows us to emphasize arbitrary 3D structures of the scene efficiently.
1 Introduction
In computer vision, much attention has been paid to obtaining 3D structure information from camera images [1–3]. This is because 3D structure information is useful for controlling various robots and machines [4] and for assisting human activities by augmenting 3D scenes [5–8]. In general, cameras, computers and displays are required for augmenting 3D scenes: the cameras and computers are required for recovering 3D structure information, and computers and displays are required for presenting virtual information to human users. We can also use a combination of projectors and cameras for recovering 3D structures and presenting virtual information [9, 10]. However, all these existing systems require computational costs for recovering 3D information and for generating virtual information for human users. These computational costs are often very large and hamper the realization of such systems. In this paper, we consider the system configuration of augmented reality in a very different way from the existing methods. We do not use image sensors to obtain information about the 3D scene, we do not use computers to recover 3D information, and we do not use displays to present 3D information to human users. We show that even if these components are not used, augmentation of the 3D scene can be achieved, and useful 3D information can be presented to human users. In this paper, we show that by emitting patterned light from multiple projectors, we can emphasize specific 3D areas or specific shapes in the 3D space by coloring these areas and structures with specific colors. The shape of the emphasized
Fig. 1. Emphasizing 3D structure visually by using multiple projectors. The specific areas or specific structures in the 3D space are emphasized by coloring the 3D scene directly using multiple projectors. In (a) convex parts of the surface are emphasized by coloring with red, while in (b) an object in a specific area is emphasized by coloring with red. In (c) the irregularity of products in a factory is emphasized.
areas can be chosen freely. For example, we can design patterned lights so that convex parts of the surface are emphasized visually as shown in Fig. 1 (a), which may be useful for vehicle drivers to see the irregularity of road surfaces. It is also possible to visually emphasize objects in a specific area of the 3D space as shown in Fig. 1 (b), which can be applied to highlighting the irregularity of products in factories as shown in Fig. 1 (c). In our method, we only use multiple projectors. Instead of using computers, we use the blending of lights emitted from multiple projectors to augment the 3D structures. Instead of using displays, we use the object surfaces directly to show the augmented results to human users. As a result, we do not require cameras, computers or displays to realize augmentation of the 3D scene; we only need multiple projectors. Since we do not use computers, the augmentation of the 3D scene can be achieved without any computational delay. That is, everything happens at the speed of light.
2 Emphasizing 3D Structure Using Coded Projection
In this section, we give an overview of 3D structure emphasis, which is based on coded lights emitted from multiple projectors. Let us consider a point $x_i$ on the image of projector $i$, where a light is emitted with radiance $E_i$. Suppose the ray of this light intersects a plane $\pi$ at a point $X$ in the 3D space. Then, the irradiance $I_i$ at point $X$ is described as follows:
$$I_i = \frac{-E_i\,(L_i^{\top} N)}{\left((S_i - X)^{\top} L_i\right)^2} \qquad (1)$$
where $S_i$ denotes the center of projection of projector $i$, and $L_i$ denotes the orientation of the projector, that is, the direction of its principal axis. $N$ represents the surface normal of $\pi$. Unlike a standard point light source, the projector light is calibrated so that a surface fronto-parallel to the
Fig. 2. The billboard space and its projection to projector images
projector has the same irradiance everywhere on the surface. Thus, the irradiance of the surface changes according to the distance of the surface along the principal axis of the projector, as shown in (1). Now, if we have $K$ projectors and these projectors illuminate the same point $X$ in the 3D space, the irradiance $I$ at point $X$ can be described as follows:
$$I = \sum_{i=1}^{K} I_i = \sum_{i=1}^{K} W_i E_i \qquad (2)$$
where $W_i$ represents the contribution of the point $x_i$ of the $i$th projector to the irradiance of point $X$:
$$W_i = \frac{-(L_i^{\top} N)}{\left((S_i - X)^{\top} L_i\right)^2} \qquad (3)$$
As shown in (3), the weight $W_i$ of each projector light to the irradiance of a 3D point differs from projector to projector. What we want to achieve in this research is to emphasize a certain area in the 3D space by coloring objects in that area, emitting light according to the weight of each projector light. However, since a single ray of light may illuminate any point on its line, it is not an easy task to generate proper patterns for multiple projector lights. In the following sections, we propose a method for generating projection patterns linearly to achieve arbitrary 3D space emphasis.
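The weight in (3) depends only on the projector geometry and the surface point, so it can be evaluated directly once the projectors are calibrated. The sketch below (a minimal illustration with hypothetical numbers, not code from the paper) computes $W_i$ for one projector given its center of projection $S_i$, principal axis $L_i$, a surface point $X$ and the surface normal $N$.

```python
import numpy as np

def irradiance_weight(S_i, L_i, X, N):
    """Weight W_i of a projector's radiance at 3D point X, following Eq. (3):
    W_i = -(L_i . N) / ((S_i - X) . L_i)^2."""
    L_i = L_i / np.linalg.norm(L_i)   # unit principal axis
    N = N / np.linalg.norm(N)         # unit surface normal
    depth = np.dot(S_i - X, L_i)      # signed distance along the principal axis
    return -np.dot(L_i, N) / depth**2

# Hypothetical configuration: projector 1 m in front of a fronto-parallel plane.
S = np.array([0.0, 0.0, 1.0])         # center of projection
L = np.array([0.0, 0.0, -1.0])        # principal axis (towards the scene)
X = np.array([0.1, 0.0, 0.0])         # illuminated surface point
N = np.array([0.0, 0.0, 1.0])         # surface normal (facing the projector)
print(irradiance_weight(S, L, X, N))  # ~1.0 for this geometry
```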
3 Billboard Model for Space Representation
We first consider the quantization of the 3D space. The voxel space is often used for representing the 3D space. However, since we consider the projection of light,
we represent the 3D space by using a set of small billboards as shown in Fig. 2. We call it a billboard space. We derive the image pattern of projector lights, so that we can emphasize specific areas in the 3D space by projecting light from multiple projectors to the billboard space. Thus, the objective emphasis pattern is given in the billboard space. Suppose we have $K$ projectors, each of which projects light from image point $x_i$ $(i = 1, \cdots, K)$ to an identical 3D point $X$ as follows:
$$x_i = P_i X \quad (i = 1, \cdots, K) \qquad (4)$$
where $P_i$ denotes the projection matrix of the $i$th projector. Note that, in this paper, points are represented by homogeneous coordinates in equations. We assume these projectors are fully calibrated, and thus the projection matrices $P_i$ $(i = 1, \cdots, K)$ are known. Then, by projecting a billboard using (4), the projection of the billboard can be obtained in the projector image. Since each billboard has a certain width and height, the projected billboard may extend over several pixels in the projector image, and the boundary of the billboard may not coincide with the boundary of image pixels in general, as shown in Fig. 2. Thus, we define the contribution $W_{ij}$ of the $j$th image pixel to the $i$th billboard as the occupied area of the image pixel in the projected billboard:
$$W_{ij} = \frac{S(b_i \wedge p_j)}{S(b_i)} \qquad (5)$$
where $b_i$ denotes the projection of the $i$th billboard in the image, $p_j$ denotes the $j$th image pixel, and $S(\cdot)$ denotes area. Then, we define a vector $\mathbf{W}_i$ which represents the contribution of each image pixel to the $i$th billboard:
$$\mathbf{W}_i = \begin{bmatrix} W_{i1} & \cdots & W_{ij} & \cdots & W_{iN} \end{bmatrix} \qquad (6)$$
where $N$ denotes the number of image pixels in a projector image. Now, we consider a vector $\mathbf{Y}$, which represents the projection pattern of the $N$ pixels in the projector image:
$$\mathbf{Y} = \begin{bmatrix} Y_1 & \cdots & Y_i & \cdots & Y_N \end{bmatrix}^{\top} \qquad (7)$$
where $Y_i$ represents the image intensity of the $i$th pixel in the projector image. Then, the intensity $I_i$ of the $i$th billboard can be described as follows:
$$I_i = \mathbf{W}_i \mathbf{Y} \qquad (8)$$
By using (8) we next consider a method for deriving the projection patterns Yi (i = 1, · · · , K) of K projectors, which generate objective 3D emphasis in the 3D billboard space.
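To make (5)–(8) concrete, the following sketch (an illustration with hypothetical geometry, not the authors' code) uses the Shapely library to compute the fractional overlap $W_{ij}$ between a projected billboard and each image pixel, and then evaluates the billboard intensity $I_i = \mathbf{W}_i \mathbf{Y}$ for a given projection pattern $\mathbf{Y}$.

```python
import numpy as np
from shapely.geometry import Polygon, box

def billboard_weights(projected_billboard, width, height):
    """W_ij = area(b_i ∩ p_j) / area(b_i) for every pixel p_j of a width x height image."""
    b = Polygon(projected_billboard)          # projected billboard corners (pixel coordinates)
    W = np.zeros(width * height)
    for v in range(height):
        for u in range(width):
            pixel = box(u, v, u + 1, v + 1)   # unit-square pixel p_j
            W[v * width + u] = b.intersection(pixel).area / b.area
    return W

# Hypothetical 8x8 projector image and a billboard projected onto it.
W_i = billboard_weights([(2.3, 2.1), (5.6, 2.4), (5.2, 5.7), (2.1, 5.3)], 8, 8)
Y = np.full(8 * 8, 0.5)                       # a flat projection pattern
I_i = W_i @ Y                                 # Eq. (8): billboard intensity
print(I_i)
```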
4 Generating Projection Patterns for 3D Emphasis
As we have seen in section 3, the projection pattern Y generates the intensity Ii of the ith billboard as shown in (8). If the projectors are calibrated, each
billboard can be projected to a projector image, and $\mathbf{W}_i$ $(i = 1, \cdots, M)$ can be computed from (5). Let us consider a vector $\mathbf{I}$ which consists of the objective billboard intensities:
$$\mathbf{I} = [I_1, \cdots, I_i, \cdots, I_M]^{\top} \qquad (9)$$
Then, the relationship between the projection pattern $\mathbf{Y}$ and the objective billboard pattern $\mathbf{I}$ can be described as follows:
$$\begin{bmatrix} \mathbf{W}_1 \\ \mathbf{W}_2 \\ \vdots \\ \mathbf{W}_M \end{bmatrix} \mathbf{Y} = \mathbf{I} \qquad (10)$$
We next consider multiple projectors. Suppose we have $K$ projectors, and these projectors illuminate the same billboard space. Then, since the compound of multiple lights is described by the addition of radiances, the relationship between the projection patterns $\mathbf{Y}_i$ $(i = 1, \cdots, K)$ of the $K$ projectors and the objective billboard pattern $\mathbf{I}$ can be described as follows:
$$\begin{bmatrix} \mathbf{W}_1^1 & \mathbf{W}_1^2 & \cdots & \mathbf{W}_1^K \\ \mathbf{W}_2^1 & \mathbf{W}_2^2 & \cdots & \mathbf{W}_2^K \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_M^1 & \mathbf{W}_M^2 & \cdots & \mathbf{W}_M^K \end{bmatrix} \begin{bmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \\ \vdots \\ \mathbf{Y}_K \end{bmatrix} = \mathbf{I} \qquad (11)$$
where $\mathbf{W}_i^j$ denotes the weight vector $\mathbf{W}_i$ of the $j$th projector. If we rewrite (11) as $\mathbf{L}\mathbf{Y} = \mathbf{I}$, then a least-squares solution of (11) can be computed as follows:
$$\mathbf{Y} = \mathbf{L}^{+} \mathbf{I} \qquad (12)$$
where $\mathbf{L}^{+}$ denotes the pseudo-inverse of $\mathbf{L}$, described as $\mathbf{L}^{+} = (\mathbf{L}^{\top}\mathbf{L})^{-1}\mathbf{L}^{\top}$. Then, by projecting $\mathbf{Y}_i$ $(i = 1, \cdots, K)$ from the $K$ projectors, the billboard space can be emphasized as designed by $\mathbf{I}$. If we want to illuminate the billboard space with color, we design the objective billboard pattern $\mathbf{I}$ in each color channel and carry out the above procedure in each color channel. Then, we can color the billboard space with arbitrary color patterns.
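The linear system (11)–(12) can be solved with a standard least-squares routine. The sketch below (hypothetical sizes and weight matrices; not the authors' implementation) stacks the per-projector weight matrices into $\mathbf{L}$ and recovers the projection patterns of all $K$ projectors.

```python
import numpy as np

def projection_patterns(W_list, I):
    """Solve L Y = I in the least-squares sense (Eq. (12)), where W_list[k] is the
    M x N weight matrix of projector k and I is the objective billboard pattern (M,)."""
    L = np.hstack(W_list)                        # M x (K*N) system matrix
    Y, *_ = np.linalg.lstsq(L, I, rcond=None)    # equivalent to Y = pinv(L) @ I
    K, N = len(W_list), W_list[0].shape[1]
    return Y.reshape(K, N)                       # one pattern of N pixel intensities per projector

# Hypothetical example: 3 projectors, 50 billboards, 40 pixels each.
rng = np.random.default_rng(1)
W_list = [rng.random((50, 40)) for _ in range(3)]
I = rng.random(50)                               # objective billboard intensities
Y = projection_patterns(W_list, I)
print(Y.shape, np.abs(np.hstack(W_list) @ Y.ravel() - I).max())
```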
5 Representation of Negative Intensity
If we compute the projection patterns Yi (i = 1, · · · , K) by using the above method, they include negative intensities in general. However, negative intensities cannot be projected from real projectors. Thus, in this research, we represent negative intensities virtually by considering a virtual zero intensity at the middle of the intensity range.
Fig. 3. Representation of virtual negative intensity
Suppose the 3D space is illuminated by a projector light with a certain constant intensity $I_0$. Then, if we project a light with higher intensity than $I_0$, we can represent a positive intensity as shown in Fig. 3. On the other hand, if we project a light with lower intensity than $I_0$, we can represent a negative intensity virtually. Thus, by adding a certain amount of intensity to the objective billboard pattern $\mathbf{I}$ and computing the projection pattern $\mathbf{Y}_i$ from (12), we can generate projector patterns properly, representing negative intensities as well as positive intensities. Then, we can illuminate the 3D space as we wish. So far, we have proposed a method for emphasizing the 3D scene by projecting coded lights from multiple projectors. We next show the effectiveness of the proposed method in real projector experiments.
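A minimal sketch of this offsetting scheme follows (hypothetical values; not the authors' code, and the size of the constant lift is an assumption of this sketch): a constant bias is added to the objective pattern before solving, so that the computed patterns oscillate around a virtual zero level near the middle of the projector's intensity range, and any residual out-of-range values are finally clipped to [0, 1].

```python
import numpy as np

def patterns_with_virtual_zero(L, I, offset):
    """Solve for projector patterns after lifting the objective pattern by a constant
    offset, so negative components are represented around a virtual zero level."""
    I_lifted = I + offset                   # Section 5: add intensity to the objective pattern
    Y, *_ = np.linalg.lstsq(L, I_lifted, rcond=None)
    return np.clip(Y, 0.0, 1.0)             # physical projectors can only emit [0, 1]

# Hypothetical system: 3 projectors, 30 billboards, 20 pixels per projector.
rng = np.random.default_rng(2)
L = np.hstack([rng.random((30, 20)) for _ in range(3)])
I = rng.uniform(-0.3, 0.3, size=30)         # objective pattern containing "negative" intensities
print(patterns_with_virtual_zero(L, I, offset=1.5).min())
```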
6 Experiments
6.1 Real Projector Experiments
We first show the results from an experiment with real projectors. In this experiment, we used three projectors as shown in Fig. 4 (a). The resolution of these projectors is 640 × 480. We first calibrated these three projectors by using a calibration board and a camera, and computed the projection matrices of these projectors. Although we need a camera for the initial calibration of multiple projectors, we do not need it after the calibration.
Fig. 4. The setup of three projectors and a 3D object used in our experiment. (a) Setup of three projectors; (b) 3D object.
Fig. 5. (a) The objective color pattern of the billboard space, i.e., the objective emphasis pattern; (b) the color pattern of the billboard space synthesized from the projection patterns shown in Fig. 6.
Fig. 6. The projection patterns computed from the proposed method. (a), (b) and (c) show projection patterns for projector 1, 2 and 3 respectively, which were computed from the objective coloring pattern shown in Fig. 5 (a).
Fig. 7. The results of 3D space emphasis. The 3D object was moved in the 3D space, and its colorization was observed at 9 different positions. The image number and the position number in the left image are corresponding to each other.
Fig. 8. The objective color pattern of the billboard space and the color pattern of the billboard space, which was synthesized from the projection patterns shown in Fig. 9
Then, the projection patterns of these projectors were computed from the projection matrices and an objective coloring pattern of the billboard space. In this experiment, the objective coloring pattern of the billboard space was defined as shown in Fig. 5 (a), i.e., a cylindrical space is colored with red in the 3D space. The computed projection patterns of the three projectors are shown in Fig. 6. As shown in this figure, negative as well as positive intensities are required for generating the objective 3D emphasis pattern, and this is realized by using the virtual negative intensity representation. For reference, we generated the coloring pattern of the 3D billboard space by using synthetic projector images. The coloring pattern of the billboard space synthesized from Fig. 6 is shown in Fig. 5 (b). As shown in this figure, the cylindrical space is colored with red properly. However, some errors exist in the coloring. These errors may decrease if we use more projectors. We next projected the computed projection patterns using three real projectors, put the 3D object shown in Fig. 4 (b) into the space, and moved it around in the space. The results of coloring this object are shown in Fig. 7. In this figure, we show the results of coloring at 9 different positions in the 3D space. These 9 positions are shown in the left image of Fig. 7. As shown in Fig. 7, the 3D object is colored with red when it is at the center, and is not colored at the other positions. Thus, we find that the 3D object can be colored and emphasized as designed in the billboard space by using the proposed method. We next designed the objective coloring pattern so that 3D objects are not colored at the center of the 3D space and are colored with red at the other positions, as shown in Fig. 8. The projection patterns of the three projectors for this objective coloring pattern were computed from our method as shown in Fig. 9, and the 3D object was colored by the projection patterns as shown in Fig. 10. From Fig. 10, we find that the proposed method can color the 3D space so that only a specific area is not colored. From these results, we find that the proposed method can emphasize arbitrary shapes and structures in the 3D space.
Fig. 9. The projection patterns computed from the proposed method. (a), (b) and (c) show projection patterns for projector 1, 2 and 3 respectively, which were computed from the objective coloring pattern shown in Fig. 8 (a).
Fig. 10. The results of 3D space emphasis. The 3D object was moved in the 3D space, and its colorization was observed at 9 different positions. The image number and the position number in the left image are corresponding to each other.
6.2 Accuracy Evaluation
We next evaluate the coloring accuracy of the proposed method. The accuracy of coloring was evaluated by the RMS error $E$ between the objective color pattern and the color pattern generated from the proposed method:
$$E = \sqrt{\frac{1}{M}\sum_{i=1}^{M} \left(I_i - \hat{I}_i\right)^2} \qquad (13)$$
where $M$ denotes the number of billboards, and $I_i$ and $\hat{I}_i$ denote the objective intensity and the intensity generated from the proposed method at the $i$th billboard in the 3D space, respectively.
Representation of Negative Intensity. We first evaluate the effectiveness of the representation of virtual negative intensity described in section 5.
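For reference, the RMS error of (13) reduces to a single expression (a sketch with hypothetical billboard intensities; the array names are not from the paper):

```python
import numpy as np

def rms_error(I_obj, I_gen):
    """RMS error of Eq. (13) between objective and generated billboard intensities."""
    return np.sqrt(np.mean((np.asarray(I_obj) - np.asarray(I_gen)) ** 2))

print(rms_error([0.5, 1.0, 0.0], [0.45, 0.9, 0.1]))  # hypothetical billboard values
```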
Fig. 11. The objective 3D pattern and the synthesized 3D pattern from our method. (a) Results when the objective intensity range varies from 0.5 to 1. (b) Results when the objective intensity range varies from 0 to 1.
In this experiment, we assume that the intensity of each projector varies from 0 to 1, and thus the intensity of the compound light varies from 0 to 3. We evaluated the RMS error between the objective color patterns and the color patterns generated from our method in the billboard space. In this experiment, we chose two different intensity ranges for the objective color patterns: one from 0 to 1, and the other from 0.5 to 1. Fig. 11 (a) shows the objective pattern and the generated pattern in the (0.5–1) case, and Fig. 11 (b) shows those in the (0–1) case. Fig. 12 (a) and (b) show the projection patterns of the three projectors in these two cases. The horizontal axis shows the horizontal position in the projection pattern and the vertical axis shows the intensity of the projection pattern. As shown in Fig. 12 (b), the projection pattern of the (0–1) case includes negative intensities. These negative intensities are replaced by 0 before being projected from the projectors, and thus the generated pattern in the billboard space includes large errors. On the other hand, the projection pattern of the (0.5–1) case is within the range from 0 to 1, and can be projected from the projectors as it is. As a result, the RMS error of the (0.5–1) case is 25.6, while that of the (0–1) case is 63.6. Thus, we find that the virtual representation of negative intensity is very effective.
Fig. 12. The projection pattern of three projectors. The horizontal axis in each graph shows the horizontal position in the projection pattern, and the vertical axis shows the intensity of projection pattern. (a) shows the results when the objective intensity varies from 0.5 to 1, while (b) shows the results when the objective intensity varies from 0 to 1.
6.3 Constellation of Multiple Projectors
We next evaluate the relationship between the constellation of multiple projectors and the RMS error of the color patterns generated from our method in the billboard space. The three projectors are positioned at a fixed distance from the center of the billboard space, and the relative direction θ between each pair of projectors viewed from the center of the billboard space is changed from 5° to 45°. Thus, the distance between each pair of projectors is large when the relative direction θ is large. The color patterns generated from our method in the cases of θ = 5°, 20°, 30° and 45° are shown in Fig. 13 (a), (b), (c) and (d) respectively, and the RMS errors of these generated patterns are shown in Fig. 14. The horizontal axis of this graph shows the relative direction between each pair of projectors and the vertical axis shows the RMS error of the color patterns generated from our method. As shown in this graph, the coloring in the 3D space is more accurate when the relative direction between each pair of projectors is large. This property is analogous to the accuracy of stereo cameras, where the 3D measurement is more accurate when the relative distance between the two cameras is larger.
6.4 Number of Projectors
We next evaluate the relationship between the number of projectors and the accuracy of 3D coloring.
Fig. 13. The 3D color patterns generated from multiple projectors, when the relative direction between each pair of projectors viewed from the center of the billboard space is 5°, 20°, 30° and 45°
Fig. 14. The relationship between the relative direction between each pair of projectors and the RMS error of 3D color patterns generated from our method
In this experiment we changed the number of projectors used for coloring the 3D space from 2 to 7, and evaluated the RMS error of the generated color patterns in the 3D space. To avoid the effect of variation in projector positions, we fixed the positions of the leftmost and the rightmost projectors, and inserted projectors between these two at a constant interval. The 3D color patterns generated from our method in the cases of 2, 3, 4, 5, 6 and 7 projectors are shown in Fig. 15, and the RMS errors of these generated patterns are shown in Fig. 16. The horizontal axis of this graph shows the number of projectors, and the vertical axis shows the RMS error of the 3D color patterns generated from our method. As shown in this graph, the coloring in the 3D space is in general more accurate when we use more projectors. However, the accuracy degrades when we use 6 or more projectors. This is because the simple linear computation without any restriction yields large intensity variations in the projector images when we use many projectors, and the variation of intensity
Fig. 15. The 3D color patterns generated from the proposed method, when we use 2, 3, 4, 5, 6 and 7 projectors
Fig. 16. The relationship between the number of projectors and the RMS error of 3D color patterns generated from our method
exceeds the range of projector intensities. This problem may be avoided by using a constrained least-squares method that adds constraints on the range of each projector's intensity.
7 Conclusion
In this paper, we proposed a method for emphasizing the 3D structure of a scene visually by using patterned lights projected from multiple projectors. To generate projector images for arbitrary 3D space emphasis, we proposed a method for designing the objective 3D color patterns in the billboard space and computing the projector images by accounting for the contribution of each projector to the 3D space coloring.
The proposed method enables us to emphasize specific 3D structures of the scene without capturing images of the scene and without estimating its 3D structure. As a result, the 3D structure can be emphasized visually without any computational delay.
References 1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003) 2. Faugeras, O., Luong, Q.-T., Papadopoulo, T.: The Geometry of Multiple Images. MIT Press, Cambridge (2001) 3. Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building Rome in a Day. In: Proc. International Conference on Computer Vision (2009) 4. Malis, E.: Visual servoing invariant to change in camera intrinsic parameters. In: Proc. International Conference on Computer Vision, vol. 1, pp. 704–709 (2001) 5. Kanade, T., Rander, P., Narayanan, P.J.: Virtualized Reality: Constructing Virtual Worlds from Real Scenes. IEEE Multimedia 4(1), 34–47 (1997) 6. Oka, K., Sato, I., Nakanishi, Y., Sato, Y., Koike, H.: Interaction for entertainment contents based on direct manipulation with bare hands. In: Proc. IWEC, pp. 397–404 (2002) 7. Milgram, P., Kishino, F.: A taxonomy of mixed reality visual displays. IEICE Transactions on Information and Systems E77-D(12), 1321–1329 (1994) 8. Sukthankar, R., Stockton, R., Mullin, M.: Smarter presentations: Exploiting homography in camera-projector systems. In: Proc. ICCV 2001, vol. 1, pp. 247–253 (2001) 9. Hall-Holt, O., Rusinkiewicz, S.: Stripe Boundary Codes for Real-Time Structured-Light Range Scanning of Moving Objects. In: Proc. International Conference on Computer Vision, pp. 359–366 (2001) 10. Zhang, L., Curless, B., Seitz, S.: Rapid Shape Acquisition using Color Structured Light and Multi-Pass Dynamic Programming. In: IEEE 3D Data Processing Visualization and Transmission (2002)
Object Class Segmentation Using Reliable Regions Vida Vakili and Olga Veksler Computer Science Department University of Western Ontario London, Canada
Abstract. Image segmentation is increasingly used for object recognition. The advantages of segments are numerous: a natural spatial support to compute features, a reduction in the number of hypotheses to test, the region shape itself can be a useful feature, etc. Since segmentation is brittle, a popular remedy is to integrate results over multiple segmentations of the scene. In previous work, usually all the regions from multiple segmentations are used. However, a typical segmentation algorithm often produces generic regions lacking discriminating features. In this work we explore the idea of finding and using only the regions that are reliable for detection. The main step is to cluster feature vectors extracted from regions and deem as unreliable any clusters that belong to different classes but have a significant overlap. We use a simple nearest neighbor classifier for object class segmentation and show that discarding unreliable regions results in a significant improvement.
1 Introduction
Over the past couple of decades, most object recognition approaches were based on sliding window scanning [1–5], localizing the bounding box of an object. While very successful for some applications, such as face and pedestrian detection [2], there are several well-known issues with the sliding window approach. First, it is mostly appropriate for object classes that are well approximated by a rectangle, such as cars, but not for object classes with thin parts, such as giraffes. Another problem is that there is no precise pixel-level segmentation of the detected object. Also, the sliding window approach is computationally expensive, since a large number of sub-windows at different scales need to be processed in a single image. In recent years, there has been a lot of interest in using the results of a generic image segmentation algorithm to help obtain pixel-precise object segmentation [6–15]. Segmentation is used to subdivide an image into a set of regions. These regions do not have predefined shapes like rectangular patches and their boundaries are more likely to align with the boundaries of objects in the scene. Therefore the shape and boundary properties of regions can be used for feature extraction. Furthermore, it has been shown in [16] that regions from unsupervised bottom-up segmentation can provide a better spatial support for object detection than
rectangular windows. Another advantage of using image regions is scalability and potential savings in computational efficiency. Image regions provide a usually much smaller set of hypotheses to examine compared to the sliding window approach, and at “natural” scales that are obtained through segmentation. Ideally, a good segmentation algorithm would cleanly separate the objects or at least the object parts. However, the state of the art in segmentation is still very far from this goal. With hierarchical segmentation there is a higher chance that a complete object or an object part is contained at some hierarchy level. Therefore some methods use a single hierarchical segmentation algorithm [14, 17] for object recognition. However, even hierarchical algorithms suffer from instability; the results depend on the choice of the feature space and parameter settings. Another strategy is to use smaller image regions, or “superpixels”, as in [15], and regularize the classifier by aggregating histograms in the neighborhood of each superpixel. However, if superpixels are too small, they may not have enough spatial support for feature extraction. If superpixels are bigger, then there is a larger chance for an error due to inaccuracies in superpixel boundaries, since superpixels are used as indivisible units. A more commonly used strategy is to integrate the results over multiple different segmentations of the same scene [6, 13, 18, 19]. Usually several unsupervised bottom-up segmentation algorithms, each with different parameters, are used. With more segments, there is a higher chance that some segments contain whole objects or easily recognizable object parts. In support of using multiple segmentations, [16] shows that regions generated by a single segmentation of a scene usually do not contain entire objects. Previous work that relies on multiple segmentations either makes the assumption that at least one segmentation separates the object from the background [6, 19], or uses all segments as reliable for object class recognition [13]. Clearly many regions from multiple segmentations are rather generic blobs that do not contain features useful for discrimination. This is because most bottom-up segmentation algorithms are implicitly or explicitly designed to partition an image into a set of uniform patches in some feature space. The main idea of our approach is to find reliable regions and discard the ones that are not reliable for object recognition. This should result in more stable decisions at the time of object recognition. In the training stage, first multiple image segmentations are obtained by varying the parameters of the FH [20] and Mean Shift [21] segmentation algorithms. Each segment is described by a large set of features based on color, texture, etc. Then discriminative features are selected using the Relief-F algorithm [22]. For robustness, we do not want to decide on reliability individually for each segment. It is more reliable to make this decision for a whole cluster containing similar segments. A single object class is usually segmented into different, possibly overlapping parts, and each part may bear little similarity to the other parts. Therefore we cluster each object class (separately from the other classes) using the k-means algorithm, with the aim that each cluster contains similar segments of the object class. Clusters from different objects could have significant intersection, which
makes them unreliable for recognition. The overlapping clusters contain segments that are generic blobs not corresponding to an easily recognizable object part. These clusters are discarded. In the test stage, first multiple image segmentations are generated from the test image and features are extracted for each segment. Reliable segments are found based on their distance to the reliable clusters obtained at the training stage. Since our goal is to learn to determine unreliable segments, and not necessarily to come up with a state-of-the-art object class segmentation system, we use a simple nearest neighbor classifier as opposed to more sophisticated techniques. Initially, each reliable segment is classified independently, and the results may not be spatially coherent. To encourage coherency and integrate the results of multiple segmentations, we use graph-cut optimization [23] to generate the final result. This leads to a significant improvement in accuracy compared to single segmentation results. Experimental results on the MSRC database show that discarding unreliable regions results in a significant accuracy improvement. Even though the goal was not to come up with a state-of-the-art classifier, surprisingly, a simple nearest neighbor classifier with reliable regions is not too far from the state-of-the-art methods.
2 Related Work
Our work is most related to methods that use multiple segmentations for various object recognition tasks. An early work that uses segmentation for recognition is [24]. Their goal is to associate labels with image regions, and they use multiple segmentation algorithms to evaluate which one is better for their task. The method introduced in [6] seeks to discover visual object categories and their segmentations in a collection of unlabeled images, starting with regions generated by several unsupervised image segmentation approaches. However, they make an assumption that is often violated in practice: they assume that at least one of the regions contains the full object. Multiple segmentations were also used by [18] to label each pixel with its rough geometric surface class. Multiple segmentations contribute to a wider set of hypotheses to explore compared to a single segmentation. Their work popularized the use of multiple segmentations for various computer vision tasks. In [19], multiple stable segmentations are used to localize objects in a weakly supervised framework. A segmentation is considered stable if small perturbations of the image do not cause significant changes in the segmentation results. The idea is that stable segments might be more adequate for object recognition. However, stable regions could still correspond to generic blobs, and are not necessarily useful for object recognition. In [13], the regions generated from multiple segmentations are classified in a supervised manner and then all segments are integrated to get the final result. However, their approach assumes that all the regions generated by multiple segmentations are beneficial for object recognition, which is likely not true in practice.
The importance of incorporating multiple segmentations into object recognition, as opposed to a single segmentation, is investigated in [16]. They show two important results: good spatial support is important for recognition and multiple segmentations provide better spatial support for objects than bounding boxes and a single segmentation.
3 Proposed Approach
In this section, we describe our approach to object class segmentation with reliable regions. In the training stage, reliable regions for each object class are determined and retained, and unreliable regions are discarded. In the test stage, unreliable regions are discarded as well. Reliable regions are classified with the nearest neighbor classifier using the pool of reliable regions obtained at the time of training. Finally, region classifications are combined and regularized using graph cuts to get the final object map.
3.1 Obtaining Multiple Segmentations
To obtain multiple segmentations, we use two popular and efficient segmentation algorithms, FH [20] and Mean Shift [21]. Multiple segmentations are generated by varying the parameters of each algorithm. We obtain 18 distinct segmentations with the FH algorithm, and 16 with the mean-shift algorithm. Figure 1 shows some examples of segmentation images that are generated for one of the input images. This figure illustrates the possible variation in segmentation regions and the different characteristics of the segmentation algorithms.
3.2 Features from Regions
We explored features based on size, location, color, texture, probability of boundary, and edge histograms. The location, shape, color and texture features are the same as in [18]. Probability of boundary is based on the boundary detector introduced in [25], which is better at detecting object boundaries than conventional edge detection methods because it is less distracted by texture gradients. We take the mean value of the probability of boundary within a two-pixel-wide band around the region boundary. The distribution of intensity gradients can be useful for shape and appearance description. Therefore two kinds of edge descriptors are used to capture edge cues: the Histogram of Oriented Gradients (HOG) [3] and the Edge Orientation Histogram (EOH) [26]. In our work, HOG and EOH are computed with 8 bins for all the pixels in each region and also for the pixels within a two-pixel radius of the region boundary. They are normalized by the size of the region or the size of the region boundary, respectively. The initial set of features has size 219. In order for clustering to perform reliably, it is necessary to reduce dimensionality [27]. We use the Relief-F algorithm [22] for feature selection in this work. We retain the top 15 features ranked
Fig. 1. Some examples of segmentation results from FH and Mean Shift algorithms with different input parameters
by Relief-F. These features are based on color, location, edge histogram and texture filters. It is interesting to note that boundary features were not selected.
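A possible way to reproduce this feature-selection step (a sketch under the assumption that the third-party skrebate package providing a Relief-F implementation is available; the arrays, their sizes, and the neighborhood setting are hypothetical) is to rank the 219 region features and keep the 15 with the highest scores:

```python
import numpy as np
from skrebate import ReliefF   # third-party Relief-F implementation (assumption)

# Hypothetical training data: 219-dimensional feature vectors of labeled segments.
rng = np.random.default_rng(3)
X = rng.random((500, 219))                # one row per training segment
y = rng.integers(0, 21, size=500)         # MSRC-style class label per segment

relief = ReliefF(n_neighbors=10)
relief.fit(X, y)
top15 = np.argsort(relief.feature_importances_)[::-1][:15]  # indices of top-ranked features
X_selected = X[:, top15]                  # retain only the 15 most discriminative features
print(top15)
```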
3.3 Clustering
For robustness, we want to make a decision on the reliability of a whole group of similar segments, rather than decide for each segment separately. Therefore we need to perform segment clustering within each object class. A single object class is usually segmented into several possibly overlapping parts, sometimes bearing little similarity to each other. As illustrated in Figure 2, a cow generally gets segmented into the head, body, and legs. Therefore segments resulting from multiple segmentations of multiple instances of a single object class typically consist of several distinct clusters. To facilitate discovery of reliable segments, first the clustering structure resulting from multiple segments of a single object class must be captured. We use k-means [28] to cluster each object class. The number of clusters for each object is chosen between 4 and 8 so that the resulting number of clusters maximizes the silhouette coefficient. Figure 3 shows some representative samples for four out of six clusters from the cow class and three out of five clusters from the car class. Each cluster corresponds to a separate column. For the cow class, the clusters tend to correspond to bodies, heads, vertical segments, and horizontal segments, respectively. For the car class, the clusters seem to correspond to the horizontal structures, the body, and the “lights and wheels” parts.
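The following sketch (hypothetical data; not the authors' code) illustrates this per-class clustering step with scikit-learn, choosing the number of clusters in [4, 8] that maximizes the silhouette coefficient:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_object_class(X, k_range=range(4, 9), seed=0):
    """Cluster the feature vectors X of one object class with k-means,
    picking the k in k_range with the highest silhouette coefficient."""
    best = None
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        score = silhouette_score(X, km.labels_)
        if best is None or score > best[0]:
            best = (score, km)
    return best[1]   # fitted KMeans model with the chosen number of clusters

# Hypothetical: 300 segments of one class, each described by the 15 selected features.
X = np.random.default_rng(4).random((300, 15))
model = cluster_object_class(X)
print(model.n_clusters, model.cluster_centers_.shape)
```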
Fig. 2. First column shows the original images and the second column their typical segmentations
Fig. 3. Left: Four clusters from the cow class, right: 3 clusters from the car class. Each column corresponds to a separate cluster. Six representative members are shown for each cluster.
3.4 Discarding Unreliable Segments
So far we used all regions generated from multiple segmentations in the clustering; however, clusters from different objects could have a large amount of intersection with each other, which makes them unreliable for object detection. To get rid of unreliable clusters, those clusters that have a large amount of intersection with clusters of different object classes should be discarded. In our work, first the average intersection of each cluster with all other clusters of different classes is calculated, and then for each object class two or more clusters with the least amount of intersection are retained. These clusters are reliable and will lead to more certain and stable decisions at the time of object recognition. There are many different ways to measure cluster overlap. Since the number of training samples is relatively small and the dimensionality of our data is relatively high, we avoid measuring cluster overlap in the original feature space to avoid the problems associated with high dimensionality. Instead, we project the two clusters for which we want to measure the overlap onto the best one-dimensional subspace, so that the samples from each cluster stay as separated as possible. This step is performed with the Fisher linear discriminant [29]. According to the Fisher linear discriminant, two clusters should be projected on the line in the direction $V$ which maximizes:
$$J(V) = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2} \qquad (1)$$
where $\mu$ is the mean and $s^2$ is the scatter of the projection. The best $V$ for projection is found with eigenvector analysis. After the two clusters are projected on the best line that separates them, the value of equation (2) can be used to quantify the overlap in the projected space:
$$J(c_i^l, c_j^k) = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad (2)$$
where $c_i^l$ denotes cluster $i$ from object class $l$, and $\sigma_1^2$ and $\sigma_2^2$ are the variances of the projections of the two clusters. The larger the value of equation (2), the better the separation between the two clusters. To measure how reliable one cluster is, the average of equation (2) is calculated over all clusters of the other classes, see equation (3):
$$J(c_i^l) = \frac{\sum_{j,\,k \neq l} J(c_i^l, c_j^k)}{\iota} \qquad (3)$$
where $\iota$ is the number of clusters whose object label is not equal to $l$. For each object class we choose two or more clusters with the best value of $J(c_i^l)$. In Figure 3, for the cow class, the first and second clusters (corresponding to the first and second column) are reliable, and the last two are unreliable. For the car class, the first cluster is unreliable and the last two are reliable. It can be seen from the figure that the reliable clusters contain segments that are more
easily recognizable as parts of the object class. Unreliable clusters tend to have segments that are more generic and shared between classes, such as blobs of certain eccentricity and orientation.
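To make the overlap measure of (1)–(3) concrete, the sketch below (hypothetical clusters; not the authors' code) computes the Fisher direction between two clusters and the resulting separation score of (2); for the two-class case the closed form $S_w^{-1}(\mu_1 - \mu_2)$ used here is equivalent to the eigenvector analysis mentioned above. Averaging the score over all clusters of the other classes gives (3).

```python
import numpy as np

def fisher_separation(A, B, eps=1e-9):
    """Project clusters A and B (rows = samples) onto the Fisher direction and
    return J = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2) in the projected 1D space."""
    m1, m2 = A.mean(axis=0), B.mean(axis=0)
    # Within-class scatter; the two-class Fisher direction is S_w^{-1} (m1 - m2).
    Sw = np.cov(A, rowvar=False) * (len(A) - 1) + np.cov(B, rowvar=False) * (len(B) - 1)
    v = np.linalg.solve(Sw + eps * np.eye(Sw.shape[0]), m1 - m2)
    a, b = A @ v, B @ v                        # 1D projections of the two clusters
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + eps)

def cluster_reliability(cluster, other_clusters):
    """Eq. (3): average separation of one cluster from all clusters of other classes."""
    return np.mean([fisher_separation(cluster, c) for c in other_clusters])

# Hypothetical clusters in the 15-dimensional selected feature space.
rng = np.random.default_rng(5)
cow_head = rng.normal(0.0, 1.0, (80, 15))
others = [rng.normal(mu, 1.0, (80, 15)) for mu in (0.2, 2.0, 3.0)]
print(cluster_reliability(cow_head, others))
```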
4 Object Class Segmentation
So far we have discussed the process of creating a pool of reliable clusters from training images. Given a novel image to segment into its constituent object classes, we first generate multiple segmentations using the FH and mean-shift algorithms. For each segment we extract the fifteen features selected as discriminative at the training stage. We discard unreliable segments using the following procedure. We compute the distance from the current segment to the cluster centers obtained at the training stage. If the current segment is closest to an unreliable cluster center, it is itself deemed unreliable and discarded. Otherwise it is classified using the k-nearest neighbor classifier. We set k = 8 for the experiments. Notice that since we discard unreliable segments for a test image, there is a chance that a pixel does not get classified because it is contained only in unreliable segments. We found experimentally that the percentage of such pixels is small. In most cases, if a pixel is in an unreliable generic blob in one segmentation, it often belongs to a reliable region in another segmentation, usually at a different scale. The small percentage of pixels that do not get classified at this stage do get classified at the next stage, which we now describe. Since all the segments are classified independently, to improve coherence and integrate the results of multiple segmentations, we use the graph-cut optimization framework [23]. We seek a final labeling $f$ that minimizes the following energy:
$$E(f) = \sum_{p \in P} D_p(f_p) + \sum_{(p,q) \in N} w_{pq} \cdot \delta(f_p \neq f_q) \qquad (4)$$
The first term in equation (4) is called the data term and it is the penalty for pixel $p$ to be assigned label $f_p$. $D_p$ measures how well label $f_p$ suits pixel $p$. This constraint ensures that the current labeling $f$ is consistent with the observed data. We set the data term as follows. The more often pixel $p$ was classified as object class $l$, the smaller is $D_p(l)$, making object class $l$ more likely for $p$. Specifically, $D_p(f_p)$ is defined as the negative of the number of segments that contain pixel $p$ and get classified as $f_p$. The second term in equation (4) is known as the smoothness term. The smoothness term measures the extent to which $f$ is not smooth. We use the Potts model. Here $\delta(f_p \neq f_q)$ is 0 if $f_p = f_q$ and 1 otherwise. A typical choice for $w_{pq}$ is $\lambda e^{-\frac{(I_p - I_q)^2}{2\sigma^2}}$, where $I_p$ and $I_q$ are the intensities of pixels $p$ and $q$. The parameter $\sigma$ is related to the level of variation between neighboring pixels. The parameter $\lambda$ controls the relative importance of the data term versus the smoothness term. We set $\lambda$ to 10 in this work. This energy function can be approximately minimized with the alpha-expansion algorithm of [23]. The code for alpha-expansion is taken from Boykov et al. [23, 30, 31].
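The sketch below (hypothetical inputs and label indices; not the authors' implementation, and the final alpha-expansion call is omitted) shows how the data term and the Potts pairwise weights of (4) can be assembled from the per-segment classifications:

```python
import numpy as np

def data_term(segment_masks, segment_labels, n_labels, shape):
    """D_p(l) = -(number of segments containing p that were classified as l)."""
    D = np.zeros(shape + (n_labels,))
    for mask, label in zip(segment_masks, segment_labels):
        D[mask, label] -= 1.0           # each vote lowers the cost of that label at p
    return D

def pairwise_weight(Ip, Iq, lam=10.0, sigma=10.0):
    """Potts weight w_pq = lambda * exp(-(Ip - Iq)^2 / (2 sigma^2))."""
    return lam * np.exp(-(float(Ip) - float(Iq)) ** 2 / (2.0 * sigma ** 2))

# Hypothetical example: a 4x4 image, 21 classes, two reliable segments.
shape, n_labels = (4, 4), 21
masks = [np.zeros(shape, bool), np.zeros(shape, bool)]
masks[0][:2, :] = True                  # segment classified as label 1 (hypothetical class)
masks[1][2:, :] = True                  # segment classified as label 3 (hypothetical class)
D = data_term(masks, [1, 3], n_labels, shape)
print(D[0, 0, 1], pairwise_weight(120, 128))
# D and w_pq would then be passed to an alpha-expansion solver.
```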
Notice that if a pixel belongs only to unreliable segments, i.e., it did not get classified with any object class for any segmentation, then its data terms are equal to zero for all classes. Such a pixel gets classified based only on propagation from its surrounding context due to the smoothness term.
5 Experimental Results
We tested our approach on the MSRC 21-class data set [32]. In this data set, pixels close to object boundaries and in non-object regions are labeled void in the ground truth images. These pixels are ignored in the evaluation. Since the ground truth may contain different labels for any given segment, for training we use only the segments that have more than 70% label consistency, that is, at least 70% of the pixels should belong to a single class. Table 1 gives the quantitative results using graph cuts for integrating multiple segmentations. The second row shows the results when only reliable regions are used for object detection and the first row shows the results of using all regions generated by all of the segmentations. As can be seen, the results are better by 11% in pixel-averaged and class-averaged accuracy when just the reliable regions are used. For all object classes except the sheep, sky and boat classes, using just the reliable regions outperforms the approach where all regions are used. This shows that many regions do not have informative features to distinguish an object or an object part. Using all of these regions instead of just the reliable ones can mislead object recognition.
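For clarity, the two accuracy measures reported in Table 1 can be computed from a class confusion matrix as follows (a sketch with a hypothetical confusion matrix; void pixels are assumed to have been excluded already):

```python
import numpy as np

def msrc_accuracies(confusion):
    """confusion[i, j] = number of pixels of ground-truth class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    pixel_avg = np.trace(confusion) / confusion.sum()
    per_class = np.diag(confusion) / np.maximum(confusion.sum(axis=1), 1)
    return pixel_avg, per_class.mean()

# Hypothetical 3-class confusion matrix (rows: ground truth, columns: prediction).
C = [[80, 10, 10],
     [ 5, 90,  5],
     [20, 20, 60]]
print(msrc_accuracies(C))   # (pixel-averaged, class-averaged) accuracy
```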
Table 1. Quantitative performance using multiple segmentations. In the first row, all regions from multiple segmentations are used for recognition, while in the second row just reliable regions are used.

Method                       | Pixel Avg | Class Avg | building | grass | tree | cow | sheep | sky | aeroplane | water | face | car | bike | flower | sign | bird | book | chair | road | cat | dog | body | boat
Graph cut (all regions)      |    57     |    44     |    47    |  81   |  60  | 23  |  49   | 96  |    31     |  46   |  60  | 41  |  39  |   65   |  35  |  22  |  43  |  24   |  63  | 30  | 11  |  33  |  31
Graph cut (reliable regions) |    68     |    55     |    60    |  90   |  70  | 50  |  43   | 87  |    46     |  87   |  64  | 54  |  46  |   71   |  43  |  26  |  67  |  25   |  75  | 39  | 50  |  37  |  24
We also measure the improvement that graph-cut integration of multiple segmentations gives over single segmentation results. We compute the accuracy improvement of the graph-cut results over each of the 34 single segmentation results. On average, the class-averaged accuracy increases by 7.8% with graph cuts, with a standard deviation of 0.03%. The smallest improvement is 4.0% and the best improvement is 14.5%. Figure 4 shows some successful qualitative results. Some partially successful examples or failures are shown in Figure 5. The first column contains the input
Fig. 4. Successful qualitative results generated using graph cuts for integrating multiple image segmentations. a) Original images. b) Ground truth images. c) Results generated using all the regions. d) Results generated using reliable regions. The black pixels in all result images are not labeled in ground truth images and therefore are not considered in our evaluation. This picture is best viewed in color.
Object Class Segmentation Using Reliable Regions
133
images, the second column shows the ground truth segmentation. The third and fourth columns present maps of the most likely class at each pixel using all regions and reliable regions, respectively. The black pixels in the result images are considered as void in ground truth images and are not considered in our evaluation. These are the pixels for which no accurate ground truth could be constructed by the authors of the dataset. As can be seen from Figure 4 and Figure 5 using all regions leads to less accurate results as opposed to using only the reliable regions.
Fig. 5. Partially successful or failure results generated using graph cuts for integrating multiple image segmentations. a) Original images. b) Ground truth. c) Results generated using all the regions. d) Results generated using reliable regions. The black pixels in all result images are not labeled in ground truth images and therefore are not considered in our evaluation. This picture is best viewed in color.
5.1 Comparison to Previous Work
The purpose of this work was not to come up with a state-of-the-art algorithm but to show that discarding unreliable regions from multiple segmentations gives better results for object recognition. This was shown using a simple nearest neighbor framework. Even though the goal was not a state-of-the-art algorithm, we compare the performance of the proposed system to recent state-of-the-art work [7, 8, 13]. Notice that [7, 8, 13] use much more sophisticated machine learning algorithms and features. Nevertheless our overall performance is not that much worse, which is surprising given the simplicity of our classifier. Furthermore, we are even able to achieve better accuracy for some classes. Our approach is superior on the “water”, “sign”, “bird”, “chair”, “dog” and “boat” object classes.
Table 2. The comparison of our approach with that of Shotton et al. [7], Verbeek and Triggs [8] and Pantofaru et al. [13]

Method                 | Pixel Avg | Class Avg | building | grass | tree | cow | sheep | sky | aeroplane | water | face | car | bike | flower | sign | bird | book | chair | road | cat | dog | body | boat
Shotton et al. [7]     |    72     |    58     |    62    |  98   |  86  | 58  |  50   | 83  |    60     |  53   |  74  | 63  |  75  |   63   |  35  |  19  |  92  |  15   |  86  | 54  | 19  |  62  |   7
Verbeek and Triggs [8] |    74     |    64     |    52    |  87   |  68  | 73  |  84   | 94  |    88     |  73   |  70  | 68  |  74  |   89   |  33  |  19  |  78  |  34   |  89  | 46  | 49  |  54  |  31
Pantofaru et al. [13]  |    74     |    60     |    68    |  92   |  81  | 58  |  65   | 95  |    84     |  81   |  75  | 65  |  68  |   53   |  35  |  23  |  85  |  16   |  83  | 48  | 29  |  48  |  15
reliable regions       |    68     |    55     |    60    |  90   |  70  | 50  |  43   | 87  |    46     |  87   |  64  | 54  |  46  |   71   |  43  |  26  |  67  |  25   |  75  | 39  | 50  |  37  |  24
The running time of our approach could be improved significantly; it is currently in minutes. The most time-consuming part is obtaining multiple segmentations and extracting features.
6 Conclusion
We develop a novel object recognition algorithm based on selecting reliable regions from multiple segmentations. We produce a pixel-wise segmentation of the objects in the image. To increase the robustness of our system to poor segmentations, and also in order to integrate the results of multiple segmentations, information from multiple image segmentations is combined using the graph-cut optimization framework. The main contribution of our work is a clustering approach to find the regions that are reliable for object recognition. Experiments show that when using only the reliable regions, the results improve by approximately 11% in both pixel-averaged and class-averaged accuracy.
References 1. Rowley, H., Baluja, S., Kanade, T.: Human face detection in visual scenes. In: Advances in Neural Information Processing Systems, pp. 875–881 (1996) 2. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Computer Vision 57, 137–154 (2004) 3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Conference on Computer Vision and Pattern Recognition, p. 886 (2005) 4. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 36–51 (2008) 5. Lampert, C., Blaschko, M., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: Conference on Computer Vision and Pattern Recognition (2008)
6. Russell, B., Efros, A., Sivic, J., Freeman, W., Zisserman, A.: Using multiple segmentations to discover objects and their extent in image collections. In: Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1605–1614 (2006) 7. Shotton, J., Winn, J., Rother, C., Criminisi, A.: Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: European Conference on Computer Vision, pp. 1–15 (2006) 8. Verbeek, J., Triggs, B., Inria, M.: Region classification with markov field aspect models. In: Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 9. Verbeek, J., Triggs, B.: Scene segmentation with conditional random fields learned from partially labeled images. In: Proc. NIPS (2008) 10. He, X., Zemel, R.: Learning hybrid models for image annotation with partially labeled data. In: Advances in Neural Information Processing Systems, NIPS (2008) 11. Schroff, F., Criminisi, A., Zisserman, A.: Object class segmentation using random forests (2008) 12. Gould, S., Rodgers, J., Cohen, D., Elidan, G., Koller, D.: Multi-class segmentation with relative location prior. International Journal of Computer Vision 80, 300–316 (2008) 13. Pantofaru, C., Schmid, C., Hebert, M.: Object recognition by integrating multiple image segmentations. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 481–494. Springer, Heidelberg (2008) 14. Gu, C., Lim, J., Arbelaez, P., Malik, J.: Recognition Using Regions. In: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proc. CVPR, pp. 1030–1037 (2009) 15. Fulkerson, B., Vedaldi, A., Soatto, S.: Class segmentation and object localization with superpixel neighborhoods. In: IEEE International Conference on Computer Vision, pp. 670–677 (2009) 16. Malisiewicz, T., Efros, A.: Improving spatial support for objects via multiple segmentations. In: British Machine Vision Conference (BMVC), vol. 2, BMVC (2007) 17. Todorovic, S., Ahuja, N.: Learning subcategory relevances for category recognition. In: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 18. Hoiem, D., Efros, A., Hebert, M.: Recovering surface layout from an image. International Journal of Computer Vision 75, 151–172 (2007) 19. Galleguillos, C., Babenko, B., Rabinovich, A., Belongie, S.: Weakly supervised object localization with stable segmentations. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 193–207. Springer, Heidelberg (2008) 20. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. International Journal of Computer Vision 59, 167–181 (2004) 21. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 603– 619 (2002) 22. Kira, K., Rendell, L.: A practical approach to feature selection. In: Proceedings of the Ninth International Workshop on Machine Learning Table of Contents, pp. 249–256. Morgan Kaufmann Publishers Inc., San Francisco (1992) 23. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1222–1239 (2001)
136
V. Vakili and O. Veksler
24. Barnard, K., Duygulu, P., Guru, R., Gabbur, P., Forsyth, D.: The effects of segmentation and feature choice in a translation model of object recognition. In: Conference on Computer Vision and Pattern Recognition, pp. 675–682 (2003) 25. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 530–549 (2004) 26. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: The importance of good features. In: Conference on Computer Vision and Pattern Recognition, vol. 2 (2004) 27. Berkhin, P.: A survey of clustering data mining techniques. Grouping Multidimensional Data, 25–71 (2006) 28. Hartigan, J.: Clustering algorithms. Wiley, New York (1975) 29. Fisher, R.: The use of multiple measures in taxonomic problems. Ann. Eugenics 7, 179–188 (1936) 30. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1124–1137 (2004) 31. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 147– 159 (2004) 32. Shotton, J., Winn, J., Rother, C., Criminisi, A.: The MSRC 21-class object recognition database (2006)
Specular Surface Recovery from Reflections of a Planar Pattern Undergoing an Unknown Pure Translation
Miaomiao Liu1, Kwan-Yee K. Wong1, Zhenwen Dai2, and Zhihu Chen1
1 Dept. of Computer Science, The University of Hong Kong, Hong Kong
2 Frankfurt Institute for Advanced Studies, Goethe-University Frankfurt
Abstract. This paper addresses the problem of specular surface recovery, and proposes a novel solution based on observing the reflections of a translating planar pattern. Previous works have demonstrated that a specular surface can be recovered from the reflections of two calibrated planar patterns. In this paper, however, only one reference planar pattern is assumed to have been calibrated against a fixed camera observing the specular surface. Instead of introducing and calibrating a second pattern, the reference pattern is allowed to undergo an unknown pure translation, and a closed form solution is derived for recovering such a motion. Unlike previous methods which estimate the shape by directly triangulating the visual rays and reflection rays, a novel method based on computing the projections of the visual rays on the translating pattern is introduced. This produces a depth range for each pixel which also provides a measure of the accuracy of the estimation. The proposed approach enables a simple auto-calibration of the translating pattern, and data redundancy resulting from the translating pattern can improve both the robustness and accuracy of the shape estimation. Experimental results on both synthetic and real data are presented to demonstrate the effectiveness of the proposed approach.
1 Introduction
Shape recovery has been a central topic in computer vision over the past few decades. Traditional shape recovery methods can be classified into different categories according to the information they employ, such as structure from motion [1], shape from silhouettes [2], and shape from shading [3]. Most of the existing methods, however, can only handle diffuse objects due to the fact that the information they exploit is derived under the assumption of a diffuse surface. For instance, structure-from-motion methods cannot be applied to a specular (mirror) object as the features on the object surface are the reflections of its surrounding environment and are therefore not viewpoint independent. Similarly, shape-from-silhouettes methods have difficulties in handling specular objects as it is not a trivial task to extract the silhouette of a mirror object. The underlying principle of shape-from-shading methods states that the shadings of the object follow Lambert's cosine law, and therefore it is obvious that
specular surfaces cannot be recovered using shape-from-shading methods. In addition, the shape of a specular object also cannot be obtained using a laser scanner since the surface may often reflect the laser ray away from the sensor of the scanner. In order to allow shape recovery of specular objects, special algorithms have to be developed under the assumption of a specular surface whose appearance follows the law of reflection. In [4], Zisserman et al. pointed out that a moving observer can tell the physical properties, such as convexity or concavity, of a specular surface. Fleming et al. [5] also showed that human beings have the ability to distinguish the shape of a specular surface even when no knowledge about the environment is available. In [6], Oren and Nayar explored the relation between image trajectories of virtual features induced by specular reflection and the 3D geometry of the specular object they travel on. Roth and Black [7] proposed to use both optical flow and specular flow to recover an object which is a mixture of specular and diffuse materials. In [8], Adato et al. analyzed the valuable information carried by the specular flow induced by the movement of a camera, and derived a non-linear partial differential equation (PDE) to describe the relation between the 3D shape and motion of the specular object. In [9], Canas et al. linearized this non-linear PDE in order to make it easier for practical use. The aforementioned works are regarded as shape from specular flow (SFSF) methods. SFSF methods generally require a dense observation of specular flow, but dense specular flows often merge together in an undetermined way. Besides, SFSF methods recover the specular surface by solving a PDE, and the solution is greatly affected by the initial boundary condition. Furthermore, the motion of the specular object must be small and continuous in order to derive the PDE. All these limitations make SFSF methods not very useful in practice. Great efforts have also been devoted to specular shape recovery techniques under known surroundings. In [10, 11], Savarese and Perona demonstrated that local surface geometry properties of a mirror object can be determined by analyzing the local differential properties of the reflection of two calibrated lines. In [12], Bonfort and Sturm introduced a voxel carving method to recover a specular surface using a normal consistency criterion. Seitz et al. [13] reduced the specular shape recovery problem to a light path reconstruction problem. They showed that surface points can be obtained if the positions of two reference points are known in space for one single view, or the position of one reference point is known and is visible in two different views. In [14], a phase shift method was used to determine the correspondence between a reference point in space and its reflection on the image plane. Nehab et al. [15] reduced the shape recovery problem to an image matching problem by minimizing a cost function based on normal consistency. Note that calibration plays an important role in specular shape recovery methods under known surroundings. Bonfort et al. [16] used a mirror to recalibrate the reference plane each time its position was changed. In [17], Rozenfeld et al. suggested an approach to recover the surface by using an empirically calculated 1D homography. Their proposed method only needs sparse plane-image correspondences, but depends greatly on optimization.
In this paper, we adopt the approach of specular shape recovery under known surroundings. Unlike previous methods which assume a fully calibrated environment, we propose a novel solution based on observing the reflections of a translating planar pattern. We assume a reference planar pattern is initially calibrated against a fixed camera observing a specular surface. This pattern is then allowed to undergo an unknown pure translation. The motion of the pattern as well as the shape of the specular object can then be estimated from the reflections of the translating pattern. The main contributions of this paper are 1) a closed form solution for recovering the motion of the translating pattern, and 2) a novel method for shape recovery based on computing the projections of the visual rays on the translating pattern. The proposed auto-calibration method allows an easy calibration of the translating pattern, and data redundancy resulting from the translating pattern can improve both the robustness and accuracy of the shape estimation. The proposed shape recovery method produces a depth range for each pixel in which the true surface point exists. This in turn provides a measure of the accuracy and confidence of the estimation. The rest of the paper is organized as follows. Section 2 describes the physical configuration of our proposed translating pattern approach. Based on geometric constraints, a closed form solution is derived for recovering the unknown motion of the translating pattern in Section 3. A degenerate case is also studied and analyzed. Section 4 introduces a projection based method for specular shape recovery using the estimated motion. Experimental results on both synthetic and real data are presented in Section 5, followed by conclusions in Section 6.
2 Translating Planar Pattern
Consider a pinhole camera, with camera center O, looking at a specular object S which reflects a nearby planar pattern Π to the image I (see Fig.1). Consider now a point P on S. Let A be a point on Π which is reflected by S at P to a point x on I. Suppose now the planar pattern undergoes a pure translation given by an unknown translation vector T, and let Ω denote the planar pattern at the new location after translation. Let B be a point on Ω such that it is collinear with A and P. According to the law of reflection, B will also be reflected by S at P to the same point x on I. As shown in [13, 16], the 3D position of P can be obtained by intersecting the visual ray (defined by O and x) with the reflection ray (defined by A and B). In order to construct the reflection ray, both the 3D positions of A and B are needed, and their accuracies would directly affect the accuracy of the estimated 3D position of P. The relative positions of A and B on the planar pattern can be resolved by encoding the pattern using gray code and observing the corresponding intensities at x. Besides, in order to obtain the 3D positions of A and B, the relative positions and orientations of Π and Ω with respect to O must also be known. In this paper, Π is assumed to have been calibrated against O initially. The problem of calibrating Ω is thus reduced to the problem of estimating the unknown translation vector T. In the next section, a closed form solution for T will be derived based on some geometric constraints.
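As noted above, P is recovered by intersecting the visual ray (through O and x) with the reflection ray (through A and B). With noisy measurements the two estimated rays rarely meet exactly, so a common numerical realization — our illustration, not a step prescribed by the paper — is to take the midpoint of closest approach between the two rays. A minimal sketch in Python/NumPy:

```python
import numpy as np

def triangulate_rays(O, d1, A, d2):
    """Midpoint of closest approach between the visual ray O + t*d1
    and the reflection ray A + s*d2 (d2 points from A towards B)."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    r = A - O
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    t = (c * (d1 @ r) - b * (d2 @ r)) / denom   # depth along the visual ray
    s = (b * (d1 @ r) - a * (d2 @ r)) / denom   # depth along the reflection ray
    P1 = O + t * d1
    P2 = A + s * d2
    return 0.5 * (P1 + P2)                      # estimated surface point P
```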
Fig. 1. Configuration of a translating planar pattern. A pinhole camera with camera center O is viewing a specular object S which reflects a nearby planar pattern Π to the image I. Ω is a planar pattern obtained by translating Π by an unknown translation vector T. A and B are points on Π and Ω respectively which are reflected by S at P to the same image point x on I.
3 Recovering the Unknown Translation
Referring to the configuration described in the previous section, Ω and Π are related by a pure translation given by the translation vector T. Consider a point X on Π. Let X′ be a point on Ω such that it has the same relative position on the planar pattern as X (i.e., both X and X′ are encoded by the same gray code). X and X′ are referred to as a pair of corresponding points, and are related by X′ = X + T, where X and X′ denote the position vectors of X and X′, respectively. Now refer back to Fig. 1, and recall that A and B are points on Π and Ω, respectively, which are reflected by S at P to the same image point x on I. Let Q be a point on Π which is encoded by the same gray code as B (i.e., B and Q form a pair of corresponding points). Hence, we have

B = Q + T,   (1)

and the vector AB in the direction of the reflection ray can be written as

AB = AQ + T.   (2)
Consider the plane Γ formed by the visual ray and the reflection ray at P , and let N be its unit normal vector. It is obvious that AB and N are mutually orthogonal and therefore AB · N = 0. (3)
Substituting (2) into (3) gives

(AQ + T) · N = 0.   (4)
Note that Q can be inferred from the gray code of B, and the vector AQ can be constructed from the 3D positions of A and Q on the calibrated planar pattern Π. The unit normal vector N can be computed as the vector cross product of Ox and OA. Hence, the only unknown in (4) is the translation vector T, which has three degrees of freedom. Since a pair of reflections observed at one pixel provides one linear constraint on T, T can therefore be estimated using a linear least squares method from the reflections observed at three or more pixels:

\mathbf{T} = -\begin{pmatrix} \mathbf{N}_1^{T} \\ \mathbf{N}_2^{T} \\ \vdots \\ \mathbf{N}_n^{T} \end{pmatrix}^{-1} \begin{pmatrix} \overrightarrow{A_1Q_1} \cdot \mathbf{N}_1 \\ \overrightarrow{A_2Q_2} \cdot \mathbf{N}_2 \\ \vdots \\ \overrightarrow{A_nQ_n} \cdot \mathbf{N}_n \end{pmatrix} \qquad (5)
If n > 3, the matrix composed of the normal vectors will be a rectangular matrix, and the inverse symbol in (5) denotes the Moore-Penrose pseudo-inverse. The pseudo-inverse gives the solution in a least-squares sense. However, when the specular object is a plane or a surface of revolution, there exist some degenerate cases, which are described below in Proposition 1 and Proposition 2.

Proposition 1. If the specular object is a plane, the translation vector T cannot be determined by using the proposed theory.

Proof. Let S be a specular plane. Without loss of generality, let P be an arbitrary point on S and NP be the surface normal at P. Consider now the visual ray OP. By the law of reflection, the visual ray OP, the surface normal NP and the reflection ray must all lie on the same plane. Let ΓP denote such a reflection plane. Note that all points on S have the same surface normal NP. It follows that all the reflection planes will intersect along a line passing through the camera center O with the same orientation as NP, and hence their normal vectors are all orthogonal to NP. This makes the matrix composed of the normal vectors in (5) have a rank of at most 2, and therefore T cannot be recovered using (5).

Proposition 2. If the specular object is a surface of revolution and its revolution axis passes through the camera center, the translation vector T cannot be determined by using the proposed theory.

Proof. Let S be a specular surface of revolution with its revolution axis l passing through the camera center O. Without loss of generality, let P be an arbitrary point on S and NP be the surface normal at P. Consider now the visual ray OP. By the law of reflection, the visual ray OP, the surface normal NP and the reflection ray must all lie on the same plane. Let ΓP denote such a reflection plane. By the properties of a surface of revolution, NP lies on the plane defined
by P and l. Since l passes through O, it is easy to see that l also lies on ΓP. It follows that all the reflection planes will intersect at l and hence their normal vectors are all orthogonal to l. This makes the matrix composed of the normal vectors in (5) have a rank of at most 2, and therefore T cannot be recovered using (5). Since a sphere is a special kind of surface of revolution with any line passing through its center being a revolution axis, Corollary 1 follows immediately.

Corollary 1. A single sphere cannot be recovered by using the proposed theory.
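In practice, the linear system behind (5) can be solved with a standard least-squares routine, and the degeneracies of Propositions 1 and 2 show up as a rank deficiency of the stacked normal matrix. The following sketch is our illustration (not the authors' code); it assumes the unit normals N_i and the vectors A_iQ_i have already been computed from the calibrated pattern and the observed gray codes.

```python
import numpy as np

def estimate_translation(normals, aq_vectors, rank_tol=1e-6):
    """Least-squares estimate of the pattern translation T (Eqn. (5)).

    normals    : (n, 3) array; row i is the unit normal N_i of the i-th reflection plane.
    aq_vectors : (n, 3) array; row i is the vector A_iQ_i on the calibrated pattern.
    """
    N = np.asarray(normals, dtype=float)
    AQ = np.asarray(aq_vectors, dtype=float)
    b = -np.einsum('ij,ij->i', AQ, N)   # right-hand side: -(A_iQ_i . N_i)

    # Degenerate shapes (a planar mirror, or a surface of revolution whose axis
    # passes through the camera centre) make rank(N) <= 2, so T is not recoverable.
    if np.linalg.matrix_rank(N, tol=rank_tol) < 3:
        raise ValueError("degenerate configuration: reflection-plane normals do not span 3D")

    # Moore-Penrose / least-squares solution of N T = b.
    T, *_ = np.linalg.lstsq(N, b, rcond=None)
    return T
```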
4 Specular Surface Recovery
With the estimated translation vector T, the position of the reference planar pattern after translation can be determined. If accurate correspondences between points on the reference planar pattern and pixels on the image can be obtained, the surface of the specular object can be recovered by triangulating the corresponding visual ray and reflection ray at each pixel (see Section 2). However, most of the encoding strategies cannot achieve real point-to-point correspondences. Usually, one pixel corresponds to one encoded area (e.g., a square on the reference planar pattern). In [16], a dense matching of the pixels on the image and positions on the reference plane was implemented by minimizing an energy function which penalized the mis-encoding problem, and bilinear interpolation was used to achieve a sub-pixel accuracy in the matching. Bilinear interpolation, however, is not a good approximation, as it is well known that ratios of lengths are not preserved under perspective projection. Besides, the reflection of the pattern by a non-planar specular surface would also introduce distortion in the image of the pattern, and such a distortion depends on both the shape and distances of the object relative to the camera and the planar pattern. In order to obtain good shape recovery results, the resolution of the encoding pattern and the positions of the plane need to be chosen carefully. In this paper, the specular surface is assumed to be smooth and locally planar. After associating a gray code (and hence an encoded area on the reference plane) to each pixel, the algebraic center ci of the pixels sharing the same gray code is computed and associated with the center Ci of the corresponding encoded area on the reference plane. This is the approximation to the point-to-point correspondences in the underlying encoding strategy. This approximation has been verified with experiments on synthetic data, and experimental results show that the error distances introduced by approximating the ground truth reflection points on the plane with the centers of the encoded areas are negligible. Nevertheless, this approximation can only be applied to obtain a matching between the image pixels and points on one reference plane, as the algebraic centers ci computed for one reference plane position will not in general coincide with those computed for a different reference plane position. Instead of trying to locate a point match on the translated reference plane Ω for each ci, the visual ray for ci is projected onto the translated reference plane Ω using its corresponding point Ci on the initial reference plane Π as the projection center (see Fig. 2). The projection of
Fig. 2. Shape recovery of specular object. The visual ray at ci is projected onto the translated reference plane Ω using its corresponding point Ci on the initial reference plane Π as the projection center. The projection of the visual ray on Ω will result in a line which will intersect the corresponding encoded area as observed at ci . This intersection line segment can be back projected onto the visual ray and this gives a depth range in which the surface point should exist.
the visual ray on Ω will result in a line which will intersect the corresponding encoded area (which is a square) as observed at ci . This intersection line segment can be back projected onto the visual ray and this gives a depth range in which the surface point should exist. If Ω is far from Π, the depth range will be small and the midpoint can be taken as the estimated surface depth. Alternatively, the depth range can be made smaller by increasing the resolution of the encoding pattern (i.e., decreasing the size of the encoded area). This, however, is not a practical strategy since the encoding pattern will be blurred in the imaging process and may result in an aliasing effect.
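To make the projection step concrete, the sketch below (our illustration, not the authors' implementation) approximates the depth range for one pixel by sampling candidate depths along the visual ray, projecting each candidate point onto the translated plane Ω from the projection center Ci, and keeping the depths whose projections fall inside the observed encoded square. The paper computes the intersection segment geometrically; sampling is used here only for simplicity, and all parameter names are ours.

```python
import numpy as np

def depth_range(O, d, Ci, plane_point, plane_normal,
                square_center, square_axes, half_size, t_samples):
    """Approximate depth range along the visual ray O + t*d for one pixel.

    Ci            : matched point on the initial pattern, used as projection centre.
    plane_point, plane_normal : a point on the translated pattern plane and its normal.
    square_center : centre of the encoded square on the translated pattern.
    square_axes   : (2, 3) orthonormal in-plane axes of the square.
    half_size     : half the side length of the encoded square.
    t_samples     : 1-D array of candidate depths along the visual ray.
    Returns (t_min, t_max) of the depths whose projection hits the square, or None.
    """
    hits = []
    for t in t_samples:
        P = O + t * d                         # point on the visual ray
        v = P - Ci                            # projection direction from centre Ci
        denom = plane_normal @ v
        if abs(denom) < 1e-12:
            continue                          # projection direction parallel to the plane
        s = plane_normal @ (plane_point - Ci) / denom
        if s <= 0:
            continue                          # projection falls behind the centre
        q = Ci + s * v                        # projection of P onto the translated plane
        local = square_axes @ (q - square_center)
        if np.all(np.abs(local) <= half_size):
            hits.append(t)
    if not hits:
        return None
    return min(hits), max(hits)
```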
5 Experimental Analysis
The proposed approach was tested using both synthetic and real image data. The synthetic images were generated using the POV-Ray software. For the real experiments, images were taken using a simple setup composed of a monitor, a camera and a drawer. The monitor was used for displaying gray-coded patterns. The camera used was a Canon 450D with a focal length of 50mm. The drawer was used for carrying the monitor and making it move in pure translations.
5.1 Synthetic Experiments
In the synthetic experiments, the scene was composed of two spheres or two planes with different positions and orientations. Note that the cases with multiple
Fig. 3. Shape recovery result for a sphere. Left column: Ground truth observed in two different views. Right column: Recovered shape in two different views.

Fig. 4. Relationship between depth recovery error and translation scale. The scene is composed of two spheres which have the same radius of 19mm, and the translation scale ranges from 1mm to 150mm. 12880 pixels are used for shape recovery. Y-axis shows the average depth recovery error for all pixels, and X-axis represents the translation scale.
spheres or multiple planes are not degenerate, and the shape of the objects can be recovered using the proposed method. In order to analyze the effect of the translation scale on the result, the reference plane was translated with multiple scales. The reconstruction for a part of a sphere is shown in Fig. 3, which also gives a comparison with the ground truth data. Fig. 4 shows the relation between the translation scale of the reference plane and the shape recovery error. It can
Fig. 5. Shape recovery result for a plane. Left column: Ground truth observed in two different views. Right column: Recovered shape in two different views.
Fig. 6. Setting for the real experiment. Left Column: Setting used in the real experiment. Right Column: sample pattern images used for encoding. The proposed method is used to recover the area reflecting the reference plane on the specular object.
be seen that, within the visible reflection range for the specular object, larger translation scale would result in smaller reconstruction error. After experimenting with a range of translation scales, 80mm is chosen as the translation scale for the second position of the reference plane. The reconstruction error is smaller than 0.5%. The reconstruction for the plane is shown in Fig.5, which again also gives a comparison with the ground truth data. 70mm is chosen as the translation scale and the average reconstruction error is less than 0.45%.
Fig. 7. Shape recovery result for a spoon. The recovered shape is the area with reflections on the spoon.
5.2 Real Experiments
The real experiment was conducted to recover the shape of a small spoon. The experimental setup was quite simple. The monitor was carried by a drawer and translated on the table (see Fig. 6). The monitor displayed gray-coded pattern images, and a set of images was taken as the monitor was translated by different amounts. A traditional structured light method [18] was used to get the correspondences between a coded area on the reference plane and a pixel on the image. The first reference planar pattern is initially calibrated. The reference plane can then be translated along a certain direction by different amounts, and the reconstruction results can be obtained by choosing the best motion scale or by combining the information from multiple motions to get a better result. Although the spoon is small and it only reflects a small area on the monitor, the shape can be well recovered (see Fig. 7).
6 Conclusions
In this paper, we have proposed a novel solution for specular surface recovery based on observing the reflections of a translating planar pattern. Unlike previous methods which assume a fully calibrated environment, only one reference planar pattern is assumed to have been calibrated against a fixed camera observing the specular surface, and this pattern is allowed to undergo an unknown pure translation. The motion of the pattern as well as the shape of the specular object can then be estimated from the reflections of the translating pattern. The main contributions of this paper are 1) A closed form solution for recovering the motion of the translating pattern. This allows an easy calibration of the translating pattern, and data redundancy resulting from the translating pattern (i.e., reflections of multiple planar patterns at each pixel location) can improve both the robustness and accuracy of the shape estimation. 2) A novel method for shape recovery based on computing the projections of the visual rays on the translating pattern. This produces a depth range for each
pixel in which the true surface point exists. This in turn provides a measure of the accuracy and confidence of the estimation. Experimental results on both synthetic and real data are presented, which demonstrate the effectiveness of our proposed approach. As for the future work, we are now trying to solve the more challenging problems in which 1) the initial reference planar pattern is also uncalibrated and 2) the motion of the reference planar pattern is not limited to pure translation.
References 1. Tomasi, C.: Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision 9, 137–154 (1992) 2. Wong, K.Y.K., Cipolla, R.: Structure and motion from silhouettes. In: ICCV, pp. 217–222 (2001) 3. Woodham, R.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19, 139–144 (1980) 4. Zisserman, A., Giblin, P., Blake, A.: The information available to a moving observer from specularities. Image and Vision Computing 7, 38–42 (1989) 5. Fleming, R.W., Torralba, A., Adelson, E.H.: Specular reflections and the perception of shape. Journal of Vision 4, 798–820 (2004) 6. Oren, M., Nayar, S.K.: A theory of specular surface geometry. International Journal of Computer Vision 24, 105–124 (1997) 7. Roth, S., Black, M.J.: Specular flow and the recovery of surface structure. In: CVPR, pp. 1869–1876 (2006) 8. Adato, Y., Vasilyev, Y., Ben-Shahar, O., Zickler, T.: Toward a theory of shape from specular flow. In: ICCV, pp. 1–8 (2007) 9. Canas, G.D., Vasilyev, Y., Adato, Y., Zickler, T., Gortler, S., Ben-Shahar, O.: A linear formulation of shape from specular flow. In: ICCV, pp. 1–8 (2009) 10. Savarese, S., Perona, P.: Local analysis for 3d reconstruction of specular surfaces. In: CVPR, pp. 738–745 (2001) 11. Savarese, S., Perona, P.: Local analysis for 3d reconstruction of specular surfaces - part ii. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 759–774. Springer, Heidelberg (2002) 12. Bonfort, T., Sturm, P.: Voxel carving for specular surfaces. In: ICCV, pp. 691–696 (2003) 13. Seitz, S., Matsushita, Y., Kutulakos, K.: A theory of inverse light transport. In: ICCV, pp. 1440–1447 (2005) 14. Yamazaki, M., Iwata, S., Xu, G.: Dense 3D reconstruction of specular and transparent objects using stereo cameras and phase-shift method. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 570–579. Springer, Heidelberg (2007) 15. Nehab, D., Weyrich, T., Rusinkiewicz, S.: Dense 3d reconstruction from specularity consistency. In: CVPR, pp. 1–8 (2008) 16. Bonfort, T., Sturm, P., Gargallo, P.: General specular surface triangulation. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 872–881. Springer, Heidelberg (2006) 17. Rozenfeld, S., Shimshoni, I., Lindenbaum, M.: Dense mirroring surface recovery from 1d homographies and sparse correspondences. In: CVPR, pp. 1–8 (2007) 18. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: CVPR, pp. 195–202 (2003)
Medical Image Segmentation Based on Novel Local Order Energy
LingFeng Wang1, Zeyun Yu2, and ChunHong Pan1
1 NLPR, Institute of Automation, Chinese Academy of Sciences {lfWang,chpan}@nlpr.ia.ac.cn
2 Department of Computer Science, University of Wisconsin-Milwaukee [email protected]
Abstract. Image segmentation plays an important role in many medical imaging systems, yet in complex circumstances it is still a challenging problem. Among the many difficulties, the problem caused by image intensity inhomogeneity is the key one. In this work, we develop a novel local-homogeneous region-based level set segmentation method to tackle this problem. First, we propose a novel local order energy, which encodes a local intensity constraint. We then integrate this energy into the objective energy function and minimize the energy function via a level set evolution process. Extensive experiments are performed to evaluate the proposed approach, showing significant improvements in both accuracy and efficiency compared to the state of the art.
1 Introduction
The level set model has been extensively applied to medical image segmentation, because it holds several desirable advantages over traditional image segmentation methods. First, it can achieve sub-pixel segmentation accuracy of object boundaries. Second, it can change the topology of the contour automatically. Existing level set methods can be mainly classified into two groups, i.e. the edge-based [1–3] methods and the region-based [4–14] methods. Edge-based methods utilize local edge information to attract the active contour toward the object boundaries. These methods have two inevitable limitations [4], namely dependence on the initial contour and sensitivity to noise. Region-based methods, on the other hand, are introduced to overcome these two limitations. These methods identify each region of interest by a certain region descriptor, such as intensity, to guide the evolution of the active contour. Region-based methods tend to rely on the assumption of homogeneity of the region descriptor. In this work, we improve the traditional region-based method to tackle segmentation difficulties such as the inhomogeneity of the image intensity. There are mainly two types of region-based segmentation methods, i.e. the global-homogeneous [4–6] and the local-homogeneous [7–14]. One of the most important global-homogeneous methods is proposed by Chan et al. [4]. This method
first assumes that image intensities are statistically homogeneous in each segmented region. Then, it utilizes the global fitting energy to represent this homogeneity. Improved performance of the global-homogeneous methods has been reported in [4–6]. However, these methods often fail to segment out the details of the object. As shown in Fig.1, parts of the brain tissues are not segmented out. Local-homogeneous methods are proposed to handle intensity inhomogeneity. These methods assume that the intensities in a relatively small local region are separable. Li et al. [7, 10] introduced a local fitting energy based on a kernel function, which is used to extract the local image information. Motivated by the local fitting energy proposed in [10], the local Gaussian distribution fitting energy is presented in [14]. An et al. [8] offered a variational region-based algorithm, which combines the Γ-Convergence approximation with a new multi-scale piecewise smooth model. Unfortunately, the local-homogeneous methods have two inevitable drawbacks. First, these methods are sensitive to the initial contour. Second, the segmentation results have obvious errors. As shown in Fig.1, although the two initial contours are very similar, the segmentation results of [10] are quite different. Furthermore, both results have obvious segmentation errors. We use the blue ellipses to indicate these errors. Motivated by the previous works [4, 7, 8, 10, 15], the local intensity constraint is adopted to realize our local-homogeneous region-based segmentation method. The purposes of using the local intensity constraint are twofold. First, the local characteristic softens the global constraint of global-homogeneous methods. Second, using an additional constraint can alleviate the instability of local-homogeneous methods. In this work, we first develop a local order energy as a local intensity constraint. The proposed local order energy assumes that the intensities in a relatively small local region have an order. For example, the intensities in the target are larger than those in the non-target. Generally, the local order assumption is easily satisfied in medical images. Then, we integrate the local order energy into the objective energy function, and minimize the energy function via a level set evolution process. Experimental results show the proposed method has many advantages and improvements. First, compared to the global-homogeneous methods, the detail information is well segmented. Second, the two main drawbacks of the traditional local-homogeneous methods are well solved. Third, the proposed method is also insensitive to noise. The remainder of the paper is organized as follows: Section 2 describes the proposed segmentation method in detail; Section 3 presents some experiments and results; concluding remarks are given in Section 4.
2 The Proposed Method
In this section, we first briefly introduce the two traditional types of segmentation methods, i.e. the global-homogeneous methods and the local-homogeneous methods. Then, we describe the proposed local order energy, which encodes the local intensity constraint, in detail. Finally, we use a level set evolution process to minimize the energy function, and present our algorithm in Alg.1.
Fig. 1. A comparison of the proposed method to the traditional methods, i.e. Chan et al. [4] (a global-homogeneous method) and Li et al. [10] (a local-homogeneous method). Note that two different yet very similar initial contours are used to test the initialization sensitivity of the three algorithms.
2.1 Global-Homogenous and Local-Homogenous Methods
Let Ω ⊂ R² be the image domain, and I : Ω → R be the given gray-level image. Global-homogeneous methods formulate the image segmentation problem as finding a contour C. The contour segments the image into two non-overlapping regions, i.e. the region inside the contour, Cin (the target region), and the region outside the contour, Cout (the non-target region). By specifying two constants cin and cout, the fitting energy (global fitting energy) is defined as follows,

E^{global}(C, c_{in}, c_{out}) = \lambda_{in}\int_{C_{in}} |I(x)-c_{in}|^2\,dx + \lambda_{out}\int_{C_{out}} |I(x)-c_{out}|^2\,dx + \mu|C| \qquad (1)

where |C| is the length of the contour C, and λin, λout and μ are positive constants. As expressed in Eqn.1, the two specified constants cin and cout globally govern all the pixel intensities in Cin and Cout, respectively. Thus, Chan's method is denoted as a global-homogeneous method. The two constants cin and cout can further be interpreted as two cluster centers, i.e. the target cluster cin and the non-target cluster cout. Motivated by Chan's work, Li et al. [10] proposed a local fitting energy based on a kernel function, given by

E^{local}\big(C, c_{in}(x), c_{out}(x)\big) = \int E^{fit}\big(C, c_{in}(x), c_{out}(x)\big)\,dx + \mu|C| \qquad (2)

yielding,

E^{fit}\big(C, c_{in}(x), c_{out}(x)\big) = \lambda_{in}\int_{C_{in}} K_\sigma(x-y)\,|I(y)-c_{in}(x)|^2\,dy + \lambda_{out}\int_{C_{out}} K_\sigma(x-y)\,|I(y)-c_{out}(x)|^2\,dy \qquad (3)
where K(·) is the Gaussian kernel function

K_\sigma(x) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\, e^{-\|x\|^2/2\sigma^2},

parameterized with the scale parameter σ. More comprehensive interpretations of the local fitting energy are introduced in [10] in detail. The fitting values cin(x) and cout(x) locally govern the pixel intensities centered at position x, within a neighborhood whose size is controlled by σ. Thus, Eqn.2 is named the local fitting energy, and Li's method is denoted as a local-homogeneous method. Similarly, cin(x) and cout(x) can be interpreted as two local cluster centers at position x. The main difference between the above two methods is how to choose the cluster centers. The global-homogeneous method simply utilizes two constants cin and cout as the cluster centers, and formulates them in the global fitting energy. In contrast, the local-homogeneous method adopts local cluster centers, and formulates them in the local fitting energy.
2.2 Local Order Energy
Generally, medical images suffer from intensity inhomogeneity. Thus, the local fitting energy models the segmentation problem better than the global fitting energy. However, the local fitting energy also has some problems. The most serious one is that the local fitting energy function has many locally optimal solutions. Thus, it is hard to reach the globally optimal solution through an optimization method such as the level set evolution process. Meanwhile, this problem brings two inevitable drawbacks. First, the solution is sensitive to the initialization. Second, the segmentation result has some obvious errors. Actually, the global fitting energy has fewer locally optimal solutions than the local fitting energy. That is, the global-homogeneous method has a more stable solution than the local-homogeneous method. The main reason is that the global fitting energy utilizes a strong intensity constraint, whereas the local fitting energy weakens this intensity constraint by using the local characteristic. Thus, it is reasonable to add other local intensity constraints. Here, we adopt the local order energy as the local intensity constraint. The purposes of using this energy are twofold. First, using local constraints can solve the inhomogeneity problem better than using global constraints, so that the detail information can be well segmented out. Second, adding further constraints can obviously decrease the number of locally optimal solutions, so that the segmentation result is stable and has small errors. In the following, we first give a straightforward example to illustrate the local optimum problem. Then, we present the local order energy, which is an important local intensity constraint. Fig.2 presents a straightforward example illustrating the local optimum problem. We use a simple synthetic image with half bright pixels and half dark pixels as the test image. As shown in Fig.2, the curves Cright (in red) and Cwrong (in blue) represent the right and wrong segmentation results, respectively. When specifying the local cluster centers cin(x) and cout(x), both local fitting energies E^{local}(Cright, cin(x), cout(x)) and E^{local}(Cwrong, cin(x), cout(x)) calculated from Cright and Cwrong can reach local minima. That is, both E^{local}(Cright, cin(x), cout(x)) and E^{local}(Cwrong, cin(x), cout(x)) are locally optimal solutions. As shown in the second row of Fig.2, we assign three boundary
pixels (in the green dashed curve) with cin(x) = 1 and other pixels with cin(x) = 1 and cout(x) = 0. Here, the local fitting energy E^{local}(Cright, cin(x), cout(x)) reaches a local minimum. As shown in the third row of Fig.2, we assign three boundary pixels with cout(x) = 1 and other pixels with cin(x) = 1 and cout(x) = 0. Here, the local fitting energy E^{local}(Cwrong, cin(x), cout(x)) also reaches a local minimum. Note that the curve Cwrong is a wrong segmentation result. We can also enumerate other wrong results, which are similar to Cwrong.
Fig. 2. Overview of the comparison of two segmentation results. 1 and 0 are the intensity values of the image. The first row gives two segmentation curves Cright (in red) and Cwrong (in blue). The second row presents the cin(x) and cout(x) of each pixel when curve is Cright, while the third row presents them when curve is Cwrong. Note that, the symbol x means the random value.
There are two types of values. The first one is the value of the non-target cluster centers cout(x) inside the curve C. The second one is the value of the target cluster centers cin(x) outside the curve C. From Eqn.2 and Eqn.3, we find that these two types of values have no effect when calculating the local fitting energy. That is, we can set them to random values. As shown in Fig.2, we use the symbol x to represent a random value. However, these values have useful meanings in
practice. For example, cin(x) denotes the local target cluster center, and cout(x) denotes the local non-target cluster center. Moreover, these values are used recursively when performing the level set evolution process. The evolution process will be described in Subsection 2.3; a similar process can also be found in detail in [10]. In this work, these values are used to formulate the local order energy, which is a local intensity prior constraint. Actually, in a small local region of a medical image, the intensity values always have an order. For example, the target intensity values are larger than the non-target ones. Thus, the target cluster center value cin(x) is larger than the non-target cluster center value cout(x) at position x. The local order information is represented by the local order energy, given by

E^{order}(C, c_{in}, c_{out}) = \int_{\Omega} \mathrm{sgn}\big(c_{out}(x) - c_{in}(x)\big)\,dx \qquad (4)
where sgn(·) is the sign function,

\mathrm{sgn}(x) = \begin{cases} 1 & x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5)
As shown in Eqn.4, the local order energy uses all cluster centers, which include the above two types of values. Furthermore, the local order energy is an important local intensity constraint, because it requires the local target intensity to be statistically larger than the non-target intensity. Combining it with the local fitting energy defined in Eqn.2, we get the new local energy E:

E = E^{local} + E^{order} \qquad (6)

where E^{local} and E^{order} are defined by Eqn.2 and Eqn.4, respectively. Since the local fitting energy and the local order energy are both defined from the contour C, the final energy E is represented by the contour C too.
2.3 Level Set Formulation and Energy Minimization
A level set formulation is used to minimize the above energy E. In level set methods, a contour C is represented by the zero level set of a Lipschitz function φ : Ω → R, which is called a level set function. Accordingly, the energy of our method can be defined as

F(\phi, c_{in}, c_{out}) = \lambda_{in}\iint K_\sigma(x-y)\,|I(y)-c_{in}(x)|^2\, H(\phi(y))\,dy\,dx + \lambda_{out}\iint K_\sigma(x-y)\,|I(y)-c_{out}(x)|^2\,\big(1-H(\phi(y))\big)\,dy\,dx + \gamma\int_{\Omega}\mathrm{sgn}\big(c_{out}(x)-c_{in}(x)\big)\,dx + \mu\,\mathcal{L}(\phi) + \nu\,\mathcal{P}(\phi) \qquad (7)

where λin, λout, γ, μ, and ν are weighting constants, H(·) is the Heaviside function, \mathcal{L}(\phi) = \int \delta(\phi(x))\,|\nabla\phi(x)|\,dx is the length term, δ(·) is the derivative
function of the Heaviside function, and \mathcal{P}(\phi) = \frac{1}{2}\int \big(|\nabla\phi(x)| - 1\big)^2\,dx. The term P(φ) is the regularization term [3], which serves to maintain the regularity of the level set function. In the implementation, the Heaviside function is approximated by the smooth function

H_\varepsilon(x) = \frac{1}{2}\left(1 + \frac{2}{\pi}\arctan\frac{x}{\varepsilon}\right),

and the corresponding derivative function is defined by

\delta_\varepsilon(x) = H_\varepsilon'(x) = \frac{\varepsilon}{\pi(\varepsilon^2 + x^2)},

with the coefficient set as ε = 1.0. The standard gradient descent method is performed to minimize the energy function F(φ, cin, cout) in two alternating sub-steps. In the first step, we fix the level set function φ and calculate the two cluster centers cin(x) and cout(x). In the second step, we fix the two cluster centers cin(x) and cout(x) and update the level set function φ. A variational method is first applied to calculate cin and cout for a fixed level set function φ, given by

c_{in} = \max(c_{in}, c_{out}), \qquad c_{out} = \min(c_{in}, c_{out}), \qquad (8)

yielding,

c_{in} = \frac{K_\sigma(x) \otimes \big[H(\phi(x))\,I(x)\big]}{K_\sigma(x) \otimes H(\phi(x))}, \qquad c_{out} = \frac{K_\sigma(x) \otimes \big[(1-H(\phi(x)))\,I(x)\big]}{K_\sigma(x) \otimes \big(1-H(\phi(x))\big)} \qquad (9)

where ⊗ is the convolution operation. Then, keeping cin and cout fixed, the energy functional is minimized with respect to the level set function φ by using standard gradient descent, given by

\frac{\partial\phi}{\partial t} = -\,\delta_\varepsilon(\phi)\big(\lambda_{in} T_{in} - \lambda_{out} T_{out}\big) + \mu\,\delta_\varepsilon(\phi)\,\mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) + \nu\left(\nabla^2\phi - \mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right)\right) \qquad (10)

where Tin and Tout are defined as follows,

T_{in}(x) = \int K_\sigma(y-x)\,|I(y)-c_{in}(x)|^2\,dy, \qquad T_{out}(x) = \int K_\sigma(y-x)\,|I(y)-c_{out}(x)|^2\,dy \qquad (11)
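As a concrete illustration of the first sub-step, the sketch below computes the local fitting values of Eqn. (9) with Gaussian convolutions and then enforces the local order of Eqn. (8) pointwise. This is our reading of the (partly garbled) source equations, not the authors' code; the function name and the small stabilizing constant are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_cluster_centers(I, phi, sigma, eps=1.0, tiny=1e-8):
    """Local fitting values c_in(x), c_out(x) of Eqns. (8)-(9).

    I     : 2-D gray-level image (float array)
    phi   : level set function of the same shape
    sigma : scale of the Gaussian kernel K_sigma
    """
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))   # smooth Heaviside H_eps
    c_in = gaussian_filter(H * I, sigma) / (gaussian_filter(H, sigma) + tiny)
    c_out = gaussian_filter((1.0 - H) * I, sigma) / (gaussian_filter(1.0 - H, sigma) + tiny)
    # Local order constraint of Eqn. (8): the target centre is the larger value.
    return np.maximum(c_in, c_out), np.minimum(c_in, c_out)
```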
More details of the implementation of the proposed local intensity prior level set method are given in Alg.1.
3 Experiment Results
A number of real medical images are used to evaluate the performance of the proposed scheme by comparing it with two traditional methods, i.e. Chan et al. [4] and Li et al. [10], which are representative of global-homogeneous and local-homogeneous methods, respectively. In the implementation, the coefficients are
Algorithm 1. Local Intensity Prior Level Set Segmentation
Data: initial level set φinitial, input image I, max iteration maxiter, and weighting constants
Result: level set φ and segmentation result
1: Initialize the output level set function φ1 = φinitial
2: for i ← 1 to maxiter do
3:   Calculate cin and cout based on φi by Eqn.8 and Eqn.9
4:   Calculate the level set φi+1 based on cin and cout by Eqn.10 and Eqn.11
5:   if φi equals φi+1 then
6:     break the iteration
7:   end
8: end
9: Save the level set φ, and segment the input image based on φ
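A rough NumPy sketch of the evolution loop of Alg. 1 is given below; it is our illustration under simplifying assumptions (a fixed number of iterations instead of the convergence test, finite differences for the curvature and Laplacian terms), and it reuses the local_cluster_centers helper sketched after Eqn. (9).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def curvature(phi, tiny=1e-8):
    """div(grad(phi)/|grad(phi)|) with central differences."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny
    return np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

def evolve(I, phi, sigma=3.0, lam_in=1.0, lam_out=1.0, mu=None, nu=1.0,
           eps=1.0, dt=0.1, max_iter=200):
    """Simplified gradient-descent evolution of Eqn. (10) / Alg. 1."""
    I = I.astype(float)
    h, w = I.shape
    mu = 0.004 * h * w if mu is None else mu        # weight used in the experiments
    KI = gaussian_filter(I, sigma)                  # K_sigma * I
    KI2 = gaussian_filter(I ** 2, sigma)            # K_sigma * I^2
    for _ in range(max_iter):
        # Sub-step 1: local cluster centres (helper sketched after Eqn. (9)).
        c_in, c_out = local_cluster_centers(I, phi, sigma, eps)
        # Eqn. (11), expanded using the normalised Gaussian kernel.
        T_in = KI2 - 2.0 * c_in * KI + c_in ** 2
        T_out = KI2 - 2.0 * c_out * KI + c_out ** 2
        # Sub-step 2: gradient-descent update of Eqn. (10).
        delta = eps / (np.pi * (eps ** 2 + phi ** 2))   # delta_eps(phi)
        kappa = curvature(phi)
        lap = (np.gradient(np.gradient(phi, axis=0), axis=0)
               + np.gradient(np.gradient(phi, axis=1), axis=1))
        dphi = (-delta * (lam_in * T_in - lam_out * T_out)
                + mu * delta * kappa + nu * (lap - kappa))
        phi = phi + dt * dphi
    return phi
```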
Fig. 3. The comparison of our method to Chan [4] and Li [10] on CT image (columns: Initializations, Chan's Method, Li's Method, Proposed Method)
empirically set as λin = 1, λout = 1, ν = 1, μ = 0.004 · h · w, where h and w are the height and width of the test image. All the medical images are downloaded from the two websites listed below. The visual comparison is first shown in Fig.1, and more comparisons are illustrated in Fig.3, Fig.4, and Fig.5, respectively. We use three different image acquisition techniques, i.e. CT, MR, and Ultrasound, to evaluate our approach. As illustrated in these figures, although the images are rather noisy and parts of the boundaries are weak, the details are still segmented very well compared to Chan's [4] and Li's [10] methods. Furthermore, the proposed approach is not sensitive to the initialization compared to Li's [10] method.
1 http://www.bic.mni.mcgill.ca/brainweb/
2 http://www.ece.ncsu.edu/imaging/Archives/ImageDataBase/Medical/index.html
Fig. 4. The comparison of our method to Chan [4] and Li [10] on Ultrasound image (columns: Initializations, Chan's Method, Li's Method, Proposed Method)
Fig. 5. The comparison of our method to Chan [4] and Li [10] on MR image (columns: Initializations, Chan's Method, Li's Method, Proposed Method)
The numeric comparisons are presented in Fig.6 and Fig.7. The segmentation results are compared to the ground truth by using the following two error measures: false negative (FN), which counts the number of target pixels that are missed, and false positive (FP), which counts the number of non-target pixels that are marked as target. Denote by R^{our}_{target} the target region produced by our method; similarly, R^{chan}_{target}, R^{li}_{target}, and R^{gt}_{target} denote the target regions of Chan's method, Li's method, and the ground truth. Thus, the FN and FP are defined as follows:

\mathrm{FP} = \frac{\big|R^{*}_{target} - R^{*}_{target} \cap R^{gt}_{target}\big|}{\big|R^{gt}_{target}\big|}, \qquad \mathrm{FN} = \frac{\big|R^{gt}_{target} - R^{*}_{target} \cap R^{gt}_{target}\big|}{\big|R^{gt}_{target}\big|} \qquad (12)
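For reference, Eqn. (12) amounts to simple set operations on binary masks; a minimal sketch (our illustration, names ours):

```python
import numpy as np

def fn_fp(result_mask, gt_mask):
    """False negative and false positive rates of Eqn. (12) for boolean target masks."""
    gt_area = float(gt_mask.sum())
    overlap = float(np.logical_and(result_mask, gt_mask).sum())
    fn = (gt_area - overlap) / gt_area                    # missed target pixels
    fp = (float(result_mask.sum()) - overlap) / gt_area   # non-target pixels marked as target
    return fn, fp
```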
Fig. 6. An overview of the comparisons to Chan's [4] and Li's [10] by using two error measures, i.e. the false negative (FN) and false positive (FP)

          Chan                                       Li                                         Our
          FN Mean  FN Cov  FP Mean  FP Cov  Time(s)  FN Mean  FN Cov  FP Mean  FP Cov  Time(s)  FN Mean  FN Cov  FP Mean  FP Cov  Time(s)
Scene 1   0.131    0.001   0.018    0.000   3.293    0.142    0.125   0.163    0.151   12.49    0.026    0.000   0.019    0.000   12.91
Scene 2   0.144    0.001   0.021    0.001   3.395    0.152    0.153   0.102    0.173   13.16    0.017    0.000   0.015    0.000   13.46
Scene 3   0.123    0.000   0.012    0.000   3.351    0.178    0.203   0.128    0.145   13.04    0.020    0.000   0.017    0.000   13.12
Scene 4   0.126    0.000   0.014    0.001   3.362    0.101    0.146   0.137    0.126   13.07    0.018    0.000   0.016    0.000   13.22
Scene 5   0.132    0.001   0.016    0.000   3.325    0.122    0.116   0.124    0.136   12.86    0.011    0.000   0.014    0.000   13.10
Scene 6   0.126    0.000   0.014    0.001   3.338    0.186    0.107   0.149    0.127   12.81    0.019    0.000   0.021    0.000   13.09
Fig. 7. A comparison to Chan's [4] and Li's [10] when adding noise to the images. The horizontal axis shows the power of noise, while the vertical axis shows two error measures, i.e. the false negative (FN) and false positive (FP).
where ∗ ∈ {chan, li, our}. In Fig.6, six different scenes are chosen, and for each scene, 100 experiments are performed with different initial contours. As shown in the figure, both the means and variances of the above two error measures are smaller than those of the traditional methods. That is to say, the proposed method has better accuracy (smaller error mean) and stability (smaller error variance) than the traditional ones. Furthermore, the computational cost of our approach is similar to that of Li's approach. Fig.7 shows the comparisons when adding noise to the images. As illustrated in the figure, the error of our method increases more slowly than that of the traditional methods as the power of the noise increases. That is, the proposed method is less sensitive to noise. We also apply Chan's, Li's, and our methods to medical image reconstruction. The visual result of our method is shown in Fig.8. As shown in this figure, the brain tissues are preserved well. The main reason is that our segmentation method gives an excellent segmentation result. The numerical comparison of reconstruction
Fig. 8. Medical image reconstruction result by our method (panels show the scene and the 3D reconstruction results for frames 1–20, 21–40, 41–60, 61–80, and 81–100, together with frames 1, 30, and 100)

Table 1. The comparison of reconstruction errors
Algorithm   Chan's [4]   Li's [10]   Ours
FN          0.161        0.143       0.019
FP          0.016        0.126       0.017
error is shown in Tab.1. The error is defined as the mean of all slices' FN and FP, given by

\overline{\mathrm{FN}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{FN}_i, \qquad \overline{\mathrm{FP}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{FP}_i \qquad (13)

where N is the slice number, and FN_i, FP_i are the ith slice's FN and FP (defined in Eqn.12). As shown in this table, our two errors are obviously smaller than Chan's and Li's.
4 Conclusions and Future Works
In this work, we propose a novel region-based level set segmentation method. The main contribution of this work is the local order energy, which is used to represent the local intensity prior constraint. The proposed algorithm has been shown in a number of experiments to address the medical image segmentation problem both efficiently and effectively. Meanwhile, compared to the traditional methods, the proposed method has many advantages and improvements: 1. the detail information is well segmented; 2. the main drawbacks of the traditional local-homogeneous methods are well solved; 3. the proposed method is insensitive to noise. Future work will add other local intensity constraints to improve the segmentation algorithm, or add shape information
to enhance the segmentation results. Moreover, the proposed algorithm can not only be used in medical image segmentation, but can also be applied to other types of images. Acknowledgement. This work was supported by the National Natural Science Foundation of China (Grant No. 60873161 and Grant No. 60975037).
References 1. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. IJCV 22, 61–79 (1997) 2. Yezzi, A., Kichenassamy, S., Kumar, A., Olver, P., Tannenbaum, A.: A geometric snake model for segmentation of medical imagery. IEEE Trans. on MI 16, 199–209 (1997) 3. Li, C., Xu, C., Gui, C., Fox, M.D.: Level set evolution without re-initialization: A new variational formulation. In: CVPR, pp. 430–436 (2005) 4. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on IP 10, 266–277 (2001) 5. Tsai, A., Anthony Yezzi, J., Willsky, A.S.: Curve evolution implementation of the mumfordcshah functional for image segmentation, denoising, interpolation, and magnification. IEEE Trans. on IP 10, 1169–1186 (2001) 6. Vese, L.A., Chan, T.F.: A multiphase level set framework for image segmentation using the mumford and shah model. IJCV 50, 271–293 (2002) 7. Li, C., Kao, C.Y., Gore, J.C., Ding, Z.: Implicit active contours driven by local binary fitting energy. In: CVPR, pp. 1–7 (2007) 8. An, J., Rousson, M., Xu, C.: γ-convergence approximation to piecewise smooth medical image segmentation. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 495–502. Springer, Heidelberg (2007) 9. Piovano, J., Rousson, M., Papadopoulo, T.: Efficient segmentation of piecewise smooth images. In: SSVMCV, pp. 709–720 (2007) 10. Li, C., Kao, C.Y., Gore, J.C., Ding, Z.: Minimization of region-scalable fitting energy for image segmentation. IEEE Trans. on IP 17, 1940–1949 (2008) 11. Li, C., Huang, R., Ding, Z., Gatenby, C., Metaxas, D., Gore, J.: A variational level set approach to segmentation and bias correction of images with intensity inhomogeneity. In: Metaxas, D., Axel, L., Fichtinger, G., Sz´ekely, G. (eds.) MICCAI 2008, Part II. LNCS, vol. 5242, pp. 1083–1091. Springer, Heidelberg (2008) 12. Li, C., Gatenby, C., Wang, L., Gore, J.C.: A robust parametric method for bias field estimation and segmentation of mr images. In: CVPR (2009) 13. Li, C., Li, F., Kao, C.Y., Xu, C.: Image segmentation with simultaneous illumination and reflectance estimation: An energy minimization approach. In: ICCV (2009) 14. Wang, L., Macione, J., Sun, Q., Xia, D., Li, C.: Level set segmentation based on local gaussian distribution fitting. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5994, pp. 293–302. Springer, Heidelberg (2010) 15. Xiang, S., Nie, F., Zhang, C., Zhang, C.: Interactive natural image segmentation via spline regression. IEEE Trans. on IP 18, 1623–1632 (2009)
Geometries on Spaces of Treelike Shapes
Aasa Feragen1, Francois Lauze1, Pechin Lo1, Marleen de Bruijne1,2, and Mads Nielsen1
1 eScience Center, Dept. of Computer Science, University of Copenhagen, Denmark
2 Biomedical Imaging Group Rotterdam, Depts of Radiology & Medical Informatics, Erasmus MC – University Medical Center Rotterdam, The Netherlands
{aasa,francois,pechin,marleen,madsn}@diku.dk
Abstract. In order to develop statistical methods for shapes with a tree-structure, we construct a shape space framework for treelike shapes and study metrics on the shape space. The shape space has singularities, which correspond to topological transitions in the represented trees. We study two closely related metrics, TED and QED. The QED is a quotient euclidean distance arising from the new shape space formulation, while TED is essentially the classical tree edit distance. Using Gromov’s metric geometry we gain new insight into the geometries defined by TED and QED. In particular, we show that the new metric QED has nice geometric properties which facilitate statistical analysis, such as existence and local uniqueness of geodesics and averages. TED, on the other hand, has algorithmic advantages, while it does not share the geometric strongpoints of QED. We provide a theoretical framework as well as computational results such as matching of airway trees from pulmonary CT scans and geodesics between synthetic data trees illustrating the dynamic and geometric properties of the QED metric.
1 Introduction
Trees are fundamental structures in nature, where they describe delivery systems for water, blood and other fluids, or different types of skeletal structures. Trees and graphs also track evolution, for instance in genetics [3]. In medical image analysis, treelike shapes appear when studying structures such as vessels [4, 5] or airways [6–8], see Fig. 1. In more general imaging problems, they appear in the form of shock graphs [9] describing general shapes.
Fig. 1. Treelike structures found in airways and vessels in lungs [1], in breast cancer vascularization, and in retinal blood vessels [2]
An efficient model for tree-shape would have applications such as shape recognition and classification. In medical imaging, this could give new insight into anatomical structure and biomarkers for diseases that affect shape – e.g. COPD affects the edge shapes and topological structures of airway trees [10]. We want to do statistical analysis on sets of treelike shapes, and in order to meaningfully define averages and modes of variation in analogy with the classical PCA analysis, we need at least local existence and uniqueness of geodesics and averages. Immediate questions are then: How do we model treelike structures and their variation? Can we encode global, tree-topological structure as well as local edgewise geometry in the geometry of a single shape space? Do geodesics exist, and are they unique? Do the dynamics of geodesics reflect natural shape variation? Here we provide the theoretical foundations needed to answer these questions, accompanied by a first implementation. We define shape spaces of ordered and unordered treelike shapes, where transitions in internal tree-topological structure are found at shape space singularities in the form of self-intersections. This is a general idea, which can be adapted to other situations where shapes with varying topology are studied – e.g. graphs or 3D shapes described by medial structures. The paper is organized as follows: In Section 1.1 we give a brief overview of related work. In Section 2 we define the treespace. Using Gromov’s approach to metric geometry [11] we gain insight into the geometric properties of two different metrics; one which is essentially tree edit distance (TED) and one which is a quotient euclidean distance (QED). We pay particular respect to the properties of geodesics and averages, which are essential for statistical shape analysis. In Section 4 we discuss how to overcome the computational complexity of both metrics. In particular, the computational cost can be drastically reduced by interpreting order as an edge semi-labeling, which can often be obtained through geometric or anatomic restrictions. In Section 5 we discuss a simple QED implementation, and in Section 6 we illustrate the properties of QED through computed geodesics and matchings for synthetic planar trees and 3D pulmonary airway trees. The paper contains mathematical results, but proofs are omitted due to length constraints. 1.1
Related Work
Metrics on sets of trees have been studied by different research communities in the past 20 years. The best-known approach in the computer vision community is perhaps TED, as used for instance by Sebastian et al [9] for comparing shapes via their shock graph representations. TED performs well for tasks such as matching and registration [9]. Our goal is, however, to adapt the classical shape statistics to treelike shapes. The TED metric will nearly always have infinitely many geodesics between two given trees, and thus it is no longer sufficient, since it becomes hard to meaningfully define and compute an average shape or find modes of variation. Another approach to defining a metric on treespace is that of Wang and Marron et al [12]. They define a metric on attributed trees as well as a “median-mean”
162
A. Feragen et al.
tree, which is not unique, and a version of PCA which finds modes of variation in terms of so called treelines, encoding the maximum amount of structural and attributal variation. Wang and Marron analyze datasets consisting of brain blood vessels, which are trees with few, long branches. The metric is not suitable for studying large trees with a lot of topological variation and noise such as airways, as it punishes structural changes much harder than shape variation with constant topological structure. A rather different approach is that of Jain and Obermayer [13], who define metrics on graphs. Here, graphs are modeled as incidence matrices, and the space of graphs, which is a quotient by the group of vertex relabelings, is given a metric inherited from Euclidean space. Means are computed using Lipschitz analysis, giving fast computations; however the model is rigid when it comes to modeling topological changes, which is essential for our purposes. We have previously [14] studied geodesics between small planar embedded trees in the same type of singular shape space as studied here. These geodesics are fundamental in the sense that they represent the possible structural changes found locally even in larger trees.
2
Geometry of Treespace
Before defining a treespace and giving it a geometric structure, let us discuss which properties are desirable. In order to compute average trees and analyze variation in datasets, we require existence and uniqueness properties for geodesics. When geodesics exist, we want the topological structure of the intermediate trees to reflect the resemblance in structure of the trees being compared – in particular, a geodesic passing through the trivial one-vertex tree should indicate that the trees being compared are maximally different. Perhaps more important, we would like to compare trees where edge matching is inconsistent with tree topology as in Fig. 2a; specifically, we would like to find geodesic deformations in which the tree topology changes when we have such edge matchings, for instance as in Fig. 2b.
(a) Edge matching
(b) Geodesic candidate
Fig. 2. A good metric must handle edge matchings which are inconsistent with tree topology
2.1
Representation of Trees
We represent any treelike (pre-)shape as a pair (T , x) consisting of a rooted, planar binary tree T = (V, E, r) with edge attributes describing edge shape. Here, T describes the tree topology, and the the attributes describe the edge geometry. The attributes are represented by a map x : E → A or, equivalently, by a point
Geometries on Spaces of Treelike Shapes
163
x ∈ AE , where A is the attribute space. Tree-shapes which are not binary are represented by the maximal binary tree in a very natural way by allowing constant edges, represented by the zero scalar or vector attribute. Endowing an internal edge with a zero attribute corresponds to collapsing that edge, see Fig. 4a, and hence an arbitrary attributed tree can be represented as an attributed binary tree. The intuitive idea is illustrated in Fig. 3.
Fig. 3. Treelike shapes are encoded by an ordered binary tree and a set of attributes describing edge shape
The attributes could be edge length, landmark points or edge parametrizations, for instance. In this work, we use trees whose attributes are open curves translated to start at the origin, described by a fixed number of landmark points. Thus, thorughout the paper, the attribute space is (Rd )n where d = 2 or 3 and n is the number of landmark points per edge. Collapsed edges are represented by a sequence of origin points. We need to represent trees of different sizes and structures in a unified way in order to compare them. As a first step, we describe all shapes using the same binary tree T to encode the topology of all the tree-shapes in the treespace – by choosing a sufficiently large binary tree we can represent all the trees in our dataset by filling out with collapsed edges. We call T the maximal binary tree. When given a left-right order on every pair of children in a binary tree, there is an implicitly defined embedding of the tree in the plane. Hence we shall use the terms “planar tree” and “ordered tree” interchangingly, depending on whether we are interested in the planar properties or the order of the tree. We initially define our metric on the set of ordered binary trees; later we use it to compute distances between unordered trees by considering all possible orders. In Section 4 we discuss how to handle the computational challenges of the unordered case. depth n with edges E, any atGiven an ordered maximal binary tree Tn of tributed tree T is represented by a point in X = e∈E (Rd )n . We call X the tree preshape space, since it parametrizes all possible shapes. 2.2
The Space of Trees as a Quotient – A Singular Treespace
We fix the maximal depth n binary tree Tn which encodes the connectivity of the trees. In the preshape space defined above, there will be several points corresponding to the same tree, as illustrated in Fig. 4a. We go from preshapes to shapes by identifying those preshapes which define the same shape. Consider two ordered tree-shapes where internal edges are collapsed. The orders of the original trees induce orders on the collapsed trees. We consider two ordered tree-shapes to be the same when their collapsed ordered topological structures are identical, and the edge attributes on corresponding non-collapsed edges are identical as well, as in Fig. 4a. Thus, tree identifications come with
164
A. Feragen et al.
(a)
(b)
Fig. 4. (a) Higher-degree vertices are represented by collapsing internal edges. The dotted line indicates zero attribute, which gives a collapsed edge. We identify those tree preshapes whose collapsed structures are identical – here x1 and x2 represent the same tree T . (b) The TED moves: Remove an edge a, add an edge b, deform an edge c.
an inherent bijection of subsets of E: If we identify x, y ∈ X = (RN )E , denote E1 = {e ∈ E|xe = 0}, E2 = {e ∈ E|ye = 0}; the identification comes with an order preserving bijection ϕ : E1 → E2 identifying those edges that correspond to the same edge in the collapsed tree-shape. Note that ϕ will also correspond to similar equivalences of pairs of trees with the same topology, but other attributes. The bijection ϕ next induces a bijection Φ : V1 → V2 given by Φ : (xe ) → / E1 } and V2 = {x ∈ X|xe = 0 if e ∈ / E2 } (xϕ(e) ). Here, V1 = {x ∈ X|xe = 0 if e ∈ are subspaces of X where, except for at the axes, the topological tree structure is constant, and for x ∈ V1 , Φ(x) ∈ V2 describes the same shape as x. We define a map Φ for each pair of identified tree-structures, and form an equivalence on X by setting x ∼ Φ(x) for all x and Φ. For each x ∈ X we denote ¯ = (X/ ∼) = by x¯ the equivalence class {x ∈ X|x ∼ x}. The quotient space X {¯ x|x ∈ X} of equivalence classes x ¯ is the space of treelike shapes. The geometric interpretation of the identification made in the quotient is that we are folding and gluing our space along the identified subspaces; i.e. when x1 ∼ x2 we glue the two points x1 and x2 together. See the toy quotient space in Fig. 5 for an intuitive lower-dimensional illustration. 2.3
Metrics on Treespace
Given a metric d on Euclidean space X = e∈E RN we define the standard quo¯ = X/ ∼ by setting tient pseudometric [15] d¯ on the quotient space X k ¯ d(¯ x, y¯) = inf d(xi , yi )|x1 ∈ x ¯, yi ∼ xi+1 , yk ∈ y¯ . (1) i=1
This amounts to finding the optimal path from x¯ to y¯ passing through the identified subspaces, as shown in Fig. 5. We define two metrics on X, which come from two different ways of combining the individual edge distances: The metrics d1 and d2 on X = e∈E RN are in 2 duced by the norms x−y1 = e∈E xe −ye and x−y2 = e∈E xe − ye . From now on, d and d¯ will denote either the metrics d1 and d¯1 , or d2 and d¯2 . We have the following: ¯ which is a conTheorem 1. The distance function d¯ restricts to a metric on X, tractible, complete, proper geodesic space.
Geometries on Spaces of Treelike Shapes
165
This means, in particular, that given any two trees, we can always find a geodesic between them in both metrics d¯1 and d¯2 . Note that d¯1 is essentially the classical tree edit distance (TED) metric, where the distance between two trees is the minimal sum of costs of edge edits needed to transform one tree into another, see Fig. 4b. The d¯2 is a descent of the Euclidean metric, and geodesics in this metric are concatenations of straight lines in flat regions. We call it the QED metric, for quotient euclidean distance. In Section 3 we describe the properties of these metrics with example geodesic deformations between simple trees. Using methods from metric geometry [11] we obtain results on curvature and average trees. We consider two different types of average tree; namely the centroid, as defined and computed in [3], and the circumcenter, defined as follows: ¯ with diameter D a circumcenter of S is defined as a point Given a set S ⊂ X ¯ ¯ x, D/2) = {¯ ¯ d¯2 (¯ x ¯ ∈ X such that S ⊂ B(¯ z ∈ X| x, z¯) ≤ D/2}. ¯ has ¯ ¯∈X Theorem 2. i) Endow X with the QED metric d¯2 . A generic point x ¯ a neighborhood U ⊂ X in which the curvature is non-positive. At non-generic ¯ d¯2 ) is unbounded. points, the curvature of (X, ¯ there exists ¯ ¯ ∈ X, ii) Endow X with the QED metric d¯2 . Given a generic point x a radius rx¯ such that sets contained in the ball B(¯ x, rx¯ ) have unique centroids and circumcenters. ¯ d¯1 ) is everywhere ¯ with the TED metric d¯1 . The curvature of (X, iii) Endow X ¯ d¯1 ) does not have locally unique geodesics anywhere. unbounded, and (X, The practical meaning of Theorem 2 is that i) we can use techniques from metric geometry to look for QED averages, ii) for datasets whose scattering is not too large, there exist unique centroids and circumcenters for the QED metric, and iii) we cannot use the same techniques in order to prove existence or uniqueness of centerpoints for the TED metric; in fact, any geometric method which requires bounded curvature is going to fail for the TED metric. This result motivates our study of the QED metric. 2.4
From Planar Trees to Spatial Trees
The world is not flat, and for most applications it is necessary to study embedded trees in R3 . As far as attributes in terms of edge embeddings and landmarks are concerned, this is not very different from the planar embedded trees – the main difference from the R2 case comes from the fact that trees in R3 have no canonical edge order. The left-right order on children of planar trees gives an implicit preference for edge matchings, and hence reduces the number of possible matches. When we no longer have this preference, we generally need to consider all orderings of the same tree and choose the one which minimizes the distance. ¯ = X/G, ¯ We define the space of spatial treelike shapes as the quotient X where G is the group of reorderings of the standard tree; G is a finite group. The metric ¯ We can prove: ¯ induces a quotient pseudometric d¯ on X. d¯ on X Theorem 3. For d¯ induced by either d¯1 or d¯2 , the function d¯ is a metric and the ¯ is a contractible, complete, proper geodesic space. ¯ d) space (X,
166
A. Feragen et al.
(a)
(b)
Fig. 5. (a) A 2-dimensional toy version of the folding of Euclidean space along two linear subspaces, along with geodesic paths from x ¯ to y¯ and from x ¯ to y¯ . (b) Local¯ ¯ ¯ to-global properties of TED: d1 (T1 , T2 ) = d1 (T1,1 , T2,1 ) + d1 (T1,2 , T2,2 ).
While considering all different possible orderings of the tree makes perfect sense from the geometric point of view, in reality this becomes an impossible task as the size of the trees grow beyond a few generations. In real applications we can, however, efficiently reduce complexity by taking tree- and treespace geometry into account. This is discussed in Section 4.
3
Comparison of the QED and TED Metrics
In this section we list the main differences between the TED and QED metrics and compare their performance on the small trees studied in [14]. ¯ are ¯ and X Geometry. As shown in Theorem 1 and Theorem 3 above, both X complete geodesic spaces. However, by Theorem 2, the QED metric gives locally non-positive curvature at generic points, while the TED metric gives unbounded ¯ This means that we cannot imitate the classical stacurvature everywhere on X. tistical procedures on shape spaces using the TED metric. Note also that the QED metric is the quotient metric induced from the Euclidean metric on the preshape space X, making it the obvious choice for a metric seen from the shape space point of view. Computation. The TED metric has nice local-to-global properties, as illustrated in Fig. 5b. If the trees T1 and T2 are decomposed into subtrees T1,1 , T1,2 and T2,1 , T2,2 as in Fig. 5b such that the geodesic from T1 to T2 restricts to geodesics between T1,1 and T2,1 as well as T2,1 and T2,2 , then d(T1 , T2 ) = d(T1,1 , T2,1 ) + d(T2,1 , T2,2 ). This property is used in many TED algorithms, and the same property does not hold for the QED metric. Performance. We have previously [14] studied fundamental geodesic deformations for the QED metric; that is, deformations between depth 3 trees which encode the local topological transformations found in generic tree-geodesics. These deformations are important for determining whether an internal tree-topological transition is needed to transform one (sub)tree into another – i.e. whether the correct edge registration is one which is inconsistent with the tree topology.
Geometries on Spaces of Treelike Shapes
(a)
167
(b)
Fig. 6. (a) The fundamental geodesic deformations: geodesic 1 goes through an internal structural change, while geodesic 2 does not, as the disappearing branches approach the zero attribute. (b) Two options for structural change.
To compare the TED and QED metric on small, simple trees, consider the two tree-paths in Fig. 6b, where the edges are endowed with non-negative scalar attributes a, b, c, d, e describing edge length. Path 1 indicates a matching of the identically attributed edges c and d, while Path 2 does not make the match. Now, √ the cost of Path 1 is 2e in both metrics, while the cost of Path 2 is 2 c2 + d2 in the QED metric and 2(c + d) in the TED metric. In particular, TED will chose to identify the c and d edges whenever e2 ≤ c2 + 2cd + d2 , while QED makes the match whenever e2 ≤ (c2 + d2 ). That is, TED will be more prone to internal structural changes than QED. This is also seen empirically in the comparison of TED and QED matching in Fig. 7. Note that although the TED is more prone to matching trees with different tree-topological structures, the matching is similar.
(a) Matching in the QED metric.
(b) Matching in the TED metric.
Fig. 7. Given a set of five data trees, we match each to the four others in both metrics
4
Computation and Complexity
Complexity is a problem with computing both TED and QED distances, in particular for 3D trees, which do not have a canonical planar order. Here we discuss how to use geometry and anatomy to find approximations of the metric whose complexity is significantly reduced.
168
A. Feragen et al.
Ordered trees: Reducing complexity using geometry. The definition of d¯ in (1) opens for considering infinitely many possible paths. However, we can significantly limit the search for a geodesic by taking the geometry of treespace into account. For instance, it is enough to consider paths going through each structural transition at most once: Lemma 1. For the metrics d¯1 and d¯2 on X, the shortest path between two points ¯ passes through each identified subspace at most once. in X Another way of limiting the complexity of real computations is to find a generic shape structure. We show that locally in the geodesic, the only internal topological transitions found will be those described by the fundamental geodesic deformations in Section 3. Theorem 4. i) Treelike shapes that are truly binary (i.e. their internal edges are not collapsed) are generic in the space of all treelike shapes. ii) The tree-shape deformations whose local structure is described by the fundamental geodesic deformations in Fig. 6a are generic in the set of all tree deformations. Moreover, the set of pairs of trees whose geodesics are of this form, is generic in the space of all pairs of trees. Genericity basically means that when a random treelike shape is selected, the probability of that shape being truly binary is 1. This does not mean, however, that non-binary trees do not need to be considered. While non-binary trees do not appear as randomly selected trees, they do appear in paths between randomly selected pairs of trees, as in Fig. 6a above. This also means that we can, in fact, work on a less complicated semi-preshape-space where we, instead of identifying all possible representations of the same shape, restrict ourselves to making the identifications illustrated by Fig. 6a. This is similar to the generic form of shock graphs and generic transitions between shock graphs found by Giblin and Kimia [16]. The notions of genericity in the two settings are, however, different since in [16], genericity is defined with respect to the Whitney topology on the space of the corresponding shape boundary parametrizations. Although we do run into non-binary treelike shapes in real-life applications, for instance when studying airway trees, this can be interpreted as an artifact of resolution rather than as true higher-degree vertices. Indeed, the airway extraction algorithms record trifurcations when the relative distances are below certain treshold values. Finally, it is often safe to assume that only very few structural changes will actually happen. For the airway trees studied in Section 6.2, we find empirically that it is enough to allow for one structural change in each depth 3 subtree. Using these arguments, it becomes feasible to computationally handle semilarge structures. In many medical imaging applications, such as airways or blood vessels, we are not interested in studying tree-deformations between very large networks or trees, since beyond a certain point, the tree-structure stops following a predetermined pattern and becomes a stochastic variable.
Geometries on Spaces of Treelike Shapes
169
Unordered trees: Reducing complexity using geometry- or anatomybased semi-labeling schemes. It is well known that the general problem of computing TED-distances between unordered trees is NP-complete [17], and the QED metric is probably generally not less expensive to compute, as indicated also by Theorem 3. Here we discuss how to use geometry and anatomy to find approximations of the metric whose complexity is significantly reduced. In particular, trees appearing in applications are usually not completely unordered, but are often semi-labeled. Semi-labelings can come from geometric or anatomical properties as in the pulmonary airway trees studied in Example 1 below, or may be obtained by a coarser registration method. (Semi-)labelings can also come from a TED distance computation or approximation, which is a reasonable way to detect approximate structural changes since the TED and QED give similar matchings, as seen in Fig. 7. Example 1 (Semi-labeling of the upper airway tree). In Fig. 9a, we see a “standard” airway tree with branch labels (or at least the first generations of it). Most airway trees, as the one shown to the left in Fig. 1, have similar, but not necessarily identical, topological structure to this one, and several branches, especially of low generation, have names and can be identified by experts. The top three generations of the airway tree serve very clear purposes in terms of anatomy. The root edge is the trachea; the second generation edges are the left and right bronchi; and the third generation edges lead to the top and towards the middle and lower lung lobes on the left, and to the top and bottom lobes on the right. Thus we find a canonical semi-labeling of the airway tree, which we use to simplify the problem of computing airway distances in Section 6.2.
5
QED Implementation
We explain a simple implementation of the QED metric. As shown in the next sections, our results are promising – the geodesics for synthetic data show the expected correspondences between edges, and in our experiments on 6 airway trees extracted from CT scans of 3 patients made at two different times, we recognize the correct patient in 5 out of 6 cases. Alignment of trees. We translate all edges to start at 0. We do not factor out scale in the treespace, because in general, we consider scale an important feature of shape. Edge scale in particular is a critical property, as the dynamics of appearing and disappearing edges is directly tied to edge size. Edge shape comparison. We represent each edge in a tree by a fixed number of landmark points – in our case 6 – evenly distributed along the edge, the first one at the origin (and hence neglected). The distance between two edge attributes v1 , v2 ∈ (Rd )5 is defined as the Euclidean distance between them. Although simple, this distance measure does take the scale of edges into account to some degree since they are aligned.
170
A. Feragen et al.
Fig. 8. Different options for structural changes in the left hand side of the depth 3 tree. Topological illustration on the left, corresponding tree-shape illustration on the right.
Implementation. In ordered depth 3 trees we make an implementation using Algorithm 1, where the number of structural changes taking place in the whole tree is limited to either 1 or 2, and in the final case we also rule out the case where the two changes happen along the same “stem”. This leaves us with the options for structural changes illustrated in Fig. 8, applied to the left and right half tree. The complexity of Algorithm 1 is O(nk · C(n, k)), where n is the number of internal vertices, k is the number of structural changes and C(n, k) is the maximal complexity of the optimization in line 8. Algorithm 1. Computing geodesics between ordered, rooted trees with up to k structural changes 1: x, y planar rooted depth n binary trees 2: S = {S} set of ordered identified pairs S = {S1 , S2 } of subspaces of X corresponding to internal topological changes, corresponding to a subspace S ¯ s.t. if {S1 , S2 } ∈ S, then also {S2 , S1 } ∈ S. of X, ˜ ≤ k do 3: for S˜ = {S 1 , . . . , S s } ⊂ S with |S| i i 4: for p ∈ S with representatives pi1 ∈i S1 and pi2 ∈ S2i do 5: p = (p1 , p2 , . . . , ps ) 2 i 6: f (p) = min{d2 (x, p1 ) + s−1 j=1 ds (pi , pi+1 ) + d2 (ps , y)} 7: end for 8: dS˜ = min f (p)|p = (p1 , . . . , ps ), pi ∈ S i , S˜ = {S 1 , . . . , S s } 9: 10: 11: 12: 13: 14:
6
pS˜ = {pi1 , pi2 }si=1 = argminf (p) end for ˜ ≤ k} d = min{dS |S˜ ⊂ S, |S| p = {p1 , p2 }si=1 = {pS˜ |dS˜ = d} geodesic = g = {x → p11 ∼ p12 → p21 ∼ p22 → . . . → ps1 ∼ ps2 → y} return d, g
Experimental Results
The QED metric is new, whereas the matching properties of the TED metric are well known [9]. In this section we present experimental results on real and synthetic data which illustrate the geometric properties of the QED metric. The experiments on airway trees in Section 6.2 show, in particular, that it is feasible to compute the metric distances on real 3D data trees.
Geometries on Spaces of Treelike Shapes
6.1
171
Synthetic Planar Trees of Depth 3
We have uploaded movies illustrating geodesics between planar depth 3 trees, as well as a matching table for a set of planar depth 3 trees, to the webpage http://image.diku.dk/aasa/ACCVsupplementary/geodesicdata.html. Note the geometrically inutitive behaviour of the geodesic deformations. 6.2
Results in 3D: Pulmonary Airway Trees
We also compute geodesics between subtrees of pulmonary airway trees. The airway trees were first segmented from low dose screening computed tomography (CT) scans of the chest using a voxel classification based airway tree segmentation algorithm by Lo et al. [18]. The centerlines were then extracted from the segmented airway trees using a modified fast marching algorithm based on [19], which was originally used in [18] for measuring the length of individual branches. Since the method also gives connectivity of parent and children branches, a tree structure is obtained directly. Leaves with a volume less than 11 mm3 were assumed to be noise and pruned away, and the centerlines were sampled with 5 landmark points on each edge. Six airways extracted from CT scans taken from three different patients at two different times were analyzed, restricting to the first six generations of the airway tree. The first three generations were identified and labeled as in Ex. 1, leaving us with four depth 3 subtrees representing the 4th to 6th generations of a set of 3D pulmonary airway trees. Algorithm 1 was run with both one and two structural changes, and no subtree geodesics (out of 24) had more than one structural change. Thus, we find it perfectly acceptable to restrict our search to paths with few structural changes.
(a)
(b)
Fig. 9. (a) Standard airway tree with branch labels. Drawing from [6] based on [20]. (b) Table of sum of QED distances for the five airway subtrees for 6 airway trees retrieved from CT scans of three patients taken at two different times, denoted P(a, i) where a denotes patient and i denotes time.
In the depth 3 subtrees we expect the deformation from one lung to the other to include topological changes, and we compare them using the unordered QED metric implementation on the subtrees which are cut off after the first 3 generations. As a result we obtain a deformation from the first depth 6 airway tree to
172
A. Feragen et al.
the second, whose restriction to the subtrees is a geodesic. In this experiment we study four airways originating from CT scans of two different patients acquired at two different times, and we obtain the geodesic distances and corresponding matching shown in Fig. 9b. The five subtree distances (top three generations and four subtrees) were added up to give the distance measure in Fig. 9b, and we see that for 5 out of 6 images, the metric recognizes the second image from the same patient.
7
Conclusions and Future Work
Starting from a purely geometric point of view, we define a shape space for treelike shapes. The intuitive geometric framework allows us to simultaneously take both global tree-topological structure and local edgewise geometry into account. We define two metrics on this shape space, QED and TED, which give the shape space a geodesic structure (Theorems 1 and 3). QED is the geometrically natural metric, which turns out to have excellent geometric properties which are essential for statistical shape analysis. In particular, the QED metric has local uniqueness of geodesics and local existence and uniqueness for two versions of average shape, namely the circumcentre and the centroid (Theorem 2). TED does not share these properties, but has better computational properties. Both metrics are generally NP hard to compute for 3D trees. We explain how semi-labeling schemes and geometry can be used to overcome the complexity problems, and illustrate this by computing QED distances between trees extracted from pulmonary airway trees as well as synthetic planar data trees. Our future research will be centered around two points: Development of nonlinear statistical methods for the singular tree-shape spaces, and finding fast algorithms for an approximate QED metric. The latter will allow us to actually compute averages and modes of variation for large, real 3D data trees. Acknowledgements. This work is partly funded by the Lundbeck foundation, the Danish Council for Strategic Research (NABIIT projects 09−065145 and 09− 061346) and the Netherlands Organisation for Scientific Research (NWO). We thank Dr. J.H. Pedersen from the Danish Lung Cancer Screening Trial (DLCST) for providing the CT scans. We thank Marco Loog, Søren Hauberg and Melanie Ganz for helpful discussions during the preparation of the article.
References 1. Pedersen, J., Ashraf, H., Dirksen, A., Bach, K., Hansen, H., Toennesen, P., Thorsen, H., Brodersen, J., Skov, B., Døssing, M., Mortensen, J., Richter, K., Clementsen, P., Seersholm, N.: The Danish randomized lung cancer CT screening trial - overall design and results of the prevalence round. J. Thorac. Oncol. 4, 608–614 (2009)
Geometries on Spaces of Treelike Shapes
173
2. Staal, J., Abramoff, M., Niemeijer, M., Viergever, M., van Ginneken, B.: Ridge based vessel segmentation in color images of the retina. TMI 23, 501–509 (2004) 3. Billera, L.J., Holmes, S.P., Vogtmann, K.: Geometry of the space of phylogenetic trees. Adv. in Appl. Math. 27, 733–767 (2001) 4. Charnoz, A., Agnus, V., Malandain, G., Soler, L., Tajine, M.: Tree matching applied to vascular system. In: Brun, L., Vento, M. (eds.) GbRPR 2005. LNCS, vol. 3434, pp. 183–192. Springer, Heidelberg (2005) 5. Chalopin, C., Finet, G., Magnin, I.E.: Modeling the 3D coronary tree for labeling purposes. MIA 5, 301–315 (2001) 6. Tschirren, J., McLennan, G., Pal´ agyi, K., Hoffman, E.A., Sonka, M.: Matching and anatomical labeling of human airway tree. TMI 24, 1540–1547 (2005) 7. Kaftan, J., Kiraly, A., Naidich, D., Novak, C.: A novel multi-purpose tree and path matching algorithm with application to airway trees. In: SPIE, vol. 6143 I (2006) 8. Metzen, J.H., Kr¨ oger, T., Schenk, A., Zidowitz, S., Peitgen, H.-O., Jiang, X.: Matching of tree structures for registration of medical images. In: Escolano, F., Vento, M. (eds.) GbRPR. LNCS, vol. 4538, pp. 13–24. Springer, Heidelberg (2007) 9. Sebastian, T.B., Klein, P., Kimia, B.: Recognition of shapes by editing their shock graphs. TPAMI 26, 550–571 (2004) 10. Nakano, Y., Muro, S., Sakai, H., Hirai, T., Chin, K., Tsukino, M., Nishimura, K., Itoh, H., Par´e, P.D., Hogg, J.C., Mishima, M.: Computed tomographic measurements of airway dimensions and emphysema in smokers. Correlation with lung function. Am. J. Resp. Crit. Care Med. 162, 1102–1108 (2000) 11. Gromov, M.: Hyperbolic groups. In: Essays in group theory. Math. Sci. Res. Inst. Publ., vol. 8, pp. 75–263 (1987) 12. Wang, H., Marron, J.S.: Object oriented data analysis: sets of trees. Ann. Statist. 35, 1849–1873 (2007) 13. Jain, B.J., Obermayer, K.: Structure spaces. J. Mach. Learn. Res. 10, 2667–2714 (2009) 14. Feragen, A., Lauze, F., Nielsen, M.: Fundamental geodesic deformations in spaces of treelike shapes. In: Proc ICPR (2010) 15. Bridson, M.R., Haefliger, A.: Metric spaces of non-positive curvature. Springer, Heidelberg (1999) 16. Giblin, P.J., Kimia, B.B.: On the local form and transitions of symmetry sets, medial axes, and shocks. IJCV 54, 143–156 (2003) 17. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Inf. Process. Lett. 42, 133–139 (1992) 18. Lo, P., Sporring, J., Ashraf, H., Pedersen, J.J., de Bruijne, M.: Vessel-guided airway tree segmentation: A voxel classification approach. Medical Image Analysis 14, 527–538 (2010) 19. Schlath¨ olter, T., Lorenz, C., Carlsen, I.C., Renisch, S., Deschamps, T.: Simultaneous segmentation and tree reconstruction of the airways for virtual bronchoscopy. In: SPIE, vol. 4684, pp. 103–113 (2002) 20. Boyden, E.A.: Segmental anatomy of the lungs. McGraw-Hill, New York (1955)
Human Pose Estimation Using Exemplars and Part Based Refinement Yanchao Su1 , Haizhou Ai1 , Takayoshi Yamashita2 , and Shihong Lao2 1
Computer Science and Technology Department, Tsinghua, Beijing 100084, China 2 Core Technology Center, Omron Corporation, Kyoto 619-0283, Japan
Abstract. In this paper, we proposed a fast and accurate human pose estimation framework that combines top-down and bottom-up methods. The framework consists of an initialization stage and an iterative searching stage. In the initialization stage, example based method is used to find several initial poses which are used as searching seeds of the next stage. In the iterative searching stage, a larger number of body parts candidates are generated by adding random disturbance to searching seeds. Belief Propagation (BP) algorithm is applied to these candidates to find the best n poses using the information of global graph model and part image likelihood. Then these poses are further used as searching seeds for the next iteration. To model image likelihoods of parts we designed rotation invariant EdgeField features based on which we learnt boosted classifiers to calculate the image likelihoods. Experiment result shows that our framework is both fast and accurate.
1
Introduction
In recent year human pose estimation from a single image has become an interesting and essential problem in computer vision domain. Promising applications such as human computer interfaces, human activity analysis and visual surveillance always rely on robust and accurate human pose estimation results. But human pose estimation still remains a challenging problem due to the dramatic change of shape and appearance of articulated pose. There are mainly two categories of human pose estimation methods: Topdown methods and bottom-up methods. Top-down methods, including regression based methods and example based method, concern of the transformation between human pose and image appearance. Regression based methods [1, 2] learns directly the mapping from image features to human pose. Example based methods [3–7] find a finite number of pose examplars that sparsely cover the pose space and store image features corresponding to each examplar, the resultant pose is obtained through interpolation of the n closest examplars. Bottom-up methods, which are mainly part based methods [8–12], divide human body into several parts and use graph models to characterize the whole body. First several candidates of each part are found using learned part appearance model and then the global graph models are used to assemble each part into a whole body. R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 174–185, 2011. c Springer-Verlag Berlin Heidelberg 2011
Human Pose Estimation Using Exemplars and Part Based Refinement
175
Among all these works, top-down methods are usually quick in speed but their performances rely on the training data. Bottom up methods are more accurate but the exhaustive search of body parts is too time consuming for practical applications. Inspired by the works of Active Shape Model (ASM) [13], we find that iterative local search could give accurate and robust result of shape alignment with acceptable computation cost given a proper initialization. We design a novel iterative inferring schema for bottom-up part based pose estimation. And to model the image likelihood of each part, we designed rotation invariant EdgeField features based on which we learn boosted classifiers for each body part. To provide initial poses to the part based method, we utilize a sparse example based method that gives several best matches with little time cost based on the cropped human body image patch given by a human detector. The combination of top-down and bottom-up methods enables us to achieve accurate and robust pose estimation, and the iterative inferring schema boosts the speed of pose inference with no degeneracy on accuracy. The rest of this paper is organized as follows: section 2 presents an overview of our pose estimation framework, section 3 describes the top-down example based stage, and section 4 is the bottom-up iterative searching stage. Experiments and conclusions are then given in section 5 and section 6.
2
Overview of Our Framework
A brief flowchart of our framework is given in Fig. 1.
6HDUFKLQJ &URSSHG 6HDUFKLQJ 6HDUFKLQJ 6HHGV ,PDJH &DQGLGDWHV 6HHGV ([DPSOHEDVHG *HQHUDWH %3LQIHUHQFH &RYHUDJH" LQLWLDOL]DWLRQ FDQGLGDWHV
5HVXOW
Fig. 1. Examples in dataset 2
The pose estimation framework uses the cropped image patch given as input and consists of two main stages: the initialization stage and the refining stage. In the initialization stage, a sparse top-down example based method is used to find the first n best matches of the examplars, the poses of these examplars are used as the searching seeds in the searching stage. In the refining stage, we utilize a set of boosted body part classifier to measure the image likelihood and a global tree model to restrict the body configuration. Inspired by the works of Active Shape Model, we design an iterative searching
176
Y. Su et al.
algorithm to further refine the pose estimation result. Unlike ASM which finds the best shape in each iteration, we find the first n best poses using BP inference on the tree model and use them as searching seeds for the next iteration. The refined pose estimation result is given when the iteration converges.
3
Example Based Initialization
The objective of example based initialization stage is to provide several searching seeds for the part based refinement stage, so this stage has to be fast and robust. In some previous example-based methods[3, 4, 7], a dense cover of pose space which uses all the training samples as examplars is adopted to achieve better performance, but in our case, a sparse examplar set is enough to get initial poses for the refining stage. We use the Gaussian Process Latent Variable Model (GPLVM) [14] to obtain a low dimensional manifold and use the active set of GPLVM as examplars. An illustration of the GPLVM model and the examplars are given in Fig. 2.
Fig. 2. Illustration of GPLVM and examplars
The adopted image feature is an important aspect of example-based methods. In many previous works, silhouettes of human body are usually used as the templates of examplars. But in practice, it is impossible to get accurate silhouettes from a single image with clutter background. In [3] the author adopts histograms of oriented gradients (HOG) [15] as the image descriptor to characterize the appearance of human body. However, there are two main limitations of using HOG descriptor in a tight bounding box: firstly, the bounding box of human varies dramatically when the pose changes, secondly, a single HOG descriptor isnt expressive enough considering the furious change of the appearance of human body in different pose. So we first extend the tight bounding rectangle to a square, and then divide the bounding square equally into 9 sub-regions,
Human Pose Estimation Using Exemplars and Part Based Refinement
177
HOG descriptors in the sub-regions are concatenated into the final descriptor. And for each examplar, we assigned a weight to each sub-region which represents the percentage of foreground area in the sub-region. Given an image and the bounding box, we first cut out a square image patch based on the bounding box, and then HOGs of all the sub-regions are computed. The distance between the image and an examplar is given by the weighted sum of the Euclid distances of HOGs in each sub-region. We find the first n examplars with shortest distance as the initial poses.
4 4.1
Part Based Refinement Part Based Model
It is natural that human body can be decomposed into several connected parts. Similar with previous works, we model the whole body as a Markov network as shown in Fig. 3(left). The whole body is represented as a tree structure in which each tree node corresponds to a body part and is related to its adjacent parts and its image observation. Human pose can be represented as the configuration of the tree model: L = {l1 , l2 , ..., ln }, and the state of each part is determined by the parameter of its enclosing rectangle: li = {x, y, θ, w, h} (Fig. 3(right)). Each part is associated with an image likelihood, and each edge of the tree between connected part i and part j has an associated, that encodes the restriction of configuration of part i and potential function part j.
xi
pj
pi
ߠ
ߠ
( x, y) h
xj
w
VWDWH REVHUYDWLRQ
Fig. 3. Left: Markov network of human body. Middle:the constraint between two connected parts. Right: the state of a body part.
As shown in Fig. 3(middle), the potential function has two components corresponding to the angle and the position: p θ (li , lj ) + ψij (li , lj ) ψij (li , lj ) = ψij
(1)
The constraint of angle is defined by the Mises distribution: θ (li , lj ) = ek cos(θ−μ) ψij
(2)
178
Y. Su et al.
And the potential function on position is a piece-wise function: p (li , lj ) = { ψij
2
αe−β||pi −pj || , ||pi − pj ||2 < t 0 , ||pi − pj ||2 ≥ t
(3)
The parameter α, β can be learnt from the ground truth data. The piece-wise definition of the potential function not only guarantees the connection between adjacent parts, but also can speed up the inference procedure. The body pose can be obtained by optimizing: p(L|Y ) ∝ p(Y |L)p(Y ) = ψi (li ) ψij (li , lj ) (4) i
4.2
i∈Γ (p)
Image Likelihood
As in previous works, body silhouettes are the most direct cue to measure the image likelihood. However, fast and accurate silhouettes extraction is another open problem. As proven in previous works, boosting classifiers with gradient based features such as HOG feature [15] is an effective way of measuring image likelihood in pedestrian detection and human pose estimation. However, in human pose estimation, the orientation of body parts change dramatically when the pose changes, so when measuring image likelihood of body parts we need to rotate the images to compute the HOG feature values. This time consuming image rotation can be eliminated when using a kind of rotation invariant features. Based on this point, we designed an EdgeField feature. Given an image, we can take each pixel as a charged particle and then the image as a 2D field. Following the definition of electric field force, we can define the force between two pixels p and q: kdp · dq r (5) ||r||3 Where dp , dq are the image gradient on pixel p, q. and r is the vector pointing from p to q. the amplitude of the force is in proportion to the gradient amplitude of p and q, and the similarity between the gradient of p and q, and is in inverse proportion to the square of the distant between p and q. Given a template pixel p and an image I, we can compute the force between the template pixel and the image by summing up the force between the template pixel and each image pixel: f (p, q) =
f (p, I) =
kdp · dq q∈I
||r||3
r
(6)
By applying orthogonal decomposition on f(p,I), we can get: f (p, I) =
f (p, I) · ex f (p, I) · ey
=k
dp · q∈Γ (p) dp · q∈Γ (p)
rq ·ex ||rq ||3 rq ·ey ||rq ||3
=
dp · Fx (p) dp · Fy (p)
= dp F (p) (7)
Human Pose Estimation Using Exemplars and Part Based Refinement
179
Where ex and ey are the unit vector of x and y directions. And F (p) is a 2 by 2 constant matrix in each position p in image I. we call F (p) the EdgeField. Our EdgeField feature is define as a short template segment of curve composed of several points with oriented gradient. The feature value is defined as the projection of the force between the template segment and the image to a normal vector l: f (S, I) = F (p, I) · l = dTp F (p)l (8) p∈S
p∈S
Obviously, the calculation of the EdgeField feature is rotation invariant. To compute the feature value of an oriented EdgeField feature, we only need to rotate the template. The EdgeField can be precomputed from the image by convolution. An example of EdgeField and its contour line are shown in Fig. 4 left and middle. Fig. 4 right shows an example of EdgeField feature.
dpi pi f l
f(S,I)
S I
Fig. 4. Left: EdgeField; Middle: Contour line of EdgeField; Right: EdgeField feature
Based on the EdgeField feature, we can learn a boosted classifier for each body part. We use the improved Adaboost algorithm [14] that gives real-valued confidence to its predictions which can be used as the image likelihood of each part. The weak classifier based on feature S is defined as follow: h(Ii ) = lutj , vj < f (S, Ii ) ≤ vj+1
(9)
Where the range of feature value is divided into several sub ranges{(vj , vj+1 )}, and the weak classifier output a certain constant confidence luti in each sub range. A crucial problem in learning boosted classifier with EdgeField feature is that the number of different EdgeField features is enormous and exhaustive search in conventional boosting learning becomes impractical. So we design a heuristic searching procedure as follow:
180
Y. Su et al.
Algorithm 1. Weak classifier selection for the ith part Input: samples {Iij }, labels of the sample{yij } and the corresponding weight{Dij } 1. Initialize Initialize the open feature list: OL = {S1 , S2 , . . . , Sn } and the close feature list: CL = φ 2. Grow and search: for i = 1 to T lost from OL: Select the feature S ∗ with the lowest classification S ∗ = argmins∈OL (Z(S)) where Z(S) = i Di exp(−yi h(Ii )) OL = OL − {S ∗ }, CL = CL ∪ {S ∗ } Generate new features based on S∗ under the constraint in Fig. 5, then learn parameters and calculate the classification lost and put them in OL. end for 3. Output: Select the feature with the lowest classification loss for OL and CL
This algorithm initializes with EdgeField features with only one template pixel, and generates new features based on current best feature under the growing constraint which keep the template segment of the feature simple and smooth.
Fig. 5. Growing constraint of features: Left: the second pixel (gray square) can only be placed at 5 certain directions from the first pixel (black square); Middle: The following pixels can only be placed in the proceeding direction of current segment and in the adjacent directions. Right: the segment cannot cross the horizontal line crossing the first pixel.
When the position of each pixel p in the feature is given, we need to learn the parameters of the feature: the projection vector l and the gradient of each pixel dp. The objective is to best separate the positive and negative samples. Similar with common object detection task, the feature values of positive samples are surrounded by those of negative samples, so we take the Maximal-RejectionClassifier criterion [18] which minimizes the positive scatter while maximizing the negative scatter: ({dpi , l}) = argminRpos /Rneg
(10)
Where Rpos and Rneg are the covariance of the feature values of positive sample and negative samples respectively.
Human Pose Estimation Using Exemplars and Part Based Refinement
181
Since it is hard to learning l and dp simultaneously, we first discretize l into 12 directions, and learn {dp } for each l. Given l, the feature value can be computed as follow:
dTpi F (pi )l = dTp1 , dTp2 , . . . , dTpn (F (p1 )l, F (p2 )l, . . . , F (pn )l) (11) f (S, I) = pi ∈S
Then the learning of {dp } becomes a typical MRC problem where dTp1 , . . . , dTpn is taken as the projection vector. Following [16], we adopt weighted MRC to take the sample weights into account. Given the boosted classifier of each part, the image likelihood can be calculated as follow: hk , i(Ii )))−1 (12) ψi (li ) = (1 + exp(−Hi (Ii )))−1 = (1 + exp(− k
4.3
Iterative Pose Inference
Belief propagation (BP)[17] is a popular and effective inference algorithm of tree structured model inference. BP calculates the desired marginal distributions p(li—Y) by local message passing between connected nodes. At iteration n, each node t ∈ V calculates a message mnts (lt ) that will be passed to a node s ∈ Γ (t): n ψst (ls , lt )ψt (lt ) × mn−1 (13) mts (ls ) = α ut (ls )dlt lt
u∈Γ (s)
Where α = 1/ mnts (ls )is the normalization coefficient. The approximation of the marginal distribution p(ls |Y ) can be calculated in each iteration: mnut (ls ) (14) pˆ(ls |Y ) = αψt (lt ) u∈Γ (s)
In tree structured models, pˆ(ls |Y ) will converge to p(ls |Y ) when the messages are passed to every nodes in the graph. In our case the image likelihood is discontinuous so that the integral in equation (13) becomes product over all feasible states lt. Although the present of the BP inference reduces the complexity for pose inference, the feasible state space of each part remains unconquerable by exhaustive search. Inspired by the work of Active Shape Model (ASM), we designed an iterative inference procedure which limits each part in the neighborhood of several initial states and inference locally in each iteration. The whole iterative inference algorithm is shown in algorithm 2. In each iteration, the searching seeds are the initial guesses of the states of each part, and we sample uniformly from the neighborhood of each searching seeds to generate searching candidates. And then BP inference is applied over the searching candidates to calculate the marginal distribution over the candidates. And the searching seeds for next iteration are selected from the candidates using MAP according to the marginal distribution.
182
Y. Su et al.
Unlike ASM which uses the reconstructed shape as the initial shape for the local search stage, we keep the first n best candidates as the searching seeds for the next iteration and searching candidates are generated based on the searching seeds. Preserving multiple searching seeds reduce the risk of improper initial pose.
Algorithm 2. Iterative pose inference Input: searching seeds for each part Si = {li1 , li2 , . . . , lin } Generating candidates: Ci = {li |li ∈ N eighborhood(lki ), lki ∈ Si } For each part i: Calculate the image likelihood ψi (li )using boosted classifiers BP inference: Loop until convergence: Calculate messages: For each part s and the messages
t, t ∈ Γ (s) caculate n mts = α lt ∈Ct ψst (ls , lt )ψt (lt ) × u∈Γ (s) mn−1 ut (ls) n Calculate belief: pˆ(ls |Y ) = αψt (lt ) u∈Γ (s) mut (ls ) Find searching seeds for each part i: Si =candidates with the first n largest belief values Go to generating candidates if Si do not converge The estimate pose L = {Si }
5 5.1
Experiments Experiment Data
Our experiments are taken on following dataset: 1. Synthesized dataset: A dataset of 1000 images synthesized using Poser 8. Each image contains a human and the pose is labeled manually. And the backgrounds of these images are the images selected randomly selected from other datasets. 2. The Buffy dataset from [11]: This dataset contains 748 video frames over 5 episodes of the fifth season of Buffy: the vampire slayer. The upper body pose (including head, torso, upper arms and lower arms) are labeled manually in each image. This is a challenging dataset due to the changing of poses, different clothes and strongly varying illumination. As in [11, 12], we use 276 images in 3 episodes as the testing set, and the rest 472 as training set. 3. The playground dataset: This dataset contains 350 images of people play football and basketball in the playground. The full body pose is labeled manually. 150 images from this data set are used as the training set, and the rest 200 is the testing set. 4. The iterative image parsing dataset from[11]: this dataset contains 305 images of people engaged in various activities such as standing, performing exercises, dancing, etc. the full body pose in each image is annotated manually. This dataset is more challenging because there is no constraint on human poses, clothes, illumination and backgrounds. 100 images in this dataset are used as the training set and the rest 205 is the testing set.
Human Pose Estimation Using Exemplars and Part Based Refinement
0.6 0.55
Our descriptor HOG descriptor
0.85
Mean Error
Mean Error
0.7 0.65
1
0.9
Our descriptor HOG descriptor
Mean Error
0.8 0.75
0.8 0.75 0.7
0
5 10 15 number of selected examplars
20
0.65
183
0
5 10 15 number of selected examplars
20
Our descriptor HOG descriptor
0.9 0.8 0.7
0
5 10 15 number of selected examplars
20
Fig. 6. Mean minimum errors of selected examplars in dataset 2,3,4 (from left to right)
5.2
Experiments of Examplar Based Initialization
To learn the examplars for full body pose estimation, we used the whole dataset 1 to learn examplars. The coordinates of 14 joint point of each image are connected as a pose vector; GPLVM is used to learn a low dimensional manifold of the pose space. The active set of GPLVM is taken as the examplars and the corresponding descriptors are stored. The upper body examplars are learnt similarly by using the bounding box of upper body. Given an image and a bounding box of human, we generate 100 candidate bounding box by adding random disturbance to the bounding box. And for each examplar, we find the best candidate bounding box with the shortest distance of the descriptor, and the first 20 examplars with the shortest distance are used as the initial pose in the next stage. We measures the accuracy of this stage on the datasets in terms of the mean of minimum distance of corresponding joint points divided by the part length between ground truth and the top n selected examplars, as shown in Fig. 6. 5.3
Experiments of Pose Estimation
For the part classifier, in the training sets, we cut out image patches according to the bounding box of each part and normalize them into the same orientation and scale and use them as positive samples. Negative samples are obtained by adding large random disturbance to the ground truth states of each part. In the pose estimation procedure, the top 20 matched examplars are used as the searching seeds. 100 searching candidates of each part are generated based on each searching seeds. During each iteration, the candidates with top 20 posterior marginal possibilities are preserved as the searching seeds for the next iteration. To compare with previous methods, we used the code of [12]. The images are cut out based on the bounding box (with some loose), and then are processed by the code. The correctness of each part in the testing sets of dataset 2,3,4 are listed in table 1. A part is correct if the distance between estimated joint points and ground truth is below 50% of the part length. From the result we can see that our method can achieve better correctness with much less time cost. Fig. 7 give some examples in dataset 2,3,4.
184
Y. Su et al. Table 1. Correctness of parts in dataset 2,3,4
part
dataset 2 Method Our in [12] method
dataset 3 Method Our in [12] method
Torso 91% 93% 81% Upper Arm 82% 83% 83% 85% 45% 51% Lower Arm 57% 59% 61% 64% 32% 31% Upper Leg 54% 61% Lower Leg 43% 45% Head 96% 95% 81% Total 78% 80% 52% Time cost 65.1s 5.3s 81.2s
84% 57% 59% 41% 37% 67% 63% 51% 50% 89% 59% 9.1s
dataset 3 Method Our in [12] method 84% 56% 59% 35% 38% 68% 59% 66% 51% 78% 59% 79.1s
87% 64% 62% 41% 43% 67% 68% 69% 60% 80% 64% 8.9s
Fig. 7. Examples in dataset 2,3,4
6
Conclusion
In this paper, we propose a fast and robust human estimation framework that combines the top-down example based method and bottom-up part based method. Example based method is first used to get initial guesses of the body pose and then a novel iterative BP inference schema is utilized to refined the estimated pose using a tree-structured kinematic model and discriminative part classifiers based on a rotation invariant EdgeField feature. The combination for top-down and bottom-up methods and the iterative inference schema enable us to achieve accurate pose estimation in acceptable time cost. Acknowledgement. This work is supported by National Science Foundation of China under grant No.61075026, and it is also supported by a grant from Omron Corporation.
Human Pose Estimation Using Exemplars and Part Based Refinement
185
References 1. Agarwal, A., Triggs, B.: Recovering 3d human pose from monocular images. PAMI 28(1), 44–58 (2006) 2. Bissacco, A., Yang, M., Soatto, S.: Fast human pose estimation using appearance and motion via multi-dimensional boosting regression. In: CVPR, vol. 13, pp. 234– 778 (2007) 3. Poppe, R.: Evaluating example-based pose estimation: Experiments on the humaneva sets. In: CVPR 2nd Workshop on EHuM2, vol. 14, pp. 234–778 (2007) 4. Poppe, R., Poel, M.: Comparison of silhouette shape descriptors for example-based human pose recovery. AFG (2006) 5. Mori, G., Malik, J.: Recovering 3d human body configurations using shape contexts. IEEE PAMI, 1052–1062 (2006) 6. Rogez, G., Rihan, J., Ramalingam, S., Orrite, C., Torr, P.H.S.: Randomized Trees for Human Pose Detection. In: CVPR (2008) 7. Shakhnarovich, G., Viola, P., Darrell, R.: Fast pose estimation with parametersensitive hashing. In: ICCV (2003) 8. Jiang, H., Martin, D.R.: Global pose estimation using non-tree models. In: CVPR (2008) 9. Sigal, L., Black, M.J.: Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In: CVPR (2006) 10. Sigal, L., Bhatia, S., Roth, S., et al.: Tracking Loose-limbed People. In: CVPR (2004) 11. Ramanan, D.: Learning to parse images of articulated bodies. In: NIPS (2006) 12. Andriluka, M., Roth, S., Schiele, B.: Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In: CVPR (2009) 13. Hill, A., Cootes, T.F., Taylor, C.J.: Active shape models and the shape approximation problem. In: 6th British Machine Vison Conference, pp. 157–166 (1995) 14. Lawrence, N.D.: Gaussian process latent variable models for visualization of high dimensional data. In: NIPS, vol. 16, pp. 329–336 (2004) 15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 16. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999) 17. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding Belief Propagation and its Generalizations. IJCAI (2001) 18. Xu, X., Huang, T.S.: Face Recognition with MRC-Boosting. In: ICCV (2005)
Full-Resolution Depth Map Estimation from an Aliased Plenoptic Light Field Tom E. Bishop and Paolo Favaro Department of Engineering and Physical Sciences Heriot-Watt University, Edinburgh, UK {t.e.bishop,p.favaro}@hw.ac.uk
Abstract. In this paper we show how to obtain full-resolution depth maps from a single image obtained from a plenoptic camera. Previous work showed that the estimation of a low-resolution depth map with a plenoptic camera differs substantially from that of a camera array and, in particular, requires appropriate depth-varying antialiasing filtering. In this paper we show a quite striking result: One can instead recover a depth map at the same full-resolution of the input data. We propose a novel algorithm which exploits a photoconsistency constraint specific to light fields captured with plenoptic cameras. Key to our approach is handling missing data in the photoconsistency constraint and the introduction of novel boundary conditions that impose texture consistency in the reconstructed full-resolution images. These ideas are combined with an efficient regularization scheme to give depth maps at a higher resolution than in any previous method. We provide results on both synthetic and real data.
1
Introduction
Recent work [1, 2] demonstrates that the depth of field of Plenoptic cameras can be extended beyond that of conventional cameras. One of the fundamental steps in achieving such result is to recover an accurate depth map of the scene. [3] introduces a method to estimate a low-resolution depth map by using a sophisticated antialiasing filtering scheme in the multiview stereo framework. In this paper, we show that the depth reconstruction can actually be obtained at the full-resolution of the sensor, i.e., we show that one can obtain a depth value at each pixel in the captured light field image. We will show that the fullresolution depth map contains details not achievable with simple interpolation of a low-resolution depth map. A fundamental ingredient in our method is the formulation of a novel photoconsistency constraint, specifically designed for a Plenoptic camera (we will also use the term light field (LF) camera as in [4]) imaging a Lambertian scene (see Fig. 1). We show that a point on a Lambertian object is typically imaged under several microlenses on a regular lattice of correspondences. The geometry of such lattice can be easily obtained by using a Gaussian optics model of the Plenoptic camera. We will see that this leads to the reconstruction of full-resolution views R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 186–200, 2011. c Springer-Verlag Berlin Heidelberg 2011
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
187
consisting of mosaiced tiles from the sampled LF. Notice that in the data that we have used in the experiments, the aperture of the main lens is a disc. This causes the square images under each microlens to be split into a disc containing valid measurements and 4 corners with missing data (see one microlens image in the right image in Fig. 1). We show that images with missing data can also be reconstructed quite well via interpolation and inpainting (see e.g. [5]). Such interpolation however is not necessary if one uses a square main lens aperture and sets the F-numbers of the microlenses and the main lens appropriately. More importantly, we will show that the reconstructed samples form tiled mosaics, and one can enforce consistency at tile edges by introducing additional boundary conditions constraints. Such constraints are fundamental in obtaining the full-resolution depth estimates. Surprisingly, no similar gain in depth resolution is possible with a conventional camera array. In a camera array the views are not aliased (as the CCD sensors have contiguous pixels), and applying any existing method on such data (including the one proposed here) only results in upsampling of the low-resolution depth map (which is just the resolution of one view, and not the total number of captured pixels). This highlights another novel advantage of LF cameras. 1.1
Prior Work and Contributions
The space of light rays in a scene that intersect a plane from different directions may be interpreted as a 4D light field (LF) [6], a concept that has been useful in refocusing, novel view synthesis, dealing with occlusions and non-Lambertian objects among other applications. Various novel camera designs that sample the LF have been proposed, including multiplexed coded aperture [7], Heterodyne mask-based cameras [8], and external arrangements of lenses and prisms [9]. However these systems have disadvantages including the inability to capture dynamic scenes [7], loss of light [8], or poor optical performance [9]. Using an array of cameras [10] overcomes these problems, at the expense of a bulky setup. Light Field, or Plenoptic, cameras [4, 11, 12], provide similar advantages, in more compact portable form, and without problems of synchronisation and calibration. LF cameras have been used for digital refocusing [13] capturing extended depth-of-field images, although only at a (low) resolution equal to the number of microlenses in the device. This has been addressed by super-resolution methods [1, 2, 14], making use of priors that exploit redundancies in the LF, along with models of the sampling process to formulate high resolution refocused images. The same sampling properties that enable substantial image resolution improvements also lead to problems in depth estimation from LF cameras. In [3], a method for estimating a depth map from a LF camera was presented. Differences in sampling patterns of light fields captured by camera arrays and LF cameras were analysed, revealing how spatial aliasing of the samples in the latter case cause problems with the use of traditional multi-view depth estimation methods. Simply put, the LF views (taking one sample per microlens) are highly undersampled for many depths, and thus in areas of high frequency texture, an error term based on matching views will fail completely. The solution proposed in [3]
188
T.E. Bishop and P. Favaro
Fig. 1. Example of LF photoconsistency. Left: A LF image (courtesy of [15]) with an enlarged region (marked in red) shown to the right. Right: The letter E on the book cover is imaged (rotated by 180◦ ) under nine microlenses. Due to Lambertianity each repetition has the same intensity. Notice that as depth changes, the photoconsistency constraint varies by increasing or decreasing the number of repetitions.
was to filter out the aliased signal component, and perform matching using the remaining valid low-pass part. The optimal filter is actually depth dependent, so an iterative scheme was used to update the filtering and the estimated depth map. Whilst this scheme yields an improvement in areas where strong aliasing corrupts the matching term, it actually throws away useful detail that could be used to improve matches if interpreted correctly. Concurrent with this work, [16] proposed estimation of depth maps from LF data using cross-correlation, however this also only gives low resolution results. In this paper we propose to perform matching directly on the sampled lightfield data rather than restricting the interpretation to matching views in the traditional multi-view stereo sense. In this way, the concept of view aliasing is not directly relevant and we bypass the need to antialias the data. A key benefit of the new formulation is that we compute depth maps at a higher resolution than the sampled views. In fact the amount of improvement attainable depends on the depth in the scene, similar to superresolution restoration of the image texture as in [2]. Others have considered super-resolution of surface texture along with 3D reconstruction [17] and depth map super-resolution [18], however in the context of multi-view stereo or image sequences, where the sampling issues of a LF camera do not apply. Specifically, our contributions are: 1. A novel method to estimate a depth value at each pixel of a single (LF) image; 2. Analysis of the subimage correspondences, and reformulation in terms of virtual full-resolution views, i.e., mosaicing, leading to a novel photoconsistency term for LF cameras; 3. A novel penalty term that enforces gradient consistency across tile boundaries of mosaiced full-resolution views.
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
2
189
Light Field Representation and Methodology Overview
We consider a plenoptic camera, obtained by adding a microlens array in front of a conventional camera’s sensor [1, 2, 4]. The geometry of this camera is almost identical to that of a camera array where each camera’s aperture corresponds to a microlens. However, the LF camera has an additional main lens in front of the array, and this dramatically affects the sampling of the LF. We will see in sec. 3 that the precise study of how the spatial and angular coordinates sample the light field is critical to the formulation of matching terms for depth estimation. Let us consider a version of the imaging system where the microlens apertures are small and behave as pinholes.1 In the LF camera, each microlens forms its own subimage Sc (θ) on the sensor, where c ∈ R2 identifies the center of a microlens and θ ∈ R2 identifies the local (angular) coordinates under such a microlens. In a real system the microlens centers are located at discrete positions ck , but to begin the analysis we generalize the definition to virtual microlenses that are free to be centered in the continuous coordinates c. The main lens maps an object in space to a conjugate object inside the camera. Then, pixels in each subimage see light from the conjugate object from different vantage points, and have a one-to-one map with points on the main lens (see Fig. 2). The collection of pixels, one per microlens, that map to the same point on the main lens (i.e., with the same local angle θ) form an image we call a view, denoted by Vθ (c) ≡ Sc (θ). The views and the subimages are two equivalent representations of the same quantity, the sampled light field. We assume the scene consists of Lambertian objects, so that we can represent the continuous light field with a function r(u) that is independent of viewing angle. In our model, r(u) is called the radiance image or scene texture, and relates to the LF via a reprojection of the coordinates u ∈ R2 at the microlens plane, through the main lens center into space. That is, r(u) is the full-resolution all-focused image that would be captured by a standard pinhole camera with the sensor placed at the microlens plane. This provides a common reference frame, independent of depth. We also define the depth map z(u) on this plane. 2.1
Virtual View Reconstructions
Due to undersampling, the plenoptic views are not suitable for interpolation. Instead we may interpolate within the angular coordinates of the subimages. In our proposed depth estimation method, we make use of the fact that in the Lambertian light field there are samples, indexed by n, in different subimages that can be matched to the same point in space. In fact, we will explain how several estimates rˆn (u) of the radiance can be obtained, one per set of samples with common n. We term these full resolution virtual views, because they make use of more samples than the regular views Vθ (c), and are formed from interpolation 1
Depending on the LF camera settings, this holds true for a large range of depth values. Moreover, we argue in sec. 4 that corresponding defocused points are still matched correctly and so the analysis holds in general for larger apertures.
190
T.E. Bishop and P. Favaro
r’
O
microlenses
r
z’
c1
θ
- v’θq v
c2
u
θq μ
d
θq θ1 θ2 θq
v’
cK
Main Lens
v
conjugate image v’ v
θ1
v’
- v’ vθ
λ(u)(c-u) u
z’ (u-c) u’ v’ O
Main Lens
z‘
θ0
c v’-z‘
θ0
v
θ Sensor
Fig. 2. Coordinates and sampling in a plenoptic camera. Left: In our model we begin with an image r at v , and scale it down to the conjugate image r at z , where O is the main lens optical center. Center: This image at r is then imaged behind each microlens; for instance, the blue rays show the samples obtained in the subimage under the central microlens. The central view is formed from the sensor pixels hit by the solid bold rays. Aliasing of the views is present: the spatial frequency in r is higher than the microlens pitch d, and clearly the adjacent view contains samples unrelated to the central view via interpolation. Right: Relation between points in r and corresponding points under two microlenses. A point u in r has a conjugate point (the triangle) at u , which is imaged by a microlens centered at an arbitrary center c, giving a projected point at angle θ. If we then consider a second microlens positioned at u, the image of the same point is located at the central view, θ = 0, on the line through O.
in the non-aliased angular coordinates. Similarly to [1], these virtual views may be reconstructed in a process equivalent to building a mosaic from parts of the subimages (see Fig. 3). Portions of each subimage are scaled by a magnification factor λ that is a function of the depth map, and these tiles are placed on a new grid to give a mosaiced virtual view for each n. As λ changes, the size of the portions used and the corresponding number of valid n are also affected. We will describe in detail how these quantities are related and how sampling and aliasing occurs in the LF camera in sec. 3, but firstly we describe the overall form of our depth estimation method, and how it uses the virtual views. 2.2
Depth Estimation via Energy Minimization
We use an energy minimization formulation to estimate the optimal depth map z = {z(u)} in the full-resolution coordinates u. The optimal z is found as the minimum of the energy E defined as: E1 (u, z(u)) + γ2 E2 (u, z(u)) + γ3 E3 (z(u)) (1) E(z) = u
where γ2 , γ3 > 0. This energy consists of three terms. Firstly, E1 is a data term, penalizing depth hypotheses where the corresponding points in the light field do not have matching intensities. For the plenoptic camera, the number of virtual views rˆn (u) depends on the depth hypothesis. We describe this term in sec. 4.
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
191
Fig. 3. Mosaiced full-resolution views generated from subimages. Each microlens subimage in a portion of the light field image shown on the left is partitioned into a grid of tiles. Rearranging the tiles located at the same offset from the microlens center into a single image produces mosaiced full-resolution views as shown in the middle. As we used a Plenoptic camera with a circular main lens aperture, some pixels are not usable. In the top view, the tiles are taken from the edge of each microlens and so there is missing data which is inpainted and shown on the image to the right. We use these full-resolution views to perform accurate data matching for depth reconstruction.
E2 is a new matching term we introduce in sec. 4.1, enforcing the restored views to have proper boundary conditions. It effectively penalizes mismatches at the mosaiced tile boundaries, by imposing smoothness of the reconstructed texture (that also depends on the depth map z). The third term, E3 , is a regularization term that enforces piecewise smoothness of z. Because the terms E1 and E2 only depend on z in a pointwise manner, they may be precomputed for a set of depth hypotheses. This enables an efficient numerical algorithm to minimize eq. (1), which we describe in sec. 4.3. Before detailing the implementation of these energy terms in sec. 4, we show in sec. 3 how such views are obtained from the sampled light field.
3
Obtaining Correspondences in Subimages and Views
We summarise here the image formation process in the plenoptic camera, and the relations between the radiance, subimages and views. We consider first a continuous model and then discretise, explaining why aliasing occurs. In our model, we work entirely inside the camera, beginning with the full-resolution radiance image r(u), and describe the mapping to the views Vθ (c) or microlens subimages Sc (θ). This mapping is done in two steps: from r(u) to the conjugate object (or image) at z , which we denote r , and then from r to the sensor. This process is shown in Fig. 2. There is a set of coordinates {θ, c} that corresponds to the same u, and we will now examine this relation. We see in Fig. 2 that the projection of r(u) to r (u ) through the main lens u , where v is the main lens to microlens array center O results in u = z v(u) distance. This is then imaged onto the sensor at θ(u, c) through the microlens
192
T.E. Bishop and P. Favaro
at c with a scaling of − v −zv (u) . Notice that θ(u, c) is defined as the projection of u through c onto the sensor relative to the projection of O through c onto the sensor; hence, we have z (u) v (c − u) v − z (u) v = λ(u)(c − u),
θ(u, c) =
(2)
where we defined the magnification factor λ(u) = v −zv (u) z v(u) , representing the signed scaling between the regular camera image r(u) that would form at the microlens array, and the actual subimage that forms under a microlens. We can then use (2) to obtain the following relations: r(u) = Sc (θ(u, c)) = Sc λ(u)(c − u) , v D (3) ∀c s.t. |θ(u, c)| < v 2 = Vλ(u)(c−u) (c), θ v D r(u) = Vθ (c(θ, u)) = Vθ u + (4) ∀ |θ| < . λ(u) v 2
The constraints on c or θ mean only microlenses whose projection − vv θ is within the main lens aperture, of radius D 2 , will actually image the point u. Equations (3) and (4) show respectively how to find values corresponding to r(u) in the light field for a particular choice of either c or θ, by selecting the other variable appropriately. This does not pose a problem if both c and θ are continuous, however in the LF camera this is not the case and we must consider the implications of interpolating sampled subimages at θ = λ(u)(c − u) or θ . sampled views at c = u + λ(u) 3.1
Brief Discussion on Sampling and Aliasing in Plenoptic Cameras
In a LF camera with microlens spacing d, only a discrete set of samples in each . view is available, corresponding to the microlens centers at positions c = ck = dk, where k indexes a microlens. Also, the pixels (of spacing μ) in each subimage . sample the possible views at angles θ = θq = μq, where q is the pixel index; these coordinates are local to each microlens, with θ0 the projection of the main lens center through each microlens onto the sensor. Therefore, we define the discrete . observed view Vˆq (k) = Vθq (ck ) at angle θq as the image given by the samples . for each k. We can also denote the sampled subimages as Sˆk (q) = Vθq (ck ). For a camera array, the spatial samples are not aliased, and a virtual view Vθ (c) may be reconstructed accurately, which allows for the sub-pixel matching used in regular multi-view stereo. However, in the LF camera, there is clearly a large jump between neighboring samples from the same view in the conjugate image (see Fig. 2). Even when the microlenses have apertures sized equal to their spacing, as described in [3], the resulting microlens blur size changes with
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
193
depth. Equivalently there is a depth range where the integration region size in the conjugate image is less than the sample spacing. This means that the sampled views Vˆq (k) may be severely aliased, depending on depth. As a result, simple interpolation of Vˆq (k) in the spatial coordinates k is not enough to reconstruct an arbitrary virtual view Vθ (c) at c = ck . 3.2
Matching Sampled Views and Mosaicing
In an LF camera we can reconstruct Vθ (c) and avoid aliasing in the sampled views by interpolating instead along the angular coordinates of the subimages, and using the spatial samples only at known k. Using eq. (3), we can find the estimates rˆ(u) of the texture by interpolating as follows: r(u) = Vλ(u)(ck −u) (ck ) ⇒ rˆ(u) = Vˆ λ(u) (k), μ
(ck −u)
∀k s.t. |λ(u)(ck − u)| <
v D v 2
(5)
. Let us now expand ck as u + Δu + nd, where Δu = ck0 − u is the vector from . u to the closest microlens center ck0 = u − mod (u, d). We can now enumerate the different possible reconstructed images rˆn (u) for all valid choices of n ∈ Z2 that lead to the condition on ck being satisfied. Then, we find rˆn (u) = Vˆ λ(u) (Δu+nd) k0 + n , μ
∀n s.t. |λ(u)(Δu + nd)| <
v D v 2
(6)
The reconstructed virtual views rˆn (u) for different n should match at position u, if we have made the right hypothesis for the depth z(u) and hence λ(u) (notice that given λ(u) one can immediately obtain z(u) and vice versa). This interpretation is similar to that of the method used in [1], except that rather than averaging together the full-resolution mosaics rˆn (u), we keep them separated.
4
Energy Terms for Full-Resolution Depth Maps
We now describe the energy terms in sec. 2.2. Firstly, we define the matching term E1 that penalizes differences between the different virtual views2 rˆn (u) as E1 u, z(u) =
1 W (u)
rˆn (u) − rˆn (u).
∀n=n n,n ∈Z2 n s.t. |λ(u)(Δu+nd)|< vv
(7)
D 2 D 2
n s.t. |λ(u)(Δu+n d)|< vv
W (u) is a normalization term, counting the number of valid pairs in the sum. The number of possible n, and hence number of virtual views, changes based on the magnification factor λ, according to the depth hypothesis. Therefore 2
Interestingly, unlike matching low resolution views in multi-view stereo, the virtual views are already aligned to each other in regions with the correct depth assumption.
194
T.E. Bishop and P. Favaro x border gradients:
pixels used for x-gradients:
xx
xy 2 high-res views (wrong depth assumption)
partial borders interpolated
borders propagated via linear interpolation
cell boundary
Fig. 4. Left: The energy term E2 penalizes unnatural discontinuities in the reconstructed full-resolution views, at the mosaic cell boundaries. Interpolation is used to complete partial views, and interior pixels from the borders. Right: The x-border gradients at the red pixels use the supports shown, using the difference between the pairs of gradients given between the yellow pairs.
W (u) varies with both position and depth hypothesis. Also, when using a Plenoptic camera with a circular main lens aperture, we need to deal with missing data. In this case, parts of some of the views will be missing where tiles were taken from the borders of a microlens (see Fig. 3). For this reason, we consider using inpainting (we use a PDE implementation in MATLAB3 ) to slightly reduce the amount of missing data, i.e., to extend the field of view of each microlens based on extrapolating the data contained in neighboring lenses. Furthermore, rather than comparing all views with each other as in eq. (7), we have found that using a penalty in E1 that compares each mosaic rˆn (u) (at valid locations) with the median obtained by inpainting each rˆn (u) can be more robust, and also considerably faster for large W . We have made the assumption that microlens blur is insignificant at working depths. If blur is present, it will be the same amount for each point being matched, and will not cause any difference for reasonably planar surfaces. In the case of a depth transition, highly textured surfaces will appear slightly larger than they really are, due to matching occurring at all points within the true point blur disc. Ideally, we could also deconvolve each mosaic independently before matching, however, this is beyond the scope of this paper. 4.1
Mosaiced Boundary Condition Consistency
Given the full-resolution mosaiced views, rˆn (u), we may also establish a selfconsistency criterion in E2 . This term says that each reconstructed rˆn (u) should have consistent boundary conditions across blocks within each view. That is, the mosaicing process should not introduce unnatural jumps if the depth is estimated 3
http://www.mathworks.com/matlabcentral/fileexchange/4551
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
195
correctly. Therefore we expect the x- and y-derivatives to be mostly smooth across boundaries (assuming the probability of an image edge coinciding with a mosaic tile along all its edges to be low), and set E2 z(u) = |∇xx rˆn (u)| + |∇xy rˆn (u)| + |∇yx rˆn (u)| + |∇yy rˆn (u)|, n∈Z2
n∈Z2
|θck (n) (u)|< vv D2
|θck (n) (u)|< vv D2
u∈Rboundary-x
u∈Rboundary-y
(8) where ∇ab denotes the 2nd order derivative with respect to the coordinates a and b, Rboundary-x includes the pixels on the left and right tile boundaries, and Rboundary-y the top and bottom boundaries (only non-inpainted regions). The restriction of this constraint to the boundary of the tiles is due to eq. (6). We observe that for a fixed n, eq. (6) tells us that, as we vary u, several consecutive samples are taken from the same microlens subimage at k0 , but there is a discontinuity present in the reconstruction as u moves past a microlens boundary and k0 then jumps to a different microlens k. Changing n will choose a different offset portion of each subimage to reconstruct (effectively a full-resolution virtual view, centered at an angle qn = dλ(u) μ n ). Within each tile the data is continuous and just comes from a scaled version of the same subimage and so we should not expect to penalize differently for a different scaling/depth. / Rboundary via bilinear interpolation We can instead define E2 (z(u)) for u ∈ of the values at the boundary. While this term does not help super-resolving the depth map, it helps enforcing its overall consistency. 4.2
Regularization of the Depth Map
Due to the ill-posedness of the depth estimation problem, we introduce regularization. Our focus here is not on the choice of prior, so for simplicity we use a standard total variation term (∇ denotes the 2D gradient with respect to u): E3 (z(u)) = ∇z(u)1 . 4.3
(9)
Numerical Implementation
At this point we have all the necessary ingredients to work on the energy introduced in eq. (1). In the following we use the notation introduced earlier on where the depth map z is discretized as a vector z = {z(u)}. Then, we pose the depth estimation problem as the following minimization zˆ = arg min E(z) = arg min Edata (z) + γ3 E3 (z) z
z
(10)
where we combine E1 and E2 in the data term Edata = E1 + γ2 E2 ; in our experiments the tradeoff between E3 and Edata is fixed by γ2 = 1 and γ3 = 10−3 . We then minimize this energy by using an iterative solution. We first notice that the term Edata can be written as a sum of terms that depend on a single entry of z at a time. Hence, we find a first approximate solution by
196
T.E. Bishop and P. Favaro
ignoring the regularization term and performing a fast brute force search for each entry of z independently. This procedure yields the initial solution z0 . Then, we approximate Edata via a Taylor expansion up to the second order, i.e., 1 Edata (zt+1 ) Edata (zt )+∇Edata (zt )(zt+1 −zt )+ (zt+1 −zt )T HEdata (zt )(zt+1 −zt ) 2 (11)
where ∇Edata is the gradient of Edata , HEdata is the Hessian of Edata , and zt and zt+1 are the solutions at iteration t and t+1 respectively. To ensure that our local approximation is convex we let HEdata be diagonal and each element be positive by taking its absolute value (component-wise) |HEdata (zt )|. In the case of the regularization energy we use a Taylor expansion of its gradient up to the linear term. When we compute the Euler-Lagrange equations of the approximate energy E with respect to zt+1 this linearization results in ∇Edata (zt ) + |HEdata (zt )|(zt+1 − zt ) − γ3 ∇ ·
∇(zt+1 − zt ) =0 |∇zt |
(12)
which is a linear system in the unknown zt+1 , and can be solved efficiently via the Conjugate Gradient method.
5
Experiments
We begin by testing the performance of the method on artificially generated light fields, before considering data from a real LF camera. Synthetic Data. We simulated data from a LF camera, with main lens of focal length F = 80mm focused at 0.635m, with parameters d = 0.135mm, μ = 0.09mm, v = 0.5mm, Q = 15 and microlens focal length f = 0.5mm. To compare with the antialiasing method of [3], we estimate the view disparity . μ , in pixels per view, for a set of textured planes at known depths (the s(u) = dλ(u) disparity between the views is zero for the main lens plane in focus, and increases with depth). We used the Bark texture from the Brodatz texture dataset, and in each case just used the energy without regularization. We plot the true value of s for each plane against the mean value from the depth maps found by both the method of [3] and the proposed method, along with error bars in Fig. 5. Clearly in the case of no noise the proposed algorithm performs a lot more reliably with a smaller variance of the error, although with increasing noise the benefit over the method of [3] reduces. Both methods perform poorly around the main lens plane in focus: in the case of the method of [3], the disparity becomes smaller than the interpolation accuracy, whereas the proposed method begins to fail as the magnification factor tends towards 1, and there are not enough virtual views to form the median. This is not surprising as it relates to the undersampling of the space by the LF camera around these depths. Real Data. We obtained datasets from a portable LF camera with the same system parameters as above. The scene range was 0.7–0.9m, corresponding to disparities of 0.16–0.52 pixels per view. One of the low resolution views is shown,
0.6
0.6
0.5
0.5
0.4
0.4
estimated disparity
estimated disparity
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
0.3 0.2 0.1
0.3 0.2 0.1
Antialiasing method (Bishop et al.) Proposed method true s
0 ï0.1
197
0.1
0.2
0.3
0.4
true disparity, s, in pixels per view
0.5
0.6
Antialiasing method (Bishop et al.) Proposed method true s
0 ï0.1
0.1
0.2
0.3
0.4
0.5
0.6
true disparity, s, in pixels per view
Fig. 5. Results on synthetic data. We plot the estimated disparity from a textured plane versus the true disparity s (which depends on the depth of the plane) for our method and the method from [3]. Error bars shown at 3 standard deviations. Left: data with no noise. Right: Data with noise standard deviation 2.5% of signal range.
Fig. 6. Results on real data. Top row, left to right: One low resolution view from the light field; One full-resolution virtual view, with depth hypothesis in the middle of the scene range (note blocking artefacts); Low resolution depth map using multiview stereo method [3]; Bottom row: Unregularized full-resolution depth map (note that the textureless areas shown in red do not yield a depth estimate); After regularization; High resolution all-in-focus image obtained using the estimated depth map.
198
T.E. Bishop and P. Favaro
Fig. 7. Results on real data. From left to right: One low resolution view from the light field; Regularized depth estimate; High resolution image estimate from the light field, using estimated depth map. Source LF data in bottom row courtesty of [16].
along with a full-resolution virtual view used in the algorithm, and the estimated depthmap before and after regularization in Fig. 6 and Fig. 7. We also applied the method to data from [16] in the bottom row of Fig. 7. In each case, before applying regularization, we automatically select regions where the confidence is too low due to lack of texture (the energy has very small values for all depths at these pixels, but is estimated very reliably elsewhere). For this size image (2100× 1500 pixel virtual views in the case of Fig. 6), our MATLAB implementation takes about half an hour to estimate E1 and E2 , using up to n = 37 virtual views at the maximum depths, although solving eq. (12) only takes a couple of minutes. Notice the main burden is not in the regularisation scheme but simple interpolation and difference operations between a large number of views (e.g. 40*37=1480 views if using n = 37 and 40 depth levels), which could be done at least an order of magnitude faster using, for example, GPU acceleration.
6
Conclusions
We have demonstrated how depth can be estimated at each pixel of a single (LF) image. This is an unprecedented result that highlights a novel quality of Plenoptic cameras. We have analysed the sampling tradeoffs in LF cameras and shown
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
199
how to obtain valid photoconsistency constraints, which are unlike those for a regular multi-view matching. We also presented a novel boundary consistency constraint in reconstructed full-resolution views to penalize incorrect depths, and proposed a novel algorithm to infer full-resolution depth maps using these terms, with efficient regularization. Acknowledgement. This work has been supported in part by EPSRC grant EP/F023073/1(P), grant EP/I501126/1, and SELEX/HWU/2010/SOW3.
References 1. Lumsdaine, A., Georgiev, T.: The focused plenoptic cameras. In: ICCP 2009 (IEEE International Conference on Computational Photography) (2009) 2. Bishop, T.E., Zanetti, S., Favaro, P.: Light field superresolution. In: ICCP 2009 (IEEE Int. Conference on Computational Photography) (2009) 3. Bishop, T.E., Favaro, P.: Plenoptic depth estimation from multiple aliased views. In: ICCV 2009 Workshops: 3DIM (The 2009 IEEE International Workshop on 3-D Digital Imaging and Modeling), Kyoto, Japan (2009) 4. Ng, R., Levoy, M., Br´edif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photography with a hand-held plenoptic camera. Technical Report CSTR 2005-02, Stanford University (2005) 5. Weickert, J., Welk, M.: Tensor field interpolation with pdes. In: Weickert, J., Hagen, H. (eds.) Visualization and Processing of Tensor Fields, pp. 315–325. Springer, Berlin (2006) 6. Levoy, M., Hanrahan, P.: Light field rendering. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1996, pp. 31–42. ACM Press, New York (1996) 7. Liang, C.K., Lin, T.H., Wong, B.Y., Liu, C., Chen, H.H.: Programmable aperture photography: Multiplexed light field acquisition. ACM Transactions on Graphics (Proc. SIGGRAPH 2008) 27 (2008) 8. Veeraraghavan, A., Raskar, R., Agrawal, A.K., Mohan, A., Tumblin, J.: Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Trans. Graph (Proc. SIGGRAPH 2007) 26, 69 (2007) 9. Georgeiv, T., Zheng, K.C., Curless, B., Salesin, D., Nayar, S., Intwala, C.: Spatioangular resolution tradeoff in integral photography. In: Akenine-Mller, T., Heidrich, W. (eds.) Eurographics Symposium on Rendering (2006), Adobe Systems, University of Washington, Columbia University (2006) 10. Vaish, V., Levoy, M., Szeliski, R., Zitnick, C., Kang, S.B.: Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures. In: Proc. CVPR 2006, vol. 2, pp. 2331–2338 (2006) 11. Adelson, E.H., Wang, J.Y.: Single lens stereo with a plenoptic cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 99–106 (1992) 12. Fife, K., El Gamal, A., Wong, H.S.: A 3d multi-aperture image sensor architecture. In: Custom Integrated Circuits Conference, CICC 2006, pp. 281–284. IEEE, Los Alamitos (2006) 13. Ng, R.: Fourier slice photography. ACM Trans. Graph. 24, 735–744 (2005) 14. Levin, A., Freeman, W.T., Durand, F.: Understanding camera trade-offs through a bayesian analysis of light field projections. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 88–101. Springer, Heidelberg (2008)
200
T.E. Bishop and P. Favaro
15. Georgiev, T., Lumsdaine, A.: Depth of field in plenoptic cameras. In: Alliez, P., Magnor, M. (eds.) Eurographics 2009 (2009) 16. Georgiev, T., Lumsdaine, A.: Reducing plenoptic camera artifacts. Computer Graphics Forum 29, 1955–1968 (2010) 17. Goldluecke, B., Cremers, D.: A superresolution framework for high-accuracy multiview reconstruction. In: IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, pp. 342–351 (2009) 18. Li, F., Yu, J., Chai, J.: A hybrid camera for motion deblurring and depth map super-resolution. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8 (2008)
Indoor Scene Classification Using Combined 3D and Gist Features Agnes Swadzba and Sven Wachsmuth Applied Informatics, Bielefeld University, Germany {aswadzba,swachsmu}@techfak.uni-bielefeld.de
Abstract. Scene categorization is an important mechanism for providing high-level context which can guide methods for a more detailed analysis of scenes. State-of-the-art techniques like Torralba’s Gist features show a good performance on categorizing outdoor scenes but have problems in categorizing indoor scenes. In contrast to object based approaches, we propose a 3D feature vector capturing general properties of the spatial layout of indoor scenes like shape and size of extracted planar patches and their orientation to each other. This idea is supported by psychological experiments which give evidence for the special role of 3D geometry in categorizing indoor scenes. In order to study the influence of the 3D geometry we introduce in this paper a novel 3D indoor database and a method for defining 3D features on planar surfaces extracted in 3D data. Additionally, we propose a voting technique to fuse 3D features and 2D Gist features and show in our experiments a significant contribution of the 3D features to the indoor scene categorization task.
1
Introduction
An important ability for an agent (both human and robot) is to build an instantaneous concept of the surrounding environment. Such holistic concepts activate top-down knowledge that guides the visual analysis in further scene interpretation tasks. For this purpose, Oliva et al. [1] introduced the so-called Gist features that represent the spatial envelope of a scene given as a 2D image. Based on this general idea a variety of scene recognition approaches have been proposed that work well for outdoor scenes – like “buildings”, “street”, “mountains”, etc. – but frequently break down for indoor scenes, e.g. “kitchen”, “living room”, etc.
Fig. 1. Example room from 3D database, corresponding 3D data for defining 3D features, and computed 2D Gist image R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 201–215, 2011. Springer-Verlag Berlin Heidelberg 2011
202
A. Swadzba and S. Wachsmuth
Recently, there are some approaches, e.g., [2], that try to combine global Gist information with local object information to achieve a better performance on different indoor categories. Although, they achieve significant improvement, the training of such methods relies on a previous hand labeling of relevant regions of interest. Psychological evidence even suggests that objects do not contribute to the low-level visual process responsible for determining the room category. In early experiments, Brewer and Treyen provided hints that with the perception of room schemata objects (e.g., books) were memorized to be in the experimental room (e.g., office) though they were not really present. Henderson et al. [3, 4] even found through their fMRI studies brain areas that are sensitive to 3D geometric structures enabling a distinction of indoor scenes from other complex stimuli but do not show any activation on close-up views of scene-relevant objects, e.g., a kitchen oven. This emphasizes the contribution of the scene geometry to the ability of categorizing indoor scenes. Currently, most approaches to scene recognition are based on color and texture statistics of 2D images neglecting the 3D geometric structure of scenes. Therefore, our goal is to capture the 3D spatial structure of scene categories (see Fig. 1 for impression of used data) in a holistically defined feature vector. More precisely, as man-made environments mainly consist of planar surfaces we define appropriate statistics on these spatial structures capturing their shape and size and their orientations to each other. To analyze their performance on the indoor scene categorization problem a new dataset is introduced consisting of six different room categories and 28 individual rooms that have been recorded with a Time-of-Flight (ToF) sensor in order to provide 3D data. The classification results are contrasted to results achieved with 2D Gist features. Further, possible combination schemes are discussed. In the following, the paper discusses related work, describes the 3D indoor database and the computation of our 3D scene descriptor and combination schemes with 2D features. Finally, evaluation results are discussed.
2
Related Work
As navigation and localization was one of the first tasks exhaustively investigated in robotics research some interesting approaches to solve the problem of recognizing known rooms and categorizing unknown rooms can be found. Known rooms can be recognized by invariant features in 360◦ laser scans [5], by comparing current 2D features to saved views [6], and by recognizing specific objects and their configurations [7]. Burgard’s group has examined a huge bunch of different approaches to the more general place categorization problem ranging from the detection of concepts like “room”, “hall”, and “corridor” by simple geometric features defined on a 2D laser scan [8] to subconcepts like “kitchen” [9] using ontologies encoding the relationship between objects and the subconcepts [10]. Alternatively, occurrence statistics of objects in place categories [11, 12] and interdependencies among objects and locations [13] have been tried out. Parallel to the described approaches, Torralba [14] has developed a lowlevel global image representation by wavelet image decomposition that provides
Indoor Scene Classification Using Combined 3D and Gist Features
203
relevant information for place categorization. These so-called Gist features perform well on a lot of outdoor categories like “building”, “street”, “forest”, etc. but have problems to categorize indoor scenes reliably. Therefore, they have extended their approach by local discriminative information [2]. They learn prototypes from hand-labeled images, which define the mapping between images and scene labels. Similar to those prototypes, Li and Perona represent an image by a collection of local regions denoted as codewords which they obtain by unsupervised learning [15]. Similar approaches based on computing visual or semantic typicality vocabularies and measuring per image the frequency of these words or concepts are introduced by Posner [16], Bosch [17], and Vogel [18]. A local features based technique used in this paper for comparison is proposed by Lazebnik et. al [19]. They partition an image into increasingly fine sub-regions, the so-called spatial pyramid, and represent these regions through SIFT descriptors. In [6] the application of such local features for place categorization on a mobile robot in a real environment is tested with moderate success. So far, to the best of our knowledge, 3D information only has been used for detecting objects [20–22] or classifying each 3D point into given classes which could be “vegetation”, “facade”, ... [23], “chair”, “table”, ... [24], or “plane”, “sphere”, “cylinder”, ... [25]. The introduced features are of local nature since each 3D point is represented by a feature vector computed from the point’s neighborhood. They are not applicable as we aim for classifying a set of points as a whole.
3
The 3D Indoor Database
In order to systematically study room categorization based on the 3D geometry of rooms a sufficiently robust method for extracting 3D data is necessary. Even though recent papers show an impressive performance in extracting the 3D room frame [26, 27] or depth-ordered planes from a single 2D image [28] they are at the moment not precise enough in computing all spatial structures given by, e.g., the furniture, on the desired level of detail. Therefore, existing 2D indoor databases cannot be utilized. Recording rooms in 3D could be done in principle by stereo camera systems, laser range finders, or Time-of-Flight (ToF) cameras. As room structures are often homogeneous in color and less textured stereo cameras will not provide enough data. In order to acquire instantaneously dense 3D point clouds from a number of rooms laser range finders are unhandy and not fast enough. Thus, we use ToF cameras to acquire appropriate data but our approach is not limited to this kind of data. The Swissranger SR3100 from Mesa Imaging1 is a near-infrared sensor delivering in real-time a dense depth map of 176 × 144 pixels resolution [29]. The depth is estimated per pixel from measuring the time difference between the signal sent and received. Besides the 3D point cloud (→ Fig. 3b), the camera delivers a 2D amplitude image (→ Fig. 3a) which looks similar to a normal gray-scale image but encodes for each pixel the amount of reflected infrared light. The noise ratio of the sensor depends on the reflection 1
http://www.mesa-imaging.ch
204
A. Swadzba and S. Wachsmuth
bath.1
bath.2
bath.3
bed.1
bed.2
bed.3
bed.4
eat.4
eat.5
eat.6
kit.1
eat.1
eat.2
kit.2
kit.3
kit.4
kit.5
kit.6
kit.7
liv.1
liv.2
liv.3
liv.4
liv.5
off.1
off.2
off.3
eat.3
Fig. 2. Photos of 28 different rooms scanned in 3D by a Swissranger SR3100. The images were taken for visualization at the position of the 3D camera. The database contains 3 bathrooms, 4 bedrooms, 6 eating places, 7 kitchens, 5 livingrooms, and 3 offices.
properties of the materials contained in the scene. Points measured from surfaces reflecting infrared light in a diffuse manner have a deviation of 1 to 3 cm. For acquiring many different arranged rooms per category, we recorded data in a regular IKEA home-center2 . Its exhibition is ideal for our purpose because it is assembled by 3D boxes representing single rooms. Each box is furnished differently. We placed the 3D camera at an arbitrary position of the open side and scanned the room for 20 to 30 seconds while the camera is continuously moved by round about 40◦ left/right and 10◦ up/down. This simulates a robot moving its head for perceiving the entire scene. We acquired data for 3 bathrooms, 4 bedrooms, 6 eating places, 7 kitchens, 5 livingrooms, and 3 offices3 . Fig. 2 shows photos of the scanned rooms taken at the position of the 3D camera.
4
Learning a Holistic Scene Representation
This section presents the computation of our novel 3D scene descriptor and the 2D Gist descriptor from a Swissranger frame F . Further, details are given on learning of room models from these feature types and the combination of classification results from both features and several consecutive frames to achieve a robust scene categorization on data of a so far unseen room. Scene Features From 3D Data. As man-made environments mostly consist of flat surfaces a simplified representation of the scene is generated by extracting bounded planar surfaces from the 3D point cloud of a frame. Noisy data is 2 3
http://www.ikea.com/us/en/ http://aiweb.techfak.uni-bielefeld.de/user/aswadzba
Indoor Scene Classification Using Combined 3D and Gist Features
(a) liv.5: frame 162
(b)
205
(c)
Fig. 3. (a): Swissranger amplitude image; (b): corresponding 3D point cloud with extracted planar surfaces (highlighted by color), an angle between two planes; a box enclosing a plane with its edges parallel to the two largest variance directions, and a plane transformed so that the vectors indicating the two largest variance directions in the data are parallel to the coordinate axis; (c): the responses of the Gabor filters applied to the amplitude image are averaged to one image visualizing the information encoded by the Gist feature vector.
smoothed by applying median filters to the depth map of each frame. Artifacts at jumping edges (which is a common problem in actively sensing sensors) are removed by declaring points as invalid if they lie on edges in the depth image. A normal vector ni is computed for each valid 3D point pi via Principal Component Analysis (PCA) of points in the neighborhood N5×5 in the Swissranger image plane. The decomposition of the 3D point cloud into connected planar regions is done via region growing as proposed in [30] based on the conormality measurement defined in [31]. Iteratively, points are selected randomly as seed of a region and extended with points of its N3×3 neighborhood if the points are valid and conormal. Two points p1 and p2 are conormal, when their normals n1 and n2 hold:
α = (n1 , n2 )
arccos(n1 · n2 ) π − arccos(n1 · n2 )
: arccos(n1 · n2 ) <= : else
π 2
with α < θα
(1)
pi Here, we have chosen θα = 18 . The resulting regions are refined by some runs of the RANSAC algorithm [32]. Fig. 3b displays for a frame F the resulting set of planes {Pj |nj · p − dj = 0}j=1...m where nj is computed through PCA of points {pi } assigned to Pj and dj by solving the plane equation for the centroid of these plane points. In the following, the computation of a 3D feature vector x3D based on the extracted planar patches in F is described. This vector encodes the 3D geometry of the perceived scene independent from colors and textures, the view point, and the camera orientation. For m planes {Pj }j=1,...,m in frame F shape and size characteristics and for all plane pairs {(Pk , Pl )}k,l∈(1...m),k=l angle and size ratio characteristics are computed. For each plane characteristic, cs , cA , c÷ , and c , the resulting values are binned into a separated histogram which is normalized to length 1 through dividing the histogram by the number of planes or plane pairs, respectively. Then, the resulting four vectors are concatenated to one feature vector encoding general spatial properties of the scene like the orientation of the patches to each other, their shapes, and their size characteristics:
206
A. Swadzba and S. Wachsmuth
x3D = x , xs , xA , x÷ , with min(a, b) 1 cs = → xs = · Hs ({csj }), j = 1 . . . m, max(a, b) m 1 cA = a · b → x A = · HA ({cA j = 1 . . . m, j }), m A min(cA 2 k , cl ) c÷ · H÷ ({c÷ k, l = 1 . . . m, → x÷ = kl = kl }), A m(m − 1) max(ck , cA ) l 2 c · H ({c → x = k, l = 1 . . . m. kl = (nk , nl ) kl }), m(m − 1)
(2) (3) (4) (5) (6)
Shape characteristic cs . Each planar patch P is spanned by the points assigned to this patch during plane extraction. PCA of these points provides three vectors a, b, n with a indicating the direction of the largest variance in the data and b the direction orthogonal to a with the second largest variance. n gives the normal vector of the planar patch. Fig. 3b shows a planar patch P transformed so that a and b are parallel to the coordinate axes. A minimum bounding box is computed with its orientation parallel to a and b including all plane points. The height of the box is assumed to be a and the width b. As can be seen in Eq. 3 the feature vector xs is a normalized histogram Hs over {csj } with h = 5 bins in the interval ]0, 1[ and a bin width of Δh = 0.2. It encodes the distribution of shapes which range from elongated (cs near 0) to squarish (cs near 1). Size characteristic cA . Considering the minimum bounding box of P the covered area can be easily computed. The normalized histogram HA (→ Eq. 4) consists of h = 6 bins with the boundaries [0; (25cm)2 [, [(25cm)2 ; (50cm)2 [, [(50cm)2 ; (100cm)2 [, [(100cm)2 ; (200cm)2 [, [(200cm)2 ; (300cm)2 [, and [(300cm)2 ; ∞[. Size ratio characteristic c÷ . For a plane pair (Pk , Pl ) the size ratio denotes with a value near 0 that two planes significantly differ in their size whereas the planes cover nearly the same area if the value is near 1. The histogram H÷ is designed equally to histogram Hs (h = 5, Δh = 0.2). Assuming m(m−1) pairs of planes 2 the feature vector is computed using Eq. 5. Angle characteristic c . Using Eq. 1, the acute angle between two planes (Pk , Pl ) can be computed using the planes’ π normals. Eq. 6 gives the corresponding histogram H (h = 9, Δh = 18 , [0, π2 ]). Scene Features From 2D Data. Due to the fact that the ToF camera delivers additionally to the 3D point cloud an amplitude image which can be treated like a normal grayscale image we are able to compute Torralba’s Gist features4 . As the feature vector x3D captures information about the arrangement of planar patches in the scene, xGist encodes additional global scene information. These Gist features are texture features computed from a wavelet image decomposition [14]. Each image location is represented by the output of filters tuned to different orientations and scales. Here, we used 8 orientations and 4 scales applied to a 144 × 144 clipping of the 176 × 144 amplitude image bilinear resized 4
http://people.csail.mit.edu/torralba/code/spatialenvelope
Indoor Scene Classification Using Combined 3D and Gist Features
207
to 256 × 256. The resulting representation is downsampled to 4 × 4 pixels. Thus, the dimensionality of xGist is 8 × 4 × 16 = 512. Fig. 3c shows the average of filter t responses on the amplitude image. Since the depth values of the Swissranger are organized in a regular 2D grid this so-called depth image can also be used to compute in the same way a so-called Depth-Gist feature vector, xDGist . It is an alternative approach to compute features from 3D data. Training Room Models and Combining Single Classifications. In the learning stage of our system, we decided to learn a discriminant model for each class which can be used to compute the distance of a feature vector to each class boundary. As we use Support Vector Machines (SVM) the class boundaries are defined by the learned support vectors. The learning of a class model is realized by a one-against-all learning technique where feature vectors which are in-class form the positive samples. The same amount of negative examples is sampled uniformly from all other classes. Formally written, for our set of six classes Ω ={bath., bed., eat., kit., liv., off.} we learn a set of discriminant functions G(x) which map the n-dimensional feature vector x on a |Ω|-dimensional distance vector d with di denoting the distance to the class boundary of ωi [33]: (7) G(x) : x → d = d1 , . . . , d6 = di = gi (x) | gi : n → i=1...6 , E d : d → e = e1 , . . . , e6 = 0, ej = 1, 0 with dj = max{di }i=1,...,6(8) . The classification is then realized by our decision function E(.) which maps the distance vector d to a binary vector e where the j-th component is equal 1 if dj = max{. . . , di , . . .} and all other components are set to 0. In our scenario a sequence of frames will be recorded from an unknown room while tilting and panning the camera. A classification result for a single frame might not be very reliable as the view field of the camera is limited so that only a small part of the room can be seen per frame. By considering the classification results of a set of consecutive frames {Ft }t∈Δt , the decision is becoming more reliable. In the following, we are going to present two different fusion schemes, V1×E and V2×E . Voting Scheme V2×E . It relies on the idea to perform a classification decision on the frame level, to sum decisions within the window Δt weighted by their reliability, and to make a second decision on the resulting sum: ¨ Δt = ¨ Δt , d ¨Δt = E d e (9) A dt · E dt , dt = G xt A d = l dj1 − l dj2 , dj1 = max{di },
dj2
t∈Δt
1 . 1 + e−x = max{di \dj1 } l(x) =
(10) (11)
Eq. 11 determines the reliability of a decision by the difference between the value of the best and the value of the second best class. The bigger the difference the more reliable is the decision. For a plausible comparison of different classification results the distances to the class boundaries have to be normalized. For this
208
A. Swadzba and S. Wachsmuth
purpose we have chosen the strictly monotic logistic function l(.) which maps the distances on values in ]0, 1[ resulting in the weighting function A(.) (→ Eq. 10). Voting Scheme V1×E . Making a winner-takes-all class decision based on one frame might be vulnerable to noise. Therefore, an alternative approach is to skip the decision on the frame level. Instead, it passes the support for all classes to the final classification. The benefit of this scheme arises in cases where some frames give more or less equal support for two or more classes. Intuitively, making a hard decision towards one of these classes involves a high risk for misclassification. The idea is to lower this risk by keeping the support for all classes and counting on frames in Δt with significant support for one class providing a reliable class decision. Formulating this mathematically, normalized distance vectors dt are collect over time resulting in an accumulated distance vector d˙ Δt on which the decision function E(.) is applied: e˙ Δt = E d˙ Δt , d˙ Δt = L dt , L d = . . . , l(di ), . . . . (12) t∈Δt
Combining Different Feature Types. The voting and weighting technique based on different sums also allows for a straightforward combination of different feature types with differing dimensions. In our case, a set of discriminant functions G3D(.) is learned based on 3D features and another set GGist(.) based on Gist features. For a frame window {F_t}_{t∈Δt} and a set of feature vectors {x^{3D}_t, x^{Gist}_t}_{t∈Δt}, the corresponding distance vectors are denoted by d^{3D}_t = G^{3D}(x^{3D}_t) and d^{Gist}_t = G^{Gist}(x^{Gist}_t). The combined decision is formulated as follows using the described voting schemes:
\[
\ddot{\mathbf{e}}^{\mathrm{Comb}}_{\Delta t} = E\big(\ddot{\mathbf{d}}^{\mathrm{3D}}_{\Delta t} + \ddot{\mathbf{d}}^{\mathrm{Gist}}_{\Delta t}\big), \qquad \ddot{\mathbf{d}}^{\mathrm{3D}}_{\Delta t} = \sum_{t \in \Delta t} A\big(\mathbf{d}^{\mathrm{3D}}_t\big)\, E\big(\mathbf{d}^{\mathrm{3D}}_t\big), \tag{13}
\]
\[
\ddot{\mathbf{d}}^{\mathrm{Gist}}_{\Delta t} = \sum_{t \in \Delta t} A\big(\mathbf{d}^{\mathrm{Gist}}_t\big)\, E\big(\mathbf{d}^{\mathrm{Gist}}_t\big), \qquad \dot{\mathbf{e}}^{\mathrm{Comb}}_{\Delta t} = E\big(\dot{\mathbf{d}}^{\mathrm{3D}}_{\Delta t} + \dot{\mathbf{d}}^{\mathrm{Gist}}_{\Delta t}\big), \tag{14}
\]
\[
\dot{\mathbf{d}}^{\mathrm{3D}}_{\Delta t} = \sum_{t \in \Delta t} L\big(\mathbf{d}^{\mathrm{3D}}_t\big), \qquad \dot{\mathbf{d}}^{\mathrm{Gist}}_{\Delta t} = \sum_{t \in \Delta t} L\big(\mathbf{d}^{\mathrm{Gist}}_t\big). \tag{15}
\]
Rejection. Depending on the selected window size, the speed of the camera drive, and the current frame rate of the camera, the actually acquired frames might only show uninformative scene views or may be disturbed by persons moving in front of the camera. The resulting classifications are unreliable and can be excluded by using the modified decision function E′(.) (→ Eq. 16). Classifications are rejected (0) if the distances to the class boundary of the best class and the second best class do not differ significantly. During evaluation θrej = 0.05 is used, leading to rejection of not more than 20% of the test frames. We apply E′(.) only in the final classification stage to the vector dΔt.
\[
E'(\mathbf{d}) = (0, \dots, e_{j_1}, \dots, 0), \qquad \text{where} \quad e_{j_1} = \begin{cases} 1 & : \dfrac{d_{j_1} - d_{j_2}}{d_{j_1}} > \theta_{\mathrm{rej}}, \\ 0 & : \text{else.} \end{cases} \tag{16}
\]
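To make the fusion and rejection rules concrete, the following NumPy sketch implements the two voting schemes and the rejection test as reconstructed from Eqs. 8–12 and 16. It assumes the per-frame distance vectors d_t are already available as rows of an array; all function and variable names are illustrative and are not taken from the authors' implementation.

```python
# Minimal NumPy sketch of the frame-fusion schemes (Eqs. 8-12 and 16).
# `dists` is a (T, 6) array of per-frame SVM distance vectors d_t.
import numpy as np

def logistic(x):                       # l(x) = 1 / (1 + e^-x), Eq. 11
    return 1.0 / (1.0 + np.exp(-x))

def decide(d):                         # E(d): one-hot vector for the best class, Eq. 8
    e = np.zeros_like(d, dtype=float)
    e[np.argmax(d)] = 1.0
    return e

def reliability(d):                    # A(d) = l(d_j1) - l(d_j2), Eq. 10
    top2 = np.sort(d)[-2:]             # second best, best
    return logistic(top2[1]) - logistic(top2[0])

def vote_v2xe(dists):                  # V2xE: reliability-weighted per-frame decisions, Eq. 9
    acc = sum(reliability(d) * decide(d) for d in dists)
    return decide(acc)

def vote_v1xe(dists):                  # V1xE: accumulate normalized support, Eq. 12
    acc = sum(logistic(d) for d in dists)
    return decide(acc)

def decide_with_rejection(d, theta_rej=0.05):   # modified decision E'(d), Eq. 16
    top2 = np.sort(d)[-2:]
    if (top2[1] - top2[0]) / top2[1] > theta_rej:
        return decide(d)
    return np.zeros_like(d, dtype=float)        # reject: all components zero
```

In practice the accumulated vector from either voting scheme would be passed through `decide_with_rejection` in the final classification stage.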
[Plots of classification rate versus window size omitted. Panels: (a) voting scheme V1×E, rej.: θrej = 0.05; (b) voting scheme V1×E, rej.: no; (c) voting scheme V2×E, rej.: θrej = 0.05; (d) voting scheme V2×E, rej.: no.]
Fig. 4. This figure shows the development of the classification results when enlarging the window Δt from 1 to 300 frames (→ x-axis). Two voting schemes are evaluated, V1×E and V2×E, introduced in Eq. 15 and 14. The tested feature vectors and combinations are: green → (x3D, xGist), orange → (xDGist, xGist), blue → xGist, cyan → xDGist, red → x3D, and black → xSP. σ̄ denotes the mean standard deviation of the classification curve. r̄ denotes the mean rejection (rej.) rate. (best viewed in color)
5 Evaluation of Room Type Categorization
This section describes the evaluation of the performance in determining the room type for a set of frames of an unknown room, using our proposed combination of spatial 3D and Gist features (x3D, xGist), Eq. 14 and Eq. 15. Its performance is compared to classification results using pure x3D or pure xGist. Additionally, the Depth-Gist feature vector xDGist and its combination with the standard Gist feature vector, (xDGist, xGist), is tested as an alternative combination of features from 2D and 3D data. We also apply a scene classification scheme to our dataset that is based on spatial pyramid matching proposed by Lazebnik et al. [19] (implementation available at http://www.cs.unc.edu/~lazebnik/). The required dictionary is built on the amplitude images of our training sets with a dictionary size of 100 and 8500-dimensional feature vectors, xSP. The evaluation utilizes the 3D indoor database of rooms presented in Sec. 3. All 2D descriptors are computed on the amplitude image to achieve a fair comparison with the 3D descriptor, as the same amount of scene detail is considered. Further, to avoid a bias arising from the chosen test sequences, all presented classification
results are averaged over 10 training runs. Within each run one room per class is chosen randomly, extracting all frames of that room as the test sequence, while the frames of the remaining rooms form the training set. For a better comparison of the different features, SVM models with an RBF kernel are trained, where the kernel’s parameters are optimized through a 10-fold cross-validation scheme choosing the model with the smallest possible set of support vectors (Occam’s razor). Here, we utilized the SVMlight implementation of the support vector machine algorithm [34, 35]. Additionally, the models learnt on the IKEA database are tested for their applicability on Swissranger frames acquired in two real flats. An important parameter of our system is the number of consecutive frames in Δt used for determining the current class label through voting. As our sequences are recorded with a camera standing at one or two positions and being moved continuously, simulating a robot’s head looking around, the possible values could range from one frame to all frames of the sequence. Performing the room category decision on one frame is expected to be very fast but vulnerable to noise. The more frames are considered, the more stable the classification result should become. There is a trade-off between getting a room type label quickly and getting it reliably. The curves in Fig. 4 show the development of the classification rates as the window size Δt is enlarged from 1 to 300 frames. As assumed, the performance of all features increases if the window gets bigger. Taking a deeper look at all subfigures of Fig. 4, it can be seen that our proposed combination (x3D, xGist) of spatial 3D and Gist features clearly outperforms the other features under all voting and rejection conditions. E.g., the classification rates in Fig. 4a (mean: 0.84, max: 0.94) are higher than the corresponding values of the second best curve (which are 0.71 and 0.79 using combined Depth-Gist and Gist features). The error is reduced by 45% and 71%, respectively. This reduction is especially remarkable because it points out that features need to be carefully defined on 3D data in order to capture informative spatial structures in a sufficiently generalized way. Here, we are able to show that this can be done with lower dimensional features. Figure 4 also shows the influence of different voting and rejection schemes on the categorization performance. Skipping the winner-takes-all decision on the frame level, Eq. 15, influences the pure Gist based approaches only marginally, while classification rates using x3D are significantly increased from (mean: 0.55, max: 0.60) to (mean: 0.65, max: 0.80). As shown in Fig. 4a, the spatial 3D features perform comparably to Gist features, even though their dimensionality of 25 is 20 times smaller than the 512 dimensions of the Gist vectors. Lazebnik’s spatial pyramid features, xSP, show the worst performance even though 8500-dimensional feature vectors are defined. The high mean standard deviation value of 0.19 arises from the fact that they work well for one or two classes while they do not work at all for other classes (→ Fig. 5). Rejection of frames when no clear decision can be taken is a reasonable step, as some frames will not show meaningful views of a scene. As expected, the classification results are improved (compare 4a with 4b and 4c with 4d). The influence of rejection is higher on voting scheme V1×E than on V2×E. Taking decisions on
Fig. 5. This figure shows the confusion matrices of all examined features. Here, the window size for voting over consecutive frames is set to Δt = 60. The average classification rates for individual classes are listed along the diagonal, all other entries show misclassifications.
the frame level leads to clearer decisions on the window level with the drawback of being more vulnerable to noise. In Fig. 5 confusion matrices of the curves in Fig. 4a at window size Δt = 60 are given. Taking into account the frame rate of the camera there is an initial delay of 2 to 6 seconds before the system would deliver class labels or reach its full categorization performance. The confusion matrices of x3D and xGist give an impression of their complementary nature. Therefore, fusion of both features leads to improved indoor scene classification capabilities. To test the generalizability of the room models trained on the entire IKEA database we additionally acquired sequences from rooms (bedroom, eating place, kitchen, and living room) of two real flats, F1 and F2. Per room 300 frames were acquired with the Swissranger positioned at the door opening. The room models are acquired using 3D and Gist features combined with voting scheme V1×E over a voting window Δt = 60. Fig. 6 shows pictures of the rooms and the distribution of the resulting class labels using the learnt room models. The continuous red line denotes the ground truth labeling and the black circles mark the
Fig. 6. This figure shows the classification results of test sequences (bed.F1, eat.F1, kit.F1, liv.F1, bed.F2, eat.F2, kit.F2, and liv.F2) acquired in two real flats, F1 and F2. The red line refers to the ground truth and the black circles mark the classification results. The shown results are achieved by using 3D features and Gist features in combination, the voting scheme V1×E, and a voting window of size Δt = 60. (best viewed in color)
classification results. In general, a quite good overall categorization performance of 0.65 and 0.84 is achieved. In particular, the sequences kit.F1, kit.F2, liv.F1, liv.F2, and eat.F2 are correctly recognized over nearly the whole sequence. Only some frames in the middle of kit.F2 are rejected or wrongly classified, due to the fact that the camera was directed towards the kitchen window, which led to a lot of noisy data because of the disturbing effect of sunlight on the ToF measurement principle. This effect is also responsible for the misclassification of the complete eat.F1 sequence. Suppressing this effect is subject to a technical solution, since a suppression of background illumination for ToF sensors already exists [36]. Also, an atypical absence or arrangement of furniture, like only a bed in a room or the bed placed in the corner, leads to misclassifications and rejections, as happened for bed.F1 and bed.F2. This problem can be tackled by introducing such rooms into the training of the room models. Finally, we can state that our lightweight 25-dimensional feature vector, x3D, encodes enough meaningful information about the structural properties of rooms, leading to a remarkably good performance in determining the room type for a given sequence of frames. The small dimensionality of our 3D features recommends these features for use on robots with limited computing resources and restricted capabilities to acquire thousands of training images. Complementary information from Gist features xGist computed on the corresponding amplitude images further improves the categorization.
6 Conclusion
In this paper we have shown that feature vectors capturing the 3D geometry of a scene can significantly contribute to solving the problem of indoor scene categorization. We introduce a 3D indoor database and propose a method for defining features on extracted planar surfaces that encode the 3D geometry of the scene. Additionally, we propose voting techniques which allow fusing single classification results from several consecutive frames and from different feature types such as 3D features and 2D Gist features. In our experiments, we have been able to show that Gist features and 3D features complement one another when combined through voting. The 3D cue leads to better labeling results earlier. As the fusion of classification results from consecutive frames provides a more stable scene categorization result, we plan to investigate the categorization performance on registered 3D point clouds. Such a registered cloud holds a bigger part of the scene, which should lead to a more stable classification ability compared to the current situation where the scene classification has to be done on a partial scene view. Also, we would like to introduce class hierarchies where the room types like kitchen are located on the top level and subparts of the scenes like “a wall with bookshelves” or “sideboard-like furniture for placing things on it” are located on the lower levels.

Acknowledgment. This work is funded by the Collaborative Research Center CRC 673 “Alignment in Communication”.
References 1. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42, 145–175 (2001) 2. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR (2009) 3. Henderson, J.M., Larson, C.L., Zhu, D.C.: Full scenes produce more activation than close-up scenes and scene-diagnostic objects in parahippocampal and retrosplenial cortex: An fMRI study. Brain and Cognition 66, 40–49 (2008) 4. Henderson, J.M., Larson, C.L., Zhu, D.C.: Cortical activation to indoor versus outdoor scenes: An fMRI study. Experimental Brain Research 179, 75–84 (2007) 5. Topp, E., Christensen, H.: Dectecting structural ambiguities and transistions during a guided tour. In: ICRA, pp. 2564–2570 (2008) 6. Ullah, M.M., Pronobis, A., Caputo, B., Luo, J., Jensfelt, P., Christensen, H.I.: Towards Robust Place Recognition for Robot Localization. In: ICRA (2008) 7. Ranganathan, A., Dellaert, F.: Semantic modeling of places using objects. In: Robotics: Science and Systems (2007) ´ 8. Mozos, O.M., Stachniss, C., Burgard, W.: Supervised learning of places from range data using adaboost. In: ICRA, pp. 1730–1735 (2005) 9. Zender, H., Mozos, O., Jensfelt, P., Kruijff, G.J.M., Burgard, W.: Conceptual spatial representations for indoor mobile robots. RAS 56, 493–502 (2008) 10. Galindo, C., Saffiotti, A., Coradeschi, S., Buschka, P., Fern´ andez-Madrigal, J.A., Gonz´ alez, J.: Multi-hierarchical semantic maps for mobile robotics. In: IROS, pp. 3492–3497 (2005)
11. Vasudevan, S., G¨ achter, S., Nguyen, V., Siegwart, R.: Cognitive maps for mobile robots - an object based approach. RAS 55, 359–371 (2007) 12. Kim, S., Kweon, I.S.: Scene interpretation: Unified modeling of visual context by particle-based belief propagation in hierarchical graphical model. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 963–972. Springer, Heidelberg (2006) 13. Pirri, F.: Indoor environment classification and perceptual matching. In: Intl. Conf. on Knowledge Representation, pp. 30–41 (2004) 14. Torralba, A., Murphy, K.P., Freeman, W.T., Rubin, M.A.: Context-based vision system for place and object recognition. In: ICCV (2003) 15. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: CVPR (2005) 16. Posner, I., Schr¨ oter, D., Newman, P.M.: Using scene similarity for place labelling. In: Intl. Symp. on Experimental Robotics, pp. 85–98 (2006) 17. Bosch, A., Zisserman, A., Mu˜ noz, X.: Scene classification using a hybrid generative/discriminative approach. PAMI 30, 712–727 (2008) 18. Vogel, J., Schiele, B.: A semantic typicality measure for natural scene categorization. In: Rasmussen, C.E., B¨ ulthoff, H.H., Sch¨ olkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 195–203. Springer, Heidelberg (2004) 19. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006) 20. Huber, D.F., Kapuria, A., Donamukkala, R., Hebert, M.: Parts-based 3D object classification. In: CVPR, pp. 82–89 (2004) 21. Cs´ ak´ any, P., Wallace, A.M.: Representation and classification of 3D objects. Systems, Man, and Cybernetics 33, 638–647 (2003) 22. Johnson, A.E., Herbert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. PAMI 21, 433–449 (1999) 23. Munoz, D., Bagnell, J.A., Vandapel, N., Hebert, M.: Contextual classification with functional max-margin markov networks. In: CVPR (2009) ´ 24. Triebel, R., Schmidt, R., Mozos, O.M., Burgard, W.: Instance-based amn classification for improved object recognition in 2D and 3D laser range data. In: IJCAI, pp. 2225–2230 (2007) 25. Rusu, R.B., Marton, Z.C., Blodow, N., Beetz, M.: Learning informative point classes for the acquisition of object model maps. In: Intl. Conf. on Control, Automation, Robotics and Vision, Hanoi, Vietnam (2008) 26. Lee, D., Hebert, M., Kanade, T.: Geometric reasoning for single image structure recovery. In: CVPR (2009) 27. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV (2009) 28. Yu, S.X., Zhang, H., Malik, J.: Inferring spatial layout from a single image via depth-ordered grouping. In: CVPR Workshops (2008) 29. Weingarten, J., Gruener, G., Siegwart, R.: A state-of-the-art 3D sensor for robot navigation. In: IROS, vol. 3, pp. 2155–2160 (2004) 30. Hois, J., W¨ unstel, M., Bateman, J., R¨ ofer, T.: Dialog-based 3D-image recognition using a domain ontology. In: Barkowsky, T., Knauff, M., Ligozat, G., Montello, D.R. (eds.) Spatial Cognition 2007. LNCS (LNAI), vol. 4387, pp. 107–126. Springer, Heidelberg (2007) 31. Stamos, I., Allen, P.K.: Geometry and texture recovery of scenes of large scale. CVIU 88, 94–118 (2002)
32. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395 (1981) 33. Kuncheva, L.I.: Combining Pattern Classifiers. John Wiley & Sons, Chichester (2004) 34. Vapnik, V.N.: The Nature of Statistical Learning Theory (1995) 35. Joachims, T.: Learning to Classify Text Using Support Vector Machines. PhD thesis, Cornell University (2002) 36. M¨ oller, T., Kraft, H., Frey, J., Albrecht, M., Lange, R.: Robust 3D measurement with PMD sensors. In: 1st Range Imaging Research Day at ETH (2005)
Closed-Form Solutions to Minimal Absolute Pose Problems with Known Vertical Direction

Zuzana Kukelova1, Martin Bujnak2, and Tomas Pajdla1

1 Center for Machine Perception, Czech Technical University in Prague
2 Bzovicka 24, 85107, Bratislava, Slovakia
Abstract. In this paper we provide new simple closed-form solutions to two minimal absolute pose problems for the case of known vertical direction. In the first problem we estimate absolute pose of a calibrated camera from two 2D-3D correspondences and a given vertical direction. In the second problem we assume camera with unknown focal length and radial distortion and estimate its pose together with the focal length and the radial distortion from three 2D-3D correspondences and a given vertical direction. The vertical direction can be obtained either by direct physical measurement by, e.g., gyroscopes and inertial measurement units or from vanishing points constructed in images. Both our problems result in solving one polynomial equation of degree two in one variable and one, respectively two, systems of linear equations and can be efficiently solved in a closed-form. By evaluating our algorithms on synthetic and real data we demonstrate that both our solutions are fast, efficient and numerically stable.
1 Introduction
Cheap consumer cameras have become precise enough to be used in many computer vision applications, e.g. structure from motion [2, 15, 22, 23] or recognition [13, 14]. The good precision of these cameras makes it possible to simplify camera models and therefore also the algorithms used in these applications. For instance, for the pinhole camera model it is common to set the camera skew to zero, the pixel aspect ratio to one and the principal point to the center of the image. Moreover, since most digital cameras put information about the focal length into the image header (EXIF), it is frequently assumed that this is a good approximation of the whole internal calibration of the camera and therefore the camera is considered calibrated. These assumptions allow the use of simpler algorithms, e.g. the 5-point relative pose algorithm for calibrated cameras [19, 24], and as was shown in [12, 15, 16, 22, 23] they work well even when these assumptions are not completely satisfied. More and more cameras are also equipped with inertial measurement units (IMUs) like gyroscopes and accelerometers, compasses, or GPS devices and add information from these devices into the image header too. This is even more apparent in modern cellular phones and other smart devices. Unfortunately, since GPS position precision is on the order of meters, it is not sufficient to provide external camera calibration, i.e. the absolute position of the
camera. Nor is compass accuracy (which could provide the camera yaw) good enough to provide the camera orientation, since its sensor is highly influenced by magnetic field disturbances. On the other hand, currently available IMUs (even the low cost ones) provide very accurate roll and pitch angles, i.e. the vertical direction. The angular accuracy of the roll and pitch angles in low cost IMUs like [26] is about 0.5◦, and in high accuracy IMUs it is less than 0.02◦. While most of today's cameras use information from gyroscopes and IMUs only to distinguish whether the image orientation is landscape or portrait, we show that knowing the camera “up vector” can radically simplify the camera absolute pose problem. Determining the absolute pose of a camera given a set of n correspondences between 3D points and their 2D projections is known as the Perspective-n-Point (PnP) problem. This problem is one of the oldest problems in computer vision with a broad range of applications in structure from motion [2, 15] or recognition [13, 14]. PnP problems for fully calibrated cameras for three and more than three points have been extensively studied in the literature. One of the oldest papers on this topic dates back to 1841 [7]. Recently a large number of solutions to the calibrated PnP problems have been published [5, 18, 20, 21], including several solutions to the PnP problems for partially calibrated cameras [1, 3, 9, 25]. These solutions assume unknown focal length [1, 3, 25], unknown focal length and radial distortion [9], or unknown focal length and principal point [25]. In this paper we provide new simple closed-form solutions to two minimal absolute pose problems for cameras with known rotation around two axes, i.e. known vertical direction. In the first problem we estimate the absolute pose of a calibrated camera from two 2D-3D correspondences and a given vertical direction. In the second problem we assume a camera with unknown focal length and radial distortion and estimate its pose together with the focal length and the radial distortion from three 2D-3D correspondences and a given vertical direction. In both cases this is the minimal number of correspondences needed to solve these problems. Both these problems result in solving one polynomial equation of degree two in one variable and one, respectively two, systems of linear equations and can be efficiently solved in a closed-form way. By evaluating our algorithms on synthetic and real data we show that both our solutions are fast, efficient and numerically stable. Moreover, both presented algorithms are very useful in real applications like structure from motion, surveillance camera calibration, or registering photographs taken by a camera mounted on a car moving on a plane. Our work builds on recent results from [10], where an efficient algorithm for relative pose estimation of a camera with known vertical direction from three point correspondences was presented. It was demonstrated that in the presence of good vertical direction information this algorithm is more accurate than the classical 5-point relative pose algorithm [19, 24] for calibrated cameras. Next we provide the formulation of our two problems and show how they can be solved.
2 Problem Formulation
In this section we formulate two problems of determining absolute pose of a camera given 2D-3D correspondences and the vertical direction, respectively rotation angles around two axes.

Problem 1. Given the rotation of the calibrated camera around two axes and the images of two known 3D reference points, estimate the absolute position of the camera and the rotation of the camera around the third axis (y-axis).

Problem 2. Given the rotation of the camera with unknown focal length around two axes and the distorted images of three known 3D reference points, estimate the absolute position of the camera, the rotation of the camera around the third axis (y-axis), the unknown focal length and the parameter of radial distortion.

In both problems we use the standard pinhole camera model [8]. In this model the image projection u of a 3D reference point X can be written as
\[
\lambda\,\mathbf{u} = \mathtt{P}\,\mathbf{X}, \tag{1}
\]
where P is a 3 × 4 projection matrix, λ is an unknown scalar value and the points u = [u, v, 1]⊤ and X = [X, Y, Z, 1]⊤ are represented by their homogeneous coordinates. The projection matrix P can be written as
\[
\mathtt{P} = \mathtt{K}\,[\mathtt{R} \mid \mathbf{t}], \tag{2}
\]
where R is a 3 × 3 rotation matrix, t = [tx, ty, tz]⊤ contains the information about the camera position and K is the 3 × 3 calibration matrix of the camera. In our problems we assume that we know the vertical direction, i.e. the coordinates of the world vector [0, 1, 0]⊤ in the camera coordinate system. This vector can often be obtained from a vanishing point or directly from the IMU by a very simple calibration of the IMU w.r.t. the world coordinate system. It is important that all that has to be known to calibrate the IMU w.r.t. the world coordinate system is the direction of the “up-vector”, i.e. the direction which is perpendicular to the “ground plane”, i.e. the plane z = 0 in the world coordinate system. This direction is easy to calibrate by resetting the IMU when it is laid down on the “ground plane”. Note that this “ground plane” need not be horizontal. Moreover, very often IMUs are precalibrated to use the gravity vector as the up-vector. Then, the IMU is calibrated when the plane z = 0 of the world coordinate system corresponds to the ground plane, i.e. if the coordinates z = 0 are assigned to the points on the ground plane. From the “up-vector” returned by the IMU we can compute the rotation of the camera around two axes, in this case the x-axis (φx) and the z-axis (φz). Note that the IMU sometimes directly returns these two angles. We will denote this rotation as
\[
\mathtt{R}_v = \begin{bmatrix} \cos\varphi_z & -\sin\varphi_z & 0 \\ \sin\varphi_z & \cos\varphi_z & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi_x & -\sin\varphi_x \\ 0 & \sin\varphi_x & \cos\varphi_x \end{bmatrix}. \tag{3}
\]
Therefore, the only unknown parameter in the camera rotation matrix R is the rotation angle φy around the y-axis, i.e. the vertical axis, and we can write
\[
\mathtt{R} = \mathtt{R}(\varphi_y) = \mathtt{R}_v\,\mathtt{R}_y(\varphi_y), \tag{4}
\]
where Rv is the known rotation matrix (3) around the vertical axis and Ry(φy) is the unknown rotation matrix around the y-axis of the form
\[
\mathtt{R}_y = \begin{bmatrix} \cos\varphi_y & 0 & -\sin\varphi_y \\ 0 & 1 & 0 \\ \sin\varphi_y & 0 & \cos\varphi_y \end{bmatrix}. \tag{5}
\]
With this parametrization the projection equation (1) has the form
\[
\lambda\,\mathbf{u} = \mathtt{K}\,[\mathtt{R}(\varphi_y) \mid \mathbf{t}]\,\mathbf{X} = \mathtt{K}\,[\mathtt{R}_v\,\mathtt{R}_y(\varphi_y) \mid \mathbf{t}]\,\mathbf{X}. \tag{6}
\]
For our problems we “simplify” the projection equation (6) by eliminating the scalar value λ and the trigonometric functions sin and cos. To eliminate sin and cos we use the substitution q = tan(φy/2), for which it holds that cos φy = (1 − q²)/(1 + q²) and sin φy = 2q/(1 + q²). Therefore we can write
\[
\mathtt{R}_y(q) = \begin{bmatrix} 1 - q^2 & 0 & -2q \\ 0 & 1 + q^2 & 0 \\ 2q & 0 & 1 - q^2 \end{bmatrix}. \tag{7}
\]
The scalar value λ from the projection equation (6) can be eliminated by multiplying equation (6) with the skew symmetric matrix [u]×. Since [u]× u = 0 we obtain the matrix equation
\[
[\mathbf{u}]_\times\, \mathtt{K}\,[\mathtt{R}_v\,\mathtt{R}_y(q) \mid \mathbf{t}]\,\mathbf{X} = \mathbf{0}. \tag{8}
\]
This matrix equation results in three polynomial equations from which only two are linearly independent. This is caused by the fact that the skew symmetric matrix [u]× has rank two. Polynomial equations (8) are the basic equations which we use for solving our two problems. Next we describe how these equations look for our two problems and how they can be solved.
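As a sanity check of this parametrization, the following short sketch (our own illustration, not the authors' code) builds Rv from the two known angles as in Eq. (3) and the tan-half-angle matrix Ry(q) of Eq. (7), and verifies that the latter, divided by (1 + q²), reproduces the rotation Ry(φy) of Eq. (5).

```python
# Sketch of the rotation parametrization of Eqs. (3)-(7); variable names are ours.
import numpy as np

def R_v(phi_x, phi_z):
    """Known rotation computed from the measured vertical direction, Eq. (3)."""
    Rz = np.array([[np.cos(phi_z), -np.sin(phi_z), 0.],
                   [np.sin(phi_z),  np.cos(phi_z), 0.],
                   [0.,             0.,            1.]])
    Rx = np.array([[1., 0.,             0.],
                   [0., np.cos(phi_x), -np.sin(phi_x)],
                   [0., np.sin(phi_x),  np.cos(phi_x)]])
    return Rz @ Rx

def R_y_q(q):
    """Unknown rotation about the y-axis with q = tan(phi_y / 2), Eq. (7).
    Returned up to the positive scale factor (1 + q^2)."""
    return np.array([[1 - q**2, 0.,        -2 * q],
                     [0.,       1 + q**2,   0.],
                     [2 * q,    0.,         1 - q**2]])

# Sanity check: R_y_q(q) / (1 + q^2) equals the rotation matrix R_y(phi_y) of Eq. (5)
phi_y = 0.3
q = np.tan(phi_y / 2)
Ry = R_y_q(q) / (1 + q**2)
assert np.allclose(Ry, [[np.cos(phi_y), 0., -np.sin(phi_y)],
                        [0., 1., 0.],
                        [np.sin(phi_y), 0., np.cos(phi_y)]])
```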
3 Absolute Pose of a Calibrated Camera with Known Up Direction
In the case of Problem 1 for calibrated camera the calibration matrix K is known and therefore the projection equation (8), [u]× K [Rv Ry(q) | t] X = 0, has the form
\[
\begin{bmatrix} 0 & -1 & v \\ 1 & 0 & -u \\ -v & u & 0 \end{bmatrix}
\begin{bmatrix} r_{11}(q) & r_{12}(q) & r_{13}(q) & t_x \\ r_{21}(q) & r_{22}(q) & r_{23}(q) & t_y \\ r_{31}(q) & r_{32}(q) & r_{33}(q) & t_z \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \mathbf{0}, \tag{9}
\]
where u is a calibrated image point and rij(q) are the elements of the rotation matrix R = Rv Ry(q). Note that these elements are quadratic polynomials in q. In this case we have four unknowns tx, ty, tz, q = tan(φy/2), and since each 2D-3D point correspondence gives us two constraints, i.e. two independent polynomial equations of the form (9), the minimal number of point correspondences needed to solve this problem is two. Matrix equation (9) gives us two linearly independent equations for each point, and each of these equations contains the monomials q², q, tx, ty, tz, 1. Since we have two points, we have four of these equations and can therefore use three of them to eliminate tx, ty and tz from the fourth one. In this way we obtain one polynomial of degree two in the single variable q. Such an equation can easily be solved in closed form and generally results in two solutions for q = tan(φy/2). Backsubstituting these two solutions into the remaining three equations gives us three linear equations in the three variables tx, ty and tz. By solving these linear equations we obtain the solutions to our problem.
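The following NumPy sketch illustrates one possible implementation of this two-point solver. The coefficient assembly follows Eq. (9) with the (1 + q²)-scaled Ry(q) of Eq. (7); instead of the paper's hand-derived elimination, the translation is eliminated numerically via a left null vector of its coefficient block, and t is then recovered by a linear solve once the rotation is known. Function names and the numerical elimination strategy are our own choices, not the authors' implementation.

```python
# Sketch of the up2p solver of Sect. 3: two calibrated 2D-3D correspondences
# plus the known rotation Rv.
import numpy as np

def up2p(pts2d, pts3d, Rv):
    """pts2d: two calibrated image points (u, v); pts3d: two 3D points (X, Y, Z)."""
    # monomial bases of the (1 + q^2)-scaled Ry(q) of Eq. (7): M2*q^2 + M1*q + M0
    M0 = np.eye(3)
    M1 = np.array([[0., 0., -2.], [0., 0., 0.], [2., 0., 0.]])
    M2 = np.diag([-1., 1., -1.])
    A, B = [], []                                   # coeffs of [q^2, q, 1] and of t
    for u, Xw in zip(pts2d, pts3d):
        S = np.array([[0., -1., u[1]],              # two independent rows of [u]x
                      [1., 0., -u[0]]])
        A.append(np.column_stack([S @ Rv @ M @ np.asarray(Xw) for M in (M2, M1, M0)]))
        B.append(S)
    A, B = np.vstack(A), np.vstack(B)               # both 4x3
    n = np.linalg.svd(B)[0][:, -1]                  # left null vector of B: t drops out
    a, b, c = n @ A                                 # quadratic a*q^2 + b*q + c = 0
    poses = []
    for q in np.roots([a, b, c]):
        if abs(q.imag) > 1e-9:
            continue
        phi = 2.0 * np.arctan(q.real)
        Ry = np.array([[np.cos(phi), 0., -np.sin(phi)],
                       [0., 1., 0.],
                       [np.sin(phi), 0., np.cos(phi)]])
        R = Rv @ Ry
        rhs = np.concatenate([-np.array([[0., -1., u[1]], [1., 0., -u[0]]]) @ (R @ np.asarray(Xw))
                              for u, Xw in zip(pts2d, pts3d)])
        t = np.linalg.lstsq(B, rhs, rcond=None)[0]  # linear solve for the translation
        poses.append((R, t))
    return poses                                    # up to two (R, t) candidates
```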
4 Absolute Pose of a Camera with Unknown Focal Length and Radial Distortion and Known Camera Up Direction
In Problem 2 we assume that the skew in the calibration matrix K is zero, the aspect ratio is one and the principal point is in the center of the image. Modern cameras are close to these assumptions and, as will be shown in the experiments, these assumptions give good results even when they are not exactly satisfied. The only parameter from K which cannot be safely set to a known value is the focal length f. Here we assume that the focal length f is unknown and therefore the calibration matrix K has the form diag[f, f, 1]. Since the projection matrix is given only up to scale, we can equivalently write K = diag[1, 1, w] for w = 1/f. For this problem we further assume that the image points are affected by some amount of radial distortion. Here we model this radial distortion by the one-parameter division model proposed in [6]. This model is given by the formula
\[
\mathbf{p}_u \sim \mathbf{p}_d / \big(1 + k\, r_d^2\big), \tag{10}
\]
where k is the distortion parameter, pu = [uu, vu, 1]⊤, resp. pd = [ud, vd, 1]⊤, are the corresponding undistorted, resp. distorted, image points, and rd is the radius of pd w.r.t. the distortion center. We assume that the distortion center is in the center of the image and therefore rd² = ud² + vd². Since the projection equation (8) contains undistorted image points u but we measure distorted ones, we need to substitute the relation (10) into equation (8). Let ûi = [ûi, v̂i, 1]⊤ be the i-th measured distorted image point, the projection of the
3D point Xi = [Xi, Yi, Zi, 1]⊤, and let r̂i² = ûi² + v̂i². Then ui = [ûi, v̂i, 1 + k r̂i²]⊤ and the projection equation (8) for the i-th point has the form
\[
\begin{bmatrix} 0 & -(1 + k\hat r_i^2) & \hat v_i \\ 1 + k\hat r_i^2 & 0 & -\hat u_i \\ -\hat v_i & \hat u_i & 0 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & w \end{bmatrix}
\begin{bmatrix} r_{11}(q) & r_{12}(q) & r_{13}(q) & t_x \\ r_{21}(q) & r_{22}(q) & r_{23}(q) & t_y \\ r_{31}(q) & r_{32}(q) & r_{33}(q) & t_z \end{bmatrix}
\begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix} = \mathbf{0}, \tag{11}
\]
where again rij(q) are the elements of the rotation matrix R = Rv Ry(q). In this case we have six unknowns tx, ty, tz, q = tan(φy/2), k, w = 1/f, and since each 2D-3D point correspondence gives us two constraints, the minimal number of point correspondences needed to solve this problem is three. For three point correspondences the matrix equation (11) gives us nine polynomial equations, from which only six are linearly independent. Let us denote these nine polynomial equations g1i = 0, g2i = 0 and g3i = 0 for i = 1, 2, 3, where g1i = 0 is the equation corresponding to the first row of the matrix equation (11), g2i = 0 to the second and g3i = 0 to the third row. Now consider the third polynomial equation from the matrix equation (11), i.e. the equation g3i = 0. This equation has the form
\[
g_{3i} = c_1 q^2 + c_2 q + c_3 t_x + c_4 t_y + c_5 = 0, \tag{12}
\]
where cj, j = 1, . . . , 5 are known coefficients. The polynomial equation (12) does not contain the variables tz, k and w. Since we have three 2D-3D correspondences, we have three equations g3i = 0, i = 1, 2, 3. Therefore, we can use two of these equations to eliminate tx and ty from the third one, e.g. from g31 = 0. In this way we obtain one polynomial equation of degree two in the single variable q. Such an equation can easily be solved in closed form and generally results in two solutions for q = tan(φy/2). Backsubstituting these two solutions into the remaining two equations corresponding to the third row of (11), i.e. g32 = 0 and g33 = 0, gives us two linear equations in the two variables tx and ty, which can again be easily solved. After obtaining solutions for q, tx and ty, we can substitute them into the equations corresponding to the first row of (11). After the substitution, these equations have the form
\[
g_{1i} = c_6\, t_z w + c_7\, w + c_8\, k + c_9 = 0 \tag{13}
\]
for i = 1, . . . , 3 and known coefficients cj, j = 6, . . . , 9. Equations (13) can again be solved in a linear way. This is because we can consider the monomial tz w as a new variable and solve a linear system of three equations in three variables. Solving this linear system, together with the solutions for q, tx and ty, gives us in general two solutions to our absolute pose problem for a camera with unknown focal length and radial distortion, given the camera rotation around two axes.
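The linearization trick in this last step is worth spelling out: treating the product tz·w as a single new unknown turns the three first-row equations into an ordinary 3 × 3 linear system. The sketch below is our own illustration; the coefficients c6–c9 are assumed to have been assembled per correspondence from Eq. (11) after substituting q, tx and ty. It recovers tz, the focal length f = 1/w and the distortion parameter k.

```python
# Sketch of the second, linear stage of the up3p+f+k solver (Eq. 13).
import numpy as np

def solve_tz_f_k(coeffs):
    """coeffs: 3x4 array, one row [c6, c7, c8, c9] per correspondence."""
    coeffs = np.asarray(coeffs, dtype=float)
    A, b = coeffs[:, :3], -coeffs[:, 3]
    m, w, k = np.linalg.solve(A, b)    # unknowns: m = tz*w, w = 1/f, k
    f = 1.0 / w
    tz = m / w                         # undo the monomial substitution
    return tz, f, k
```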
5 Experiments
We have tested both our algorithms on synthetic data (with various levels of noise, outliers, focal lengths and radial distortions and different angular deviation
of the vertical direction) and on real datasets and compared them with the P3P algorithm for calibrated camera presented in [5], the P4P algorithm for camera with unknown focal length from [3] and with the P4P algorithm for camera with unknown focal length and radial distortion [9]. In all experiments we denote our new algorithms as up2p and up3p+f+k, and compared algorithms as p3p [5], p4p+f [3] and p4p+f+k [9].

5.1 Synthetic Data Set
We initially studied our algorithms on synthetically generated ground-truth 3D scenes. These scenes were generated randomly with Gaussian distributions in a 3D cube. Each 3D point was projected by a camera, where the camera orientation and position were selected randomly but looking at the scene. Then radial distortion using the division model [6] was added to all image points to generate noiseless distorted points. Finally, Gaussian noise with standard deviation σ was added to the distorted image points, assuming a 1000 × 1000 pixel image.

Noise free data set. In the first synthetic experiment, we studied the numerical stability of both proposed solvers on exact measurements and compared them with the algorithms [3, 5, 9]. In this experiment 1000 random scenes and camera poses were generated. The radial distortion parameter was set to kgt = −0.2 and the focal length to fgt = 1.5. In the case of our calibrated up2p algorithm, the p3p algorithm from [5] and the p4p+f algorithm [3], these values were used to precalibrate the cameras. The rotation error was measured as the rotation angle in the angle-axis representation of the relative rotation R R_gt^{-1}, and the translation error as the angle between the ground-truth and estimated translation vectors. In this case the rotation and translation errors for both our algorithms were below machine precision and therefore we were not able to properly display them here. Figure 1 (Left) shows the log10 relative error of the focal length f obtained by selecting the real root closest to the ground truth value fgt = 1.5. The log10 absolute error of the radial distortion parameter is in Figure 1 (Right).
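For reference, the two error measures used throughout the synthetic experiments can be computed as in the following short sketch (our own formulation of the stated definitions): the rotation error is the angle of the relative rotation R R_gt^{-1}, and the translation error is the angle between the estimated and ground-truth translation vectors.

```python
# Sketch of the rotation and translation error measures used in the experiments.
import numpy as np

def rotation_error_deg(R, R_gt):
    R_rel = R @ R_gt.T                                   # R_gt is orthonormal, so R_gt^-1 = R_gt^T
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))

def translation_error_deg(t, t_gt):
    cos_angle = np.dot(t, t_gt) / (np.linalg.norm(t) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```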
p4pf up3p p4pfk
200 150
up3p p4pfk
60 100
40
50
20 0
−16 −14 −12 −10 −8 −6 Log10 relative focal length error
0
−15 −10 −5 Log10 absolute radial distrion error
Fig. 1. Log10 relative error of the focal length f (Left) and Log10 absolute error of the radial distortion parameter k (Right) obtained by selecting the real root closest to the ground truth value fgt = 1.5 and kgt = −0.2
Fig. 2. The influence of the accuracy of the vertical direction on the estimation of rotation (Left) and translation (Right) for our calibrated up2p algorithm
It can be seen that the newly proposed up3p+f+k solver for a camera with unknown focal length and radial distortion and a given vertical direction (Magenta) provides more accurate estimates of both k and f than the p4p+f+k solver presented in [9] (Green) and the p4p+f solver from [3] (Blue).

Vertical direction angular deviation. In real applications the vertical direction, obtained either by direct physical measurement by, e.g., gyroscopes and IMUs, or from vanishing points constructed in images, is not accurate. The angular accuracy of the roll and pitch angles in low cost IMUs like [26] is about 0.5◦, and in high accuracy IMUs it is less than 0.02◦. Therefore, in the next experiment we tested the influence of the accuracy of the vertical direction on the estimation of rotation (Figure 2 (Left) and 3 (Top left)), translation (Figure 2 (Right) and 3 (Top right)), focal length (Figure 3 (Bottom left)) and radial distortion (Figure 3 (Bottom right)). In this case the ground truth focal length was fgt = 1.5 and the radial distortion kgt = −0.2. For each vertical direction angular deviation, 1000 estimates of both our algorithms, the calibrated up2p (Figure 2) and the up3p+f+k with unknown focal length and radial distortion (Figure 3), were made. The vertical direction angular deviation varied from 0◦ to 2◦. All results in Figures 2 and 3 are represented by the MATLAB function boxplot, which shows the 25% to 75% quantile range as a blue box with a red horizontal line at the median. The red crosses show data beyond 1.5 times the interquartile range. The rotation error is again measured as the rotation angle in the angle-axis representation of the relative rotation R R_gt^{-1}, and the translation error as the angle between the ground-truth and estimated translation vectors. The focal length and radial distortion figures show the directly estimated values. It can be seen that even for a relatively high error in the vertical direction the median values of the estimated focal lengths and radial distortion parameters are very close to the ground truth values (Green horizontal line). Note that the angular rotation and translation errors cannot be smaller than the angular error of the vertical direction.
Fig. 3. The influence of the accuracy of the vertical direction on the estimation of rotation (Top left), translation (Top right), focal length (Bottom left) and radial distortion (Bottom right) for our up3p+f+k algorithm
Data affected by noise. Boxplots in Figure 4 show the behavior of both our solvers, the calibrated up2p solver (Cyan) and the up3p+f+k solver (Magenta), together with the p3p algorithm [5] (Red), the p4p+f algorithm from [3] (Blue) and the p4p+f+k algorithm [9] (Green) in the presence of noise added to the image points. In this experiment, for each noise level from 0.0 to 2 pixels, 1000 estimates for random scenes and camera positions with fgt = 1.5 and kgt = −0.2 were made. Here it can be seen that since our algorithms use fewer points than [3, 5, 9], they are a little bit more sensitive to noise added to the image points than the algorithms [3, 5, 9]. The exception is the rotation error (Figure 4 (Top left)), where both our new solvers outperform the existing algorithms [3, 5, 9]. This might be due to the fact that the rotation axis is fixed in both our algorithms. However, our new algorithms still provide good estimates also for translation (Figure 4 (Top right)), focal length (Bottom left) and radial distortion (Bottom right) in the presence of noise.

RANSAC experiment. The final synthetic experiment shows the behaviour of our algorithms within the RANSAC paradigm [5]. Figure 5 shows the mean value of the number of inliers over 1000 runs of RANSAC as a function of the number of samples of the RANSAC. We again compare our calibrated up2p algorithm (Cyan) with the p3p algorithm for calibrated camera presented in [5] (Red), and our up3p+f+k algorithm for camera with unknown focal length and radial distortion (Magenta) with the p4p+f algorithm from [3] (Blue) and the p4p+f+k algorithm [9] (Green). Figure 5 (Left) shows results for 50% outliers, pixel noise
Fig. 4. Error of rotation (Top left), translation (Top right), focal length estimates (Bottom left) and radial distortion estimates (Bottom right) in the presence of noise for our calibrated up2p algorithm (Cyan), the calibrated p3p algorithm from [5] (Red), our up3p+f+k algorithm (Magenta), the p4p+f algorithm presented in [3] (Blue) and the p4p+f+k algorithm [9] (Green)
0.5px, fgt = 1.5, kgt = −0.2 and the vertical direction angular deviation 0.02◦, and Figure 5 (Right) shows results for the same configuration but with a vertical direction angular deviation of 1◦. These vertical direction angular deviations reflect the precision attainable with low cost and high cost IMUs. A vertical direction precision of about 1◦ can be achieved when taking pictures by hand using a standard smartphone. As can be seen, both our new algorithms slightly outperform the remaining algorithms [3, 5, 9].

Computation times. Since both our problems result in solving one polynomial equation of degree two in one variable and one, respectively two, systems of linear equations and can be solved in a closed-form, they are extremely fast. The Matlab implementation of both our solvers runs in about 0.1 ms. For comparison, our Matlab implementation of the p3p algorithm [5] runs in about 0.6 ms, the Matlab implementation of the p4p+f algorithm [3] downloaded from [17] in about 2 ms and the original implementation of the p4p+f+k algorithm [9] even in about 70 ms.

5.2 Real Data
Synthetically generated vertical direction. In the real experiment we created a 3D reconstruction from a set of images captured by an off-the-shelf camera. We used a reconstruction pipeline similar to the one described in [23]. We take this
[Plots of mean support versus RANSAC cycles omitted. Left: 0.5px noise level, 0.02 degree up direction deviation; Right: 0.5px noise level, 1.0 degree up direction deviation.]
Fig. 5. The mean value of the number of inliers over 1000 runs of RANSAC as a function of the number of samples of the RANSAC
Fig. 6. Results of real experiment for vertical direction randomly rotated with standard deviation 0.5◦
reconstruction as a ground truth reference for further comparison, even if it might not be perfect. To simulate a gyroscope measurement we extracted the “vertical direction” from the reconstructed cameras and randomly rotated it by a certain angle. This way we created 1000 random vertical directions with a normal distribution for maximal deviations of 0.02, 0.05, 0.5 and 1 degrees, to simulate industry quality, standard and low cost gyroscope measurements. Note that from the 3D reconstruction we have a set of 2D-3D tentative correspondences as well as correct correspondences. To make the registration more complicated we added more outliers to the set of tentative correspondences for those cameras where the fraction of correct features was greater than 50%. We randomly connected 2D measurements and 3D points to make 50% of the tentative correspondences wrong (outliers). Next, for each random vertical direction we executed locally optimized RANSAC [4] with both our new algorithms. Further, we also evaluated 1000 runs of locally optimized RANSAC with the calibrated p3p algorithm [5], the p4p+f algorithm presented in [3] and the p4p+f+k solver from [9] and compared all results with the ground truth rotation, translation and focal length. Note that our calibrated up2p algorithm and the p3p algorithm for calibrated camera from [5] require internal camera calibration. Hence we used the calibration obtained from the 3D reconstruction. Results for the vertical direction randomly rotated with standard deviation 0.5◦ are in Figure 6. It can be seen that our new algorithms (up2p and up3p+f+k) provide comparable estimates of rotation (Left), translation (Center) and focal
Fig. 7. Real experiment: Example of input images (Left). Results from RANSAC on the image with small radial distortion (Center) and on the image with significant radial distortion (Right).
Fig. 8. Results of real experiment for real IMU sensor from HTC HD2
length (Right) as the algorithms p3p [5], p4p+f [3] and p4p+f+k [9], even for this higher vertical direction deviation. Moreover, our algorithms use fewer points, give only up to two solutions which have to be verified inside the RANSAC loop, and are considerably faster than the algorithms [3, 5, 9]. All these aspects are very important in RANSAC and in real applications. The mean value of the number of inliers over 1000 runs of RANSAC as a function of the number of samples of the RANSAC, for two cameras and a vertical direction deviation of 0.5◦, can be seen in Figure 7. It can be seen that both our algorithms outperform the existing algorithms [3, 5, 9].

Vertical direction obtained from real IMU. In this experiment we used an HTC HD2 phone to obtain real images with IMU data. We took several images by hand of a book placed on an A4 paper. The image size was 800×600 pixels. We assigned 3D coordinates to the paper corners and manually clicked the corresponding 2D points in all input images. First, RANSAC with the new solvers was used. Then, a non-linear refinement [8] optimizing focal length, radial distortion, rotation and translation was performed. For our calibrated up2p algorithm we used the radial distortion and focal length calculated from the non-linearly refined result obtained from up3p+f+k.
Figure 8 shows the 3D model of the paper and the book back-projected into the images using the calculated camera poses. The green wire-frame model of the book shows the camera estimated from a minimal sample using our algorithms (up2p (Left), up3p+f+k (Center+Right)); red represents the model after non-linear refinement. Magenta lines represent the gravity vector obtained from the phone. As can be seen, we obtained quite precise results (reprojection error about 1.5px) even for standard images taken by hand and without calibration of the relative position of the IMU sensor and the camera inside the phone.
6 Conclusion
In this paper we have presented new closed-form solutions to two minimal absolute pose problems for the case of known vertical direction, i.e. known rotation of the camera around two axes. In the first problem we estimate the absolute pose of a calibrated camera, and in the second that of a camera with unknown focal length and radial distortion. We show that information about the vertical direction, obtained by direct physical measurement by, e.g., gyroscopes and inertial measurement units (IMUs) or from vanishing points constructed in images, can radically simplify the camera absolute pose problem. In the case of the two presented problems, the information about the vertical direction leads to very fast, efficient and numerically stable closed-form solutions. For comparison, the only existing algorithm for estimating the absolute pose of a camera with unknown focal length and radial distortion [9] without the vertical direction leads to the LU decomposition of a 1134 × 720 matrix and further eigenvalue computations, and is very impractical in real applications. Experiments show that both our algorithms are very useful in real applications like structure from motion. Moreover, since more and more cameras and smart devices are equipped with IMUs and can save information about the vertical direction, i.e. roll and pitch angles, to the image header, we see great future potential for the presented solvers.

Acknowledgement. This work has been supported by EC project FP7-SPACE-218814 PRoVisG, by Czech Government research program MSM6840770038 and by Grant Agency of the CTU Prague project SGS10/072/OHK4/1T/13.
References 1. Abidi, M.a., Chandra, T.: A new efficient and direct solution for pose estimation using quadrangular targets: Algorithm and evaluation. IEEE PAMI 17(5), 534–538 (1995) 2. 2D3. Boujou, www.2d3.com 3. Bujnak, M., Kukelova, Z., Pajdla, T.: A general solution to the P4P problem for camera with unknown focal length. In: CVPR 2008 (2008) 4. Chum, O., Matas, J., Kittler, J.: Locally optimized RANSAC. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 236–243. Springer, Heidelberg (2003)
5. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981) 6. Fitzgibbon, A.: Simultaneous linear estimation of multiple view geometry and lens distortion. In: CVPR 2001, pp. 125–132 (2001) 7. Grunert, J.A.: Das pothenot’sche Problem, in erweiterter Gestalt; nebst Bemerkungen u ¨ber seine Anwendung in der Geod¨ asie. Archiv der Mathematik und Physik 1, 238–248 (1841) 8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003) 9. Josephson, K., Byr¨ od, M., ˚ Astr¨ om, K.: Pose Estimation with Radial Distortion and Unknown Focal Length]. In: Proc. Conference on Computer Vision and Pattern Recognition (CVPR 2009), Florida, USA (2009) 10. Kalantari, M., Hashemi, A., Jung, F., Gu´edon, J.-P.: A New Solution to the Relative Orientation Problem using only 3 Points and the Vertical Direction, Computing Research Repository (CoRR), abs/0905.3964 (2009) 11. Kelly, J., Sukhatme, G.S.: Fast Relative Pose Calibration for Visual and Inertial Sensors. Springer Tracts in Advanced Robotics, vol. 54 (2009) 12. Li, X., Wu, C., Zach, C., Lazebnik, S., Frahm, J.-M.: Modeling and Recognition of Landmark Image Collections Using Iconic Scene Graphs. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 427–440. Springer, Heidelberg (2008) 13. Leibe, B., Cornelis, N., Cornelis, K., Gool, L.J.V.: Dynamic 3d scene analysis from a moving vehicle. In: CVPR (2007) 14. Leibe, B., Schindler, K., Gool, L.J.V.: Coupled detection and trajectory estimation for multi-object. In: ICCV (2007) 15. Martinec, D., Pajdla, T.: Robust rotation and translation estimation in multiview reconstruction. In: CVPR (2007) 16. Microsoft Photosynth, http://photosynth.net/ 17. Minimal Problems in Computer Vision, http://cmp.felk.cvut.cz/minimal/ 18. Moreno-Noguer, F., Lepetit, V., Fua, P.: Accurate non-iterative o(n) solution to the pnp problems. In: ICCV (2007) 19. Nister, D.: An efficient solution to the five-point relative pose. IEEE PAMI 26(6), 756–770 (2004) 20. Quan, L., Lan, Z.-D.: Linear n-point camera pose determination. IEEE PAMI 21(8), 774–780 (1999) 21. Reid, G., Tang, J., Zhi, L.: A complete symbolic-numeric linear method for camera pose determination. In: ISSAC 2003, pp. 215–223. ACM, New York (2003) 22. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. ACM Transactions on Graphics (SIGGRAPH Proceedings) 25(3) (2006) 23. Snavely, N., Seitz, S.M., Szeliski, R.: Modeling the world from Internet photo collections. International Journal of Computer Vision 80(2), 189–210 (2008) 24. Stewenius, H., Engels, C., Nister, D.: Recent developments on direct relative orientation. ISPRS J. of Photogrammetry and Remote Sensing 60, 284–294 (2006) 25. Triggs, B.: Camera pose and calibration from 4 or 5 known 3d points. In: Proc. 7th Int. Conf. on Computer Vision, Kerkyra, Greece, pp. 278–284. IEEE Computer Society Press, Los Alamitos (1999) 26. Xsens, www.xsens.com
Level Set with Embedded Conditional Random Fields and Shape Priors for Segmentation of Overlapping Objects

Xuqing Wu and Shishir K. Shah

Quantitative Imaging Laboratory, Department of Computer Science, University of Houston, Houston, TX 77204, USA
Abstract. Traditional methods for segmenting touching or overlapping objects may lead to the loss of accurate shape information which is a key descriptor in many image analysis applications. While experimental results have shown the effectiveness of using statistical shape priors to overcome such difficulties in a level set based variational framework, problems in estimation of parameters that balance evolution forces from image information and shape priors remain unsolved. In this paper, we extend the work of embedded Conditional Random Fields (CRF) by incorporating shape priors so that accurate estimation of those parameters can be obtained by the supervised training of the discrete CRF. In addition, non-parametric kernel density estimation with adaptive window size is applied as a statistical measure that locally approximates the variation of intensities to address intensity inhomogeneities. The model is tested for the problem of segmenting overlapping nuclei in cytological images.
1 Introduction
Segmentation of multiple objects in images wherein the objects may overlap or occlude other objects/regions of interest remains a challenging problem. This is certainly the case for analysis of cytological images, wherein the classification of the imaged sample is based on the shape, distribution and arrangement of cells/nuclei, and other contextual information [1]. To automate this classification process, the computer needs to learn cellular characteristics by extracting corresponding image features from the region of interest (ROI). The accuracy of feature extraction depends on the performance of segmentation results. However, reliable segmentation of cytological images remains challenging because of intensity inhomogeneity and the existence of touching and overlapping cells/nuclei. Li et al. [2] gave a brief review on some methodologies towards the development of automated methods for nuclei or cell segmentation. Among all proposed methods, thresholding, watershed, and deformable models or active contours remain the most popular approaches. In general, watershed segmentation is claimed to be more effective in terms of differentiating connected nuclei or cells. However, the watershed model for segmentation of multiple objects is not suitable
Fig. 1. Overlapping ellipse example: there are two objects in this image, A and B
for separating overlapping objects because any segmented pixel/set is generally associated to a single label and a pixel cannot be granted membership to more than one label. For example, fig. 1(a) shows two objects A and B with an overlapping area C. Fig. 1(b) shows the expected segmentation result under the watershed scheme and fig. 1(c) presents the ground truth with two elliptical objects indicated by red and blue contours. Accurate description of the shape is lost in the segmented image shown in fig. 1(b). Coupled curve evolution [3] and multiphase level set [4] provide a powerful framework for segmenting multiple connected objects. In addition, shape priors can also be easily integrated into the level set formulation resulting in a more constrained segmentation [5]. Nonetheless, these models are subject to the same labeling constraints so as to avoid assigning different labels to overlapping regions. A practical solution to overcome this limitation is to evolve each curve independently [6] and let the shape prior control the boundary in the overlapping region. One of the major drawbacks of the aforementioned continuous level set method is the ad-hoc style of the parameter estimation [7]. In order to leverage the power of structured discriminative models like Conditional Random Fields (CRF), Cobzas and Schmidt [7] proposed to embed CRF into the level set formulation to improve edge regularization, incorporate correlations in labels of neighboring sites and evaluate weighting parameters for each energy term by pre-trained discrete CRF models. A key problem that remains unsolved is the explicit integration of shape information in a discrete CRF model so that parameters that balance the influence from image information and shape priors can be evaluated systematically through embedding. Our work addresses this very problem and is distinguished by three key contributions. First, we propose a novel spatial adaptive kernel density estimator to associate local observations with expected labels. Second, we apply the top-down CRF model [8] as a constraint and embed it into the level set formulation as the shape measurement. Third, we solve the parameter estimation problem for the curve evolution with shape priors by embedding potential functions and parameters learned from the discrete CRF model. The rest of the paper is organized as follows. Section 2 briefly introduces the CRF embedding method proposed by Cobzas and Schmidt [7]. The proposed adaptive kernel density estimator and embedded shape prior are presented in section 3. Estimation of parameters and the experimental evaluation of the developed model for the problem of segmenting overlapping nuclei in cytological images are presented in section 4. The paper is concluded in section 5.
232
2
X. Wu and S.K. Shah
Level Set Methods with Embedded CRF
Image segmentation is a fundamental problem in image analysis and can be viewed as classification or labeling of image pixels/regions, wherein y represents a set of labels to be assigned to each pixel represented by its value x. Since most natural images are not random collection of pixels [9], it is important to find an accurate model to describe information in the image to achieve good segmentation accuracy. Both CRF and level set are popular mathematical models used for solving segmentation problems. A binary CRF with weighted Ising potentials for the task of segmentation is formulated as follows [7]: ⎛ ⎞ 1 yi wT fi (x) + (1 − |yi − yj |) vk fijk (x)⎠ , p(y|x, w, v) = exp − ⎝ Z i∈N
i,j∈E
k
(1) where Z is a normalizing constant known as the partition function, N is the number of nodes, fi (x) denotes features for node i, and fijk represents the k th feature for the edge between nodes i and j. Node labels are yi . The association term yi wT fi (x), denoted as Ai from here on, relates the observation x to the label yi . This term corresponds to the level set data energy (Eq. 2) that is inside or outside of the contour. The interaction term (1 − |yi − yj |) k vk fijk (x), denoted by Iij from here on, is a measure of how the labels at neighboring sites should interact. This term is analogous to the regularization energy (Eq. 2) in the level 1 set model. In particular, fijk (x) is defined as 1+|fik (x)−f . Because Eq. 1 is jk (x)| jointly log-concave in w and v, we can find the global maximum in terms of w and v. Level set methods define the evolving curve C in Ω ∈ R2 . C is implicitly represented as the zero-level of an embedding signed distance function φ : Ω → R. In particular, to segment an image into two disjoint regions Ω1 and Ω2 , the energy functional of a continuous model with embedded CRF has the following form: 1 −H(Φ)(wT f ) + (1 − H(Φ))(wT f ) + |∇H(Φ)| vk dx , E(Φ) =
1 + |∇f k| Ω k data term
regularization term
(2) where k is the index for each edge feature and vk ≥ 0. The associated EulerLagrange equation is:
∂Φ 1 ∇Φ T = δ (Φ) −2w f + div , (3) vk ∂t 1 + ∇fk |∇Φ| k
where δ is the regularized version of the delta function δ. The new continuous formulation takes the advantage of the embedded CRF model whose parameters {w, v} can be estimated by the supervised training in its discrete counterpart.
Level Set with Embedded CRF and Shape Priors
3
233
Adaptive Kernel Density Estimation and Embedded Shape Priors
In general, the discriminative model wT f works well when the test image can be well predicated through the supervised regressive training on sample images. However, when the variance of the difference between predicated and observed values is not well contained, the resulting segmentation can be ill-conditioned. This is especially the case for segmenting cells/nuclei from cytological smears because of the intensity inhomogeneity caused by the cell chromatin, the existence of dense intracellular as well as intercellular matter. In this paper, we propose to use a non-parametric density estimation approach to capture the local variation without imposing restrictive assumptions. In addition, we apply the top-down CRF model [8] for the shape constraint and derive its corresponding embedded term in the continuous space. Thus the weighting factor of shape priors can be estimated from the discrete CRF model. 3.1
Adaptive Kernel Density Estimation
Parzen window is a classical non-parametric kernel density estimation method. Nonetheless, the traditional approach of using a fixed window size h imposes negative effects on segmentation results when there are geometrically disjoint regions in the image, because not every data point makes an equal contribution to the final density estimation of a pixel i in the 2D space. For example, pixels located far away from pixel i should weigh less in the formulation. To have the bandwidth change with the spatial distance, we define the adaptive window size hij as:
14 d2ij 1 1 √ exp hij = , (4) N σ π σ2 where dij is the Euclidean distance between pixel i and j in the 2D space, σ is a user defined scale parameter, and N is the size of sample points. Thus, the new estimated intensity density function at pixel i is: N xi − xl 1 1 K pˆ(xi ) = , (5) N hil hil l=1
where K(·) is the Gaussian kernel, and xl is a sample drawn from population with density p(xi ). The accuracy of the estimated density is increased by using small bandwidth in regions that are near the pixel i since samples in the near neighborhood provide better estimation. On the other hand, a larger bandwidth is more appropriate for samples located far away. To be less sensitive to noise, we could convolve image x with a Gaussian kernel so that samples xl in Eq. 5 are replaced by a smoother version x ˜l . Discrete CRF model for segmenting an image into two disjoint regions Ω1 and Ω2 can now be re-written as: ⎛ ⎞ 1 p(y|x, w, v) = exp − ⎝ −wk log p(fi (x)|Ωk ) + Iij ⎠ k ∈ {1, 2} , (6) Z i∈N
i,j∈E
234
X. Wu and S.K. Shah
where −w log p(fi (x)|Ωk ) forms the association term Ai , Iij is the same as in Eq. 1, and p(fi (x)|Ωk ) is estimated using the adaptive kernel as:
Nk 1 1 fi (x) − f˜l (x) p(fi (x)|Ωk ) = K ∀l ∈ Ωk , (7) Nk hil hil l=1
where f˜l (x) is a smoother version of the node feature fl (x). Embedding CRF model of Eq. 6 into the functional in the continuous space, the new contour energy, Ecv , in the level set counterpart takes on the form: −H(Φ)w1 log p(f (x)|Ω1 ) − (1 − H(Φ))w2 log p(f (x)|Ω2 ) Ecv (Φ) = Ω
+|∇H(Φ)|
k
vk
1 dx , 1 + |∇fk |
(8)
whose parameters {w, v} are trained by the discrete CRF. Minimizing the energy in Eq. 8 is accomplished by solving the following Euler-Lagrange equation:
∂Φ 1 ∇Φ log p(f (x)|Ω ) 1 = δ (Φ) wT · . vk + div − log p(f (x)|Ω2 ) ∂t 1 + ∇fk |∇Φ| k (9) Fig. 2 shows that, compared to the local association model with Gaussian region statistics, the non-parametric approach introduced in Eq. 9 is less sensitive to the variation of intensity values. This results in improved segmentation even without the introduction of shape priors for non-occluding objects. We note that there has been a previous proposal, called local binary fitting energy (LBF) model, for fitting energy based intensity information in local regions [10]. Compared to the LBF model, the kernel function K used in Eq. 4 is a well-behaved kernel [11] and the asymptotic convergence of the kernel density estimator is guaranteed [12]. Unlike the LBF model, parameters of Eq. 8 are estimated systematically through the discrete CRF model.
(a)
(b)
(c)
Fig. 2. Comparative results for segmenting images with intensity inhomogeneity (a) Segmentation result using adaptive kernel density estimation. (b) Segmentation result using Gaussian model. (c) Intensity profile of the area inside of the nucleus.
Level Set with Embedded CRF and Shape Priors
3.2
235
Embedded Shape Priors
The occluded boundary of overlapping regions cannot be detected by algorithms that only use edge or region information. Statistical shape models have been developed to overcome these difficulties [5]. Inspired by the work of Levin et al. [8] and Cremers et al. [12], we propose an explicit integration of shape priors into the discrete CRF and derive embedding entities in the level set model. Therefore, parameters that weigh the influence from shape information can be estimated accurately by training the CRF model. Let an implicit representation of a shape be a binary image patch with labeling of {0, 1}. Let φc be a binary image patch representing a shape, φc (i) be the label of pixel i. The distance between two shapes a and b is defined N as d2 (φca , φcb ) = i=1 (φca (i) − φcb (i))2 , where N is the size of the image patch. Given a set of training shapes φci : i = 1...M , the probability density of a shape φc can be estimated as: M 1 1 2 c c p(φ ) = exp − 2 d (φ , φi ) , M i=1 2σ c
(10)
1 M 2 c c where σ 2 can be estimated by M i=1 minj=i d (φi , φj ) [12]. Let Su0 = c −z log p(φu0 ) be the potential associated with the image patch φc centered at u0 with binary label y and p(φcu0 ) is the probability density of a shape φc centered at u0 . Incorporating this into the discrete CRF model results in: ⎛ ⎞ 1 p(y|x, w, v, z) = exp − ⎝ Ai + Iij + Su0 ⎠ . (11) Z i∈N
i,j∈E
Ai and Iij are defined as in Eq. 6. Parameters {w, v, z} are estimated through the training of the discrete CRF model. To embed the shape prior into the level set, we re-define the implicit shape representation φ as the signed distance function. The distance between two shapes a and b in the level set formulation can be given by: 2 (H(φa (x)) − H(φb (x)))2 dx , (12) dL (φa , φb ) = Ω
where H(·) denotes the Heaviside function. The density of a shape in the level set formulation, represented by pL , is given as: M 1 1 2 pL (φu0 ) = exp − 2 dL (φ(x + u0 ), φi (x)) , M i=1 2σ
(13)
where u0 accounts for translation in continuous space. Let the shape energy Eshape = −z log(pL (φu0 )), then the total energy of the contour in the level set formulation is the sum of Ecv defined in Eq. 8 and Eshape , whose parameters {w, v, z} are estimated by the discrete CRF. Total energy is then minimized by:
236
X. Wu and S.K. Shah
∇Φ ∂Φ 1 log p(f (x)|Ω ) 1 vk + div = δ (Φ) wT · + − log p(f (x)|Ω2 ) ∂t 1 + ∇fk |∇Φ| k exp(− 2σ12 d2L (φ(x + u0 ), φi (x)))(H(φ(x + u0 )) − H(φi (x))) L z , (14) 2 σL exp(− 2σ12 d2L (φ(x + u0 ), φi (x))) L
M
1 2 2 is evaluated by M where σL i=1 minj=i dL (φi , φj ). During the computation, Φ is reinitialized to the signed distance function and we assume homogeneous Neumann boundary conditions.
3.3
Construction of Shape Priors
With the objective of segmenting nuclei, a set of contours of elliptical shape priors was generated according to the general parametric form of the ellipse: x(t) = a cos t cos θ − b sin t sin θ (15) y(t) = a cos t sin θ + b sin t cos θ , where t ∈ [0, 2π]. θ is the angle between the X-axis and the major axis of the ellipse. The ratio ab is a measure of how much the conic section deviates from being circular. In this paper, we construct shape prior with ab ∈ {0.4, 0.6, 0.8, 1.0}. The angle of the major axis, θ, was also varied from 0 to π in steps of π8 . Fig. 3 shows the array of elliptical training shapes. Each training shape is represented by a 32 × 32 patch. Rotation invariance is integrated into the estimation of the density of a shape (Eq. 10 or Eq. 13) by using this set of training shapes.
Fig. 3. Training shapes for elliptical objects with different eccentricities and rotation angles
4
Implementation and Experimental Results
Images used for training and testing are obtained from slides of cytological smears that are Papanicolaou or Diff-Quik stained for the purpose of differentiating pathologies in fine needle aspirated cells from thyroid nodules. Images
Level Set with Embedded CRF and Shape Priors
237
are acquired by using an optical microscope and a grayscale CCD camera having 768 × 512 pixels (9 × 9μm) at 8-bit digitization. Manual segmentation was performed to establish the ground truth. The segmentation accuracy is measured by global consistency error, a well-established standard measurement used in image segmentation problem [13]. To evolve the curve according to Eq. 14, an iteration scheme is implemented by discretizing the PDE [4]. The level set function is initialized as a step function that takes negative values inside the contour and positive values outside of the contour. The initial contour is obtained from the segmentation result using watershed and unsupervised clustering [14]. Since the watershed lines separate the image into non-overlapped regions, we can evolve each closed watershed line of a region independently. Four parameters are estimated by training the discrete CRF with shape priors [15]: w1 is related to image information inside of the contour; w2 is for image information outside of the contour; v governs the force defined by edge features; z balances the influences from the shape prior. We have selected 20 images for the training: ten images with Papanicolaou stain and ten images with Diff-Quik stain. Each of the image contains at least one pair of overlapping nuclei. Parameters are trained separately for different stains. Table. 1 lists the value of parameters trained for different stained images. Fig. 4 shows that if the process of curve evolution is dominated by edge features, the evolved contour can not arrive at the boundary of the nucleus of interest. Although a global optimal selection of parameters in terms of segmentation accuracy can not be guaranteed [15], parameters estimated by the proposed scheme can be treated as local optimal selections. To further evaluate this local optimal property of selected parameters obtained through the CRF training, we added a small perturbation to the value z 1 of these parameters by varying the ratios w w2 and v . Fig. 5(a) and fig. 5(b) show Table 1. Parameters Estimation Results From Training Stain w1 w2 v z Papanicolaou 1.0 1.0 3.25e3 8e5 Diff-Quik 1.0 2.0 1.73e3 9e5
(a)
(b)
Fig. 4. Comparative results for parameters estimation. (a) Contour of balanced forces; (b) Contour of weak constraints from shape prior.
238
X. Wu and S.K. Shah
(a)
(b) Fig. 5. Comparative segmentation results for parameters estimation. (a) Varying the 1 (b) Varying the ratio vz . ratio w w2
the segmentation accuracy across the entire training set under varying parameters. It can be seen that the selected parameters from CRF training have the best segmentation performance. To quantify the performance of the proposed method, a total of 40 testing images were selected. Half of these images were Papanicolaou stained while the other half are Diff-Quik stained. The performance of the proposed model is evaluated with two criteria, the accuracy of the segmentation in terms of the area occupied by nuclei and the number of nuclei correctly counted by each segmented region. First, we evaluated potential benefit of using non-parametric adaptive density estimation to model variations of intensities in cytological images. We found that the average segmentation accuracy using the proposed model integrated with adaptive density estimation is 81%, as opposed to 72% accuracy rate obtained by the using the Gaussian regional model integrated in the overall framework.
Level Set with Embedded CRF and Shape Priors
(a) Proposed method (b) Watershed
(c) CRF
(d) Proposed method (e) Watershed
(f) CRF
239
Fig. 6. Sample segmentation results using different models. (a)(b)(c) Papanicolaou stained sample. (d)(e)(f) Diff-Quik stained sample. Table 2. Comparison of accuracy of segmentation and nuclei counting Model Accuracy of segmentation Accuracy of nuclei counting Proposed method 81% 94% CRF 79% 52% Watershed 74% 94%
Second, we compared our method with two other popular models, watershed [14] and CRF [16]. Test results listed in Table. 2 show that the proposed model has the best overall performance in terms of segmentation and nuclei counting accuracy. The low accuracy of nuclei counting by using CRF model is expected due to its tendency to merge adjoining regions together as shown in fig. 6(c) and fig. 6(f). Compared to the CRF model, although watershed is more effective in differentiating overlapping nuclei, it has two drawbacks. First, the shape information of nuclei, a key morphological feature in cytological classification, could be distorted because of the unique labeling constraint. This is attributed to the fact that each pixel in the overlapping region could have two different labels when two nuclei overlap. Second, the model used for constructing watershed lines is dominated by the image intensity only, which is very sensitive to the intensity inhomogeneity. As a result, its segmentation accuracy in terms of the area occupied by the nuclei is lower than CRF, which models both local and neighborhood information. The proposed method inherits the modeling advantage of CRF and can model non-local dependencies during the curve evolution [7]. Because each contour evolves independently, pixels in overlapping areas can be assigned multiple labels as indicated in fig. 6(a) and fig. 6(d). Fig. 7 presents another example of segmentation results after applying the proposed method, watershed and CRF to Papanicolaou and Diff-Quik stained images, respectively.
240
X. Wu and S.K. Shah
(a) Proposed method
(b) Watershed
(c) CRF
(d) Proposed method
(e) Watershed
(f) CRF
Fig. 7. Segmentation results using different models. (a)(b)(c) Papanicolaou stained sample. (d)(e)(f) Diff-Quik stained sample.
5
Conclusion
Shape information is a key descriptor in analyzing cytological images. To recover the shape information for overlapping objects after segmentation, we extend the embedded CRF model by incorporating shape priors. In contrast to other approaches, parameters that govern the force defined by the image information and shape priors are estimated accurately through the training of discrete CRF. By evolving each contour independently, original shape information is recovered by assigning multiple labels to the common region shared by overlapping nuclei. In addition, we have presented an adaptive scheme for non-parametric density estimation based on intensity information in local regions. The proposed method is tested for the problem of segmenting overlapping nuclei with intensity inhomogeneity while preserving shape information of occluded objects.
References 1. Amrikachi, M., Ramzy, I., Rubenfeld, S., Wheeler, T.: Accuracy of fine-needle aspiration of thyroid. Arch. Pathol. Lab. Med. 125, 484–488 (2001) 2. Li, G., Liu, T., Nie, J., Guo, L., Chen, J., Zhu, J., Xia, W., Mara, A., Holley, S., Wong, S.: Segmentation of touching cell nuclei using gradient flow tracking. Journal of Microscopy 231, 47–58 (2008)
Level Set with Embedded CRF and Shape Priors
241
3. Fan, X., Bazin, P.L., Prince, J.L.: A multi-compartment segmentation framework with homeomorphic level sets. In: CVPR (2008) 4. Vese, L., Chan, T.: A multiphase level set framework for image segmentation using the mumford and shah model. IJCV 50, 271–293 (2002) 5. Rousson, M., Paragios, N.: Shape priors for level set representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 78–92. Springer, Heidelberg (2002) 6. Mosaliganti, K., Gelas, A., Gouaillard, A., Noche, R.: Detection of spatially correlated objects in 3d images using appearance models and coupled active contours. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009. LNCS, vol. 5762, pp. 641–648. Springer, Heidelberg (2009) 7. Cobzas, D., Schmidt, M.: Increased discrimination in level set methods with embedded conditional random fields. In: CVPR (2009) 8. Levin, A., Weiss, Y.: Learning to combine bottom-up and top-down segmentation. IJCV 81, 105–118 (2009) 9. Kumar, S., Hebert, M.: Discriminative random fields. IJCV 68, 179–201 (2006) 10. Li, C., Kao, C.Y., Gore, J.C., Ding, Z.: Implicit active contours driven by local binary fitting energy. In: CVPR (2007) 11. Mittal, A., Paragios, N.: Motion-based background substraction using adaptive kernel density estimation. In: CVPR (2004) 12. Cremers, D., Osher, S.J., Soatto, S.: Kernel density estimation and intrinsic alignment for shape priors in level set segmentation. IJCV 69, 335–351 (2006) 13. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV (2001) 14. Wu, X., Shah, S.: Comparative analysis of cell segmentation using absorption and color images in fine needle aspiration cytology. In: IEEE International Conference on Systems, Man and Cybernetics (2008) 15. Szummer, M., Kohli, P., Hoiem, D.: Learning cRFs using graph cuts. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 582– 595. Springer, Heidelberg (2008) 16. Wu, X., Shah, S.: A bottom-up and top-down model for cell segmentation using multispectral data. In: ISBI (2010)
Optimal Two-View Planar Scene Triangulation Kenichi Kanatani and Hirotaka Niitsuma Department of Computer Science, Okayama University, Okayama 700-8530, Japan {kanatani,niitsuma}@suri.cs.okayama-u.ac.jp
Abstract. We present a new algorithm for optimally computing from point correspondences over two images their 3-D positions when they are constrained to be on a planar surface. We consider two cases: the case in which the plane and camera parameters are known and the case in which they are not. In the former, we show how observed point correspondences are optimally corrected so that they are compatible with the homography between the two images determined by the plane and camera parameters. In the latter, we show how the homography is optimally estimated by iteratively using the triangulation procedure.
1
Introduction
Computing the 3-D position of a point from its projection in two images is called triangulation and is a fundamental tool of computer vision [4]. The basic principle is to compute the intersection of the rays starting from the camera lens center and passing through the corresponding image points. However, point correspondence detection using an image processing operation incurs errors to some extent, and the two rays may not intersect. A naive solution is to compute the midpoint of the shortest segment connecting the two rays (Fig. 1(a)), but Kanatani [7] and Hartley and Sturm [5] pointed out that for optimal estimation the corresponding points should be displaced so that the rays meet in the scene (Fig. 1(b)) in such a way that the sum of the square displaced distances, or reprojection error , is minimized. For this, Hartley and Sturm [5] presented an algorithm that reduces to solving a 6th degree polynomial, while Kanazawa and Kanatani [13] gave a first approximation in an analytical form. Later, Kanatani et al. [11] showed that the first approximation is sufficiently accurate and that a few iterations lead to complete agreement with the Hartley-Sturm solution with far more efficiency. Lindstrom [14] further improved this approach. The aim of this paper is to demonstrate that exactly the same holds when the points we are viewing are constrained to be on a planar surface. This is a common situation in indoor and urban scenes. If a 3-D point is constrained to be on a known plane, the corresponding points must be displaced so that the rays not merely intersect but also meet on that plane. We call this planar triangulation after Chum et al. [3]. A first approximation solution was given by Kanazawa and Kanatani [12], while Chum et al. [3] presented an algorithm that reduces to solving an 8th degree polynomial. The purpose of this paper is to demonstrate that a few iterations of the first approximation lead to an R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 242–253, 2011. Springer-Verlag Berlin Heidelberg 2011
Optimal Two-View Planar Scene Triangulation
(a)
243
(b)
Fig. 1. Triangulation. (a) The midpoint of the shortest segment connecting the rays. (b) The points are optimally corrected so that their rays intersect.
optimal solution. We consider two cases: the case in which the plane and camera parameters are known, and the case in which they are not. The algorithm of Chum et al. [3] deals with the former. The latter case could be solved using the bundle adjustment approach, as demonstrated by Bartoli and Sturm [1] for an arbitrary number of images, but we show that a much simpler method exists. In fact, we show that our optimal triangulation procedure for the former case can easily be extended to the latter. In Sec. 2, we summarize the fundamentals about planar projection and homographies. In Sec. 3 and 4, we present an iterative algorithm for optimal planar triangulation for known plane and camera parameters. In Sec. 5, we show that our optimal triangulation procedure can be straightforwardly extended to the case of unknown plane and camera parameters. In Sec. 6, we do numerical simulation and demonstrate that our algorithm and the first approximation of Kanazawa and Kanatani [12] give practically the same value.
2
Planar Surface and Homography
Consider a plane with a unit surface normal n at distance d from the origin of an XY Z coordinate system fixed to the scene (Fig. 2(a)). We take images of this plane from two positions. The ith camera, i = 1, 2, is translated from the coordinate origin O by ti after rotated by Ri (Fig. 2(b)). We call {ti , Ri } the motion parameters of the ith camera. We assume that by prior camera
n
X d
Z
t1 R1
t2
R2
O Y
(a)
(b)
Fig. 2. (a) Plane and camera configuration. (b) The points are optimally corrected so that their rays intersect on the plane.
244
K. Kanatani and H. Niitsuma
(x’, y’)
(x, y)
Fig. 3. Two images of the same planar surface are related by a homography
calibration the image coordinate origin is placed at the principal point and that the aspect ratio is 1 with no image skew. The image of the plane taken from the first position, call it the “first image”, and the image taken from the second position, call it the “second image”, are related by the following homography (Fig. 3) [4, 7]: x = f0
h11 x + h12 y + h13 f0 , h31 x + h32 y + h33 f0
y = f0
h21 x + h22 y + h23 f0 . h31 x + h32 y + h33 f0
(1)
Here, f0 is a scale factor of approximately the size of the image for stabilizing numerical computation with finite length. The 3 × 3 matrix H = (hij ) is determined by the parameter {n, d} of the plane, the motion parameters {Ri , ti } and the focal lengths fi , i = 1, 2, in the following form [4, 7]: H = diag(1, 1,
f0 t1 n t2 n f1 )(I + )R1 diag(1, 1, ). )R2 (I − f2 d d − (t1 , n) f0
(2)
Here, I is the unit matrix, and diag(a, b, c) denotes the diagonal matrix with diagonal elements a, b, and c in that order. Throughout this paper, we denote the inner product of vectors a and b by (a, b).
3
Triangulation for Known Plane and Cameras
We first consider the case in which we know {n, d}, {Ri , ti }, and fi , i = 1, 2, hence the homography H. In homogeneous coordinates, we can write (1) as follow [4, 7]: ⎛ ⎞ ⎛ ⎞ ⎞⎛ x /f0 h11 h12 h13 x/f0 ⎝ y /f0 ⎠ ∼ (3) = ⎝ h21 h22 h23 ⎠ ⎝ y/f0 ⎠ . 1 h31 h32 h33 1 The symbol ∼ = denotes equality up to a nonzero constant. We can equivalently write (3) as ⎞ ⎛ ⎞ ⎛ ⎞ ⎞⎛ ⎛ h11 h12 h13 0 x/f0 x /f0 ⎝ y /f0 ⎠ × ⎝ h21 h22 h23 ⎠ ⎝ y/f0 ⎠ = ⎝ 0 ⎠ . (4) 1 h31 h32 h33 1 0 The three components of this equation multiplied by f02 are (ξ (1) , h) = 0,
(ξ (2) , h) = 0,
(ξ (3) , h) = 0,
(5)
Optimal Two-View Planar Scene Triangulation
245
S p
p
Fig. 4. The point p is orthogonally projected on to p¯ on S in the 4-D joint space
where we define the 9-D vectors h, ξ (1) , ξ (2) , and ξ (3) by h = h11 h12 h13 h21 h22 h23 h31 h32 h33 , ξ (1) = 0 0 0 −f0 x −f0 y −f02 xy yy f0 y , ξ (2) = f0 x f0 y f02 0 0 0 −xx −yx −f0 x , ξ (3) = −xy −yy −f0 y xx yx f0 x 0 0 0 .
(6)
A corresponding point pair (x, y) and (x , y ) can be identified with a point p = (x, y, x , y ) in the 4-D xyx y joint space. Each of the three equations in (5) defines a hypersurface in this 4-D joint space. However, the identity x ξ (1) + y ξ (2) + f0 ξ (3) = 0 holds, so (5) defines a 2-D variety (algebraic manifold) S in the 4-D joint space. In the presence of noise, the point p is not necessarily on S. Optimal planar triangulation is to displace p to a point p¯ on S in such a way that the reprojection error ¯ 2, (7) E = p − p is minimized subject to ¯ h) = 0, (ξ (k) (p),
k = 1, 2, 3.
(8)
Geometrically, this means orthogonally projecting p onto the variety S in the 4-D joint space (Fig. 4). Once such a p¯ = (¯ x, y¯, x ¯ , y¯ ) is obtained, the corresponding 3-D position (X, Y, Z) is determined by solving ⎛ ⎞ ⎛ ⎞ ⎞ ⎞ ⎛ ⎛ X X /f x ¯/f0 x ¯ 0 ⎜Y ⎟ ⎜Y ⎟ ⎟ ⎟ ⎝ y¯/f0 ⎠ ∼ ⎝ y¯ /f0 ⎠ ∼ (9) = P1 ⎜ = P2 ⎜ ⎝ Z ⎠, ⎝Z ⎠, 1 1 1 1 where P1 and P2 are the 3 × 4 projection matrices defined as follows [4, 11]: P1 = diag(1, 1,
f0 f0 )R1 I −t1 , P2 = diag(1, 1, )R 2 I −t1 . f1 f2
(10)
Four linear equations in X, Y , and Z are obtained from (9), but because (8) is satisfied, a unique solution is obtained [11].
246
4
K. Kanatani and H. Niitsuma
Optimal Planar Triangulation
We now present a new procedure for minimizing (7) subject to (8). While unconstrained triangulation [11] involves a single constraint describing the epipolar geometry, planar triangulation is constrained by three equations in the form of (8), not mutually algebraically independent. For this, the first approximation has already been presented by Kanazawa and Kanatani [12]. We modify their method so that an optimal solution is obtained by iterations. The procedure is as follows (the derivation is omitted; it is a straightforward extension of the unconstrained triangulation in [11]): 1. Let E0 = ∞ (a sufficiently large number), and define the 4-D vectors ⎛
⎞ x ⎜y⎟ ⎟ p=⎜ ⎝ x ⎠ , y
⎛
⎞ x ˆ ⎜ yˆ ⎟ ⎟, pˆ = ⎜ ⎝x ˆ ⎠ yˆ
⎛
⎞ x ˜ ⎜ y˜ ⎟ ⎟, p˜ = ⎜ ⎝x ˜ ⎠ y˜
(11)
˜ = y˜ = x ˜ = y˜ = 0. where we let x ˆ = x, yˆ = y, x ˆ = x , yˆ = y , and x (1) (2) (3) 2. Compute the following 9 × 4 matrices T , T , and T : ⎛
⎞ ⎛ ⎛ ⎞ ⎞ 0 0 00 f0 0 0 0 −ˆ y 0 0 −ˆ x ⎜ 0 0 00⎟ ⎜ 0 f0 0 0 ⎟ ⎜ 0 −ˆ y ⎟ y 0 −ˆ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ 0 0 00⎟ ⎜ 0 0 0 0⎟ ⎜ 0 0 0 −f0 ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ −f0 0 0 0 ⎟ ⎜ 0 0 0 0⎟ ⎜ x ˆ 0 x ˆ 0 ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ (2) ⎜ (3) ⎜ ⎟ ⎟ ˆ yˆ 0 ⎟ T (1) =⎜ ⎜ 0 −f0 0 0 ⎟, T =⎜ 0 0 0 0 ⎟ , T =⎜ 0 x ⎟. ⎜ 0 0 00⎟ ⎜ 0 0 0 0⎟ ⎜ 0 0 f0 0 ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ yˆ 0 0 x ⎜ −ˆ ⎜ 0 0 0 0 ⎟ ˆ⎟ x 0⎟ x 0 −ˆ ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎝ 0 yˆ 0 yˆ ⎠ ⎝ 0 −ˆ ⎝ 0 0 0 0 ⎠ y 0⎠ x −ˆ 0 0 0 f0 0 0 −f0 0 0 0 0 0
(12) 3. Compute the following ξ (1)∗ , ξ (2)∗ , and ξ (3)∗ : ˆ −f0 yˆ −f02 x ˆyˆ yˆyˆ f0 yˆ ˜ + T (1) p, ξ (1)∗ = 0 0 0 −f0 x (2)∗ (2) 2 ˆ f0 yˆ f0 0 0 0 −ˆ xx ˆ −ˆ yxˆ −f0 x ˆ ˜ ξ = f0 x + T p, yyˆ −f0 yˆ x ˆx ˆ yˆx ˆ f0 x ˆ 0 0 0 + T (3) p. ˜ xyˆ −ˆ ξ (3)∗ = −ˆ (kl)
4. Compute the following 9 × 9 matrices V0 (kl)
V0
(13)
[ξ]:
[ξ] = T (k) T (l) .
5. Compute the 3 × 3 matrix W = (W ) ⎛ ⎞− (11) (12) (13) (h, V0 [ξ]h) (h, V0 [ξ]h) (h, V0 [ξ]h) ⎜ ⎟ W = ⎝ (h, V0(21) [ξ]h) (h, V0(22) [ξ]h) (h, V0(23) [ξ]h) ⎠ , (31) (32) (33) (h, V0 [ξ]h) (h, V0 [ξ]h) (h, V0 [ξ]h) 2
(14)
(kl)
(15)
where ( · )− r denotes pseudoinverse constrained to rank r (the smallest eigenvalue is replaced by 0 in its spectral decomposition).
Optimal Two-View Planar Scene Triangulation
247
6. Update p˜ and pˆ as follows: p˜ =
3
W (kl) (ξ (l)∗ , h)T (k) h,
˜ pˆ ← p − p.
(16)
k,l=1
7. Evaluate the reprojection error E by ˜ 2. E = p
(17)
If E ≈ E0 , then return pˆ as p¯ and stop. Else, let E0 ← E and go back to Step 2. The use of the pseudoinverse constrained to rank 2 in (15) reflects the fact that only two of the three constraints in (5) are algebraically independent. This algorithm produces the same solution as that of Chum et al. [3], which solves an 8th degree polynomial. However, our algorithm involves only a few iterations of linear calculus without requiring any polynomial solver, which is sometimes inefficient and numerically unstable.
5
Triangulation for Unknown Plane and Cameras
Next, we consider the case in which {n, d}, {Ri , ti }, i = 1, 2, are unknown. As is well known [4, 7], these parameters can be estimated by computing the homography H between the two images, provided that the focal lengths fi are known; we assume that they are given by prior camera calibration. Since the analytical procedure for computing the plane and camera parameters from a homography H has been described by many researchers in different forms [6, 15, 18–20], the problem reduces to computing the homography H from point correspondences over the two images. The simplest method is to minimize the algebraic distance, which is known by many names such as least squares and DLT (Direct Linear Transformation) [4], but the accuracy is low in the presence of noise. A method known to be very accurate is what is called Sampson error minimization [4], and an iterative scheme was presented by Scoleri et al. [17]. However, Sampson error minimization does not necessarily compute an exactly optimal solution in the sense of maximum likelihood [9]. We now show that by combining the Sampson error minimization with the optimal planar triangulation described in Sec. 4, we can obtain an exactly optimal H. The basic principle is already presented by Kanatani and Sugaya [9], but their theory applies only for a single constraint, which is the case in ellipse fitting and fundamental matrix computation. Here, we extend their procedure to homographies constrained by multiple equations. Given N corresponding points (xα , yα ) and (xα , yα ), α = 1, ..., N , we compute the 9-D vector h that encodes the homography H. Omitting the derivation, we describe the procedure: 1. Let E0 = ∞ (a sufficiently large number), and give an initial guess of h in (6) using any method, say least squares.
248
K. Kanatani and H. Niitsuma
2. Let pα , pˆα , and p˜α be the vectors in (11) for the αth pair (xα , yα ), and (xα , yα ), α = 1, ..., N . (1) (2) (3) 3. Let Tα , Tα , and Tα be, respectively, the values of T (1) , T (2) , and T (3) in (12) for the αth pair, α = 1, ..., N . (kl) (k) (l) and the 3 × 3 matrices 4. Compute the 9 × 9 matrices V0 [ξα ] = Tα Tα (kl) (1)∗ (2)∗ Wα = (Wα ) in (15) for the αth pair, and let the 9-D vectors ξα , ξα , (3)∗ and ξα be the values of ξ (1)∗ , ξ (2)∗ , and ξ (3)∗ in (13) for the αth pair, α = 1, ..., N . 5. Compute the 9-D unit vector h that minimizes J=
N 3 1
Wα(kl) (ξα(k)∗ , h)(ξα(l)∗ , h). N α=1
(18)
k,l=1
6. Update p˜α and pˆα , α = 1, ..., N , as follows: p˜α =
3
Wα(kl) (ξα(l)∗ , h)Tα(k) h,
pˆα ← pα − p˜α .
(19)
k,l=1
N 7. Evaluate the reprojection error E = α=1 p˜α 2 . If E ≈ E0 , then return h and stop. Else, let E0 ← E and go back to Step 3. This procedure is identical to the optimal planar triangulation in Sec. 4 except for Step 5. The expression in (18) coincides with what is known as the Sampson (k)∗ (l)∗ (k) error [4] if ξα and ξα on the right-hand side are respectively replaced by ξα (l) and ξα (the values of (6) for the αth pair). It can be minimized by the scheme of Scoleri et al. [17], but here we use a much simpler reformulation of Kanatani et al. [16], which is a direct extension of the FNS (Fundamental Numerical Scheme) of Chojnacki et al. [2]. The procedure goes as follows: 1. Provide an initial value h0 for h, e.g., by least squares. 2. Compute the matrices M and L as follows: M=
N N N 1 (k) (l) (kl) 1 (kl) (k)∗ (l)∗ Wα ξα ξα , L = vα vα V0 [ξα ], N α=1 N α=1
(20)
k,l=1
where we define vα(k)
=
3
Wα(kl) (ξα(l)∗ , h).
(21)
l=1
3. Solve the eigenvalue problem (M − L)h = λh, and compute the unit eigenvector h for the smallest eigenvalue λ. 4. If h ≈ h0 , return h and stop. Else, let h0 ← h, and go back to Step 2.
(22)
Optimal Two-View Planar Scene Triangulation
249
For unconstrained triangulation, the optimal algorithm of Kanatani et al. [11], which assumes a given fundamental matrix, can be automatically converted to optimal fundamental matrix computation merely by inserting a Sampson error minimization step, as shown by Kanatani and Sugaya [8]. In contrast, the polynomial solving algorithm of Hartley and Sturm [5] cannot be so easily converted to optimal fundamental matrix computation. Similarly, the optimal planar triangulation in Sec. 4, which assumes a given homography, can be automatically converted to optimal homography computation merely by inserting a Sampson error minimization step. In contrast, the polynomial solving algorithm of Chum et al. [3] cannot be so easily converted to optimal homography computation.
6
Experiments
Figure 5(a) shows two images of a simulated grid. The image size is 500 × 500 pixels; the focal lengths are f1 = f2 = 600 pixels. We added Gaussian noise of mean 0 and standard deviation 1 pixel to the x and y coordinates of each of the N (= 121) grid points. Then, we reconstructed the 3-D position of each grid point by unconstrained triangulation [11] and by our planar triangulation. Figure 5(b) shows the 3-D positions of the grid points. We can see that by planar triangulation (in blue) all the points are on the specified plane but not by unconstrained triangulation (in red). For quantitative evaluation, we measured the root mean square reprojection error
N
1
yα − yα )2 + (ˆ xα − xα )2 + (ˆ yα − yα )2 , (ˆ xα − xα )2 + (ˆ (23) e= N α=1 xα , yˆα ) are the corrected positions of the observations (xα , yα ) where (ˆ xα , yˆα ) and (ˆ and (xα , yα ), respectively. We also evaluated the 3-D reconstruction error
N
1
D= rˆα − r¯α 2 , (24) N α=1 where rˆα is the reconstructed position of the αth point, and r¯α its true position. The table in Fig. 5 lists the values for unconstrained triangulation [11], the first approximation of the planar triangulation (the iteration is terminated after the first round), which corresponds to the result of Kanazawa and Kanatani [12], and the exact values computed by our method. From this table, we observe that the reprojection error e increases by assuming planarity. This is because the corresponding points need to be displaced so that the rays not simply intersect but also intersect on the specified plane. Statistical analysis [7] tells us that under maximum likelihood N e2 /σ 2 is subject to a χ2 distribution with N degree of freedom for unconstrained triangulation and with 2N degrees √of freedom for planar triangulation. Hence, e should be approximately σ and 2σ, respectively. The values in the table in Fig. 5 are very close to the prediction. However, the increase in the reprojection error e does
250
K. Kanatani and H. Niitsuma
(a)
(b)
reprojection theoretical 3-D error expectation error unconstrained 1.99503 2.00000 4.24466 planar 1st approx. 2.83127 — 0.95937 exact 2.83124 2.82843 0.95935 Fig. 5. (a) Simulated images of a planar grid taken from different places. (b) 3-D position of the reconstructed grid. Points reconstructed by planar triangulation (in blue) are on the specified plane, but those reconstructed by unconstrained triangulation [11] (in red) are not necessarily on it. The table below lists the reprojection error, its theoretical expectation, and the average 3-D reconstruction error. reprojection theoretical 3-D error expectation error known 2.83124 2.82843 0.01919 unknown 1st approx. 2.81321 — 0.13643 exact 2.81323 2.78128 0.13660 plane
Fig. 6. The reconstructed grid by estimating the plane and the camera positions (in red) and its true position (in black). The table lists the reprojection error of planar triangulation by estimating the plane and the camera positions, its theoretical expectation, and the average 3-D reconstruction error.
not mean the increase in the 3-D reconstruction error D. In fact, the 3-D reconstruction error D actually decreases with the knowledge of planarity. We also see that the exact values are very close to the first approximation of Kanazawa and Kanatani [12]. Next, we tested the case in which the plane and camera parameters are unknown. Since the absolute scale is indeterminate, we scaled the relative displacement between the cameras to unit length. Figure 6 shows the 3-D positions of the reconstructed grid (in red) and its true position (in black). Due to the error in estimating the plane, i.e., the homography, the computed position is slightly different from its true position. The table in Fig. 6 compares the reprojection error e and the 3-D reconstruction error D in the known and unknown plane cases. The values in the known plane case are the same as in the table in Fig 5 except the normalization t = 1.
Optimal Two-View Planar Scene Triangulation
251
We observe that the reprojection error e is smaller in the unknown plane case than in the known plane case. This is because the parameters of the plane are estimated so that the reprojection error is minimized. Statistical analysis [7] tells us that under maximum likelihood N e2 /σ 2 is subject to a χ2 distribution with 2N − 8 degrees of freedom and hence has expectation 2N − 8. This reflects the fact that the homography constraint has eight degrees of freedom with codimension two [7]. Consequently, the reprojection error e should approximately be 2(1 − 4/N )σ. The value in the table in Fig 5 is very close to the prediction. We also see that the first approximation (using only a single Sampson error minimization step) and the exact maximum likelihood value are very close to each other, as generally predicted in [9]. Again, the smaller reprojection error does not mean more accurate 3-D reconstruction. Rather, the 3-D reconstruction accuracy deteriorates because of the error in estimating the plane, as shown in the table in Fig. 5. Note that when the camera positions are unknown, the 3-D positions of the points cannot be reconstructed without the knowledge of planarity. If the points are in general position, their 3-D positions and the camera positions can be reconstructed from two views [10], but that computation fails if the points degenerate to be coplanar [4, 7].
7
Concluding Remarks
We have presented an optimal algorithm1 for computing the 3-D positions of points viewed from two images by using the knowledge that they are constrained to be on a planar surface. This is an extension of the unconstrained triangulation of Kanatani et al. [11] without assuming planarity. Our algorithm automatically encompass the case in which the plane and camera parameters are unknown; they are estimated merely by inserting a Sampson error minimization step. As a result, an exact maximum likelihood estimate is obtained for the homography between the two images. This is a complete parallel to the scheme of Kanatani and Sugaya [8] for computing an exact maximum likelihood estimate of the fundamental matrix between two images merely by inserting a Sampson error minimization step in the unconstrained triangulation of Kanatani et al. [11]. In contrast, the optimal triangulation of Hartley and Sturm [5], which solves a 6th degree polynomial, is not so easily converted to optimal fundamental matrix estimation. Similarly, the optimal planar triangulation of Chum et al. [3], which solves an 8th degree polynomial, is not so easily converted to produce an optimal homography. We have also confirmed experimentally that the first approximation is very close to the exact maximum likelihood estimate. Thus, we conclude that our optimal scheme is not really necessary in practice. In fact, that the Sampson error minimization solution is known to coincide with the exactly optimal solution up to several significant digits in many problems [9]. In our experiment, we used the least squares, also known as the DLT, to compute the initial value h to start the 1
The code is available at http://www.suri.cs.okayama-u.ac.jp/~kanatani/e/
252
K. Kanatani and H. Niitsuma
FNS procedure described in Sec. 5. We have observed that the Sampson error minimization iterations may not converge in the presence of extremely large noise and that the use of “HyperLS” or its “Taubin approximation” [16] can significantly extend the noise level range of convergence. Our optimal homography computation does not reach a global minimum of the reprojection error if the Sampson error minimization in Step 5 does not return a global minimum of the Sampson error. In fact, the FNS procedure described in Sec. 5 is not theoretically guaranteed to return a global minimum h of the Sampson error J, although it usually does. However, if the Sampson error J can be globally minimized, e.g., using branch and bound, in each iteration, we can obtain a global minimum solution h in the end, because the Sampson error J coincides with the reprojection error when the iterations have converged [9]. Acknowledgments. This work was supported in part by the Ministry of Education, Culture, Sports, Science, and Technology, Japan, under a Grant in Aid for Scientific Research (C 21500172).
References 1. Bartoli, A., Sturm, P.: Constrained structure and motion from multiple uncalibrated views of a piecewise planar scene. Int. J. Comput. Vis. 52(1), 45 (2003) 2. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: On the fitting of surfaces to data with covariances. IEEE Trans. Patt. Anal. Mach. Intell. 22(11), 1294–1303 (2000) 3. Chum, O., Pajdla, T., Sturm, P.: The geometric error for homographies. Comput. Vis. Im. Understand. 97(1), 86–102 (2002) 4. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 5. Hartley, R.I., Sturm, P.: Triangulation. Comput. Vision Image Understand. 68(2), 146–157 (1997) 6. Hay, J.C.: Optical motions and space perception—An extension of Gibson’s analysis. Psych. Rev. 73(6), 550–565 (1966) 7. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier, Amsterdam (1996); reprinted, Dover, York (2005) 8. Kanatani, K., Sugaya, Y.: Compact fundamental matrix computation. IPSJ Trans. Comp. Vis. Appl. 2, 59–70 (2010) 9. Kanatani, K., Sugaya, Y.: Unified computation of strict maximum likelihood for geometric fitting. J. Math. Imaging Vis. 38(1), 1–13 (2010) 10. Kanatani, K., Sugaya, Y., Kanazawa, Y.: Latest algorithms for 3-D reconstruction form two views. In: Chen, C.H. (ed.) Handbook of Pattern Recognition and Computer Vision, 4th edn., pp. 201–234. World Scientific, Singapore (2009) 11. Kanatani, K., Sugaya, Y., Niitsuma, H.: Triangulation from two views revisited: Hartley-Sturm vs. optimal correction. In: Proc. 9th British Machine Vis. Conf., September 2008, pp. 173–182 (2008) 12. Kanazawa, Y., Kanatani, Y.: Direct reconstruction of planar surfaces by stereo vision. IEICE Trans. Inf. & Syst. E78-D(7), 917–922 (1995) 13. Kanazawa, Y., Kanatani, K.: Reliability of 3-D reconstruction by stereo vision. IEICE Trans. Inf. & Syst. E78-D(10), 1301–1306 (1995)
Optimal Two-View Planar Scene Triangulation
253
14. Lindstrom, P.: Triangulation made easy. In: Proc. IEEE Conf. Comput. Vis. Patt. Recog. (June 2010) 15. Longuet-Higgins, H.C.: The reconstruction of a planar surface from two perspective projections. Proc. Royal Soc. Lond. B-227(1249), 399–410 (1986) 16. Niitsuma, H., Rangarajan, P., Kanatani, K.: High accuracy homography computation without iterations. In: Proc. 16th Symp. Sensing via Imaging Information (June 2010) 17. Scoleri, T., Chojnacki, W., Brooks, M.J.: A multi-objective parameter estimator for image mosaicing. In: Proc. 8th Int. Symp. Signal Processing and its Applications, August 2005, vol. 2, pp. 551–554 (2005) 18. Tsai, R.Y., Huang, T.S.: Estimation of three-dimensional motion parameters of a rigid planar patch. IEEE Trans. Acoustics Speech Signal Process. 29(6), 1147–1152 (1981) 19. Tsai, R.Y., Huang, T.S.: Estimation of three-dimensional motion parameters of a rigid planar patch III: Finite point correspondence and the three-view problem. IEEE Trans. Acoustics Speech Signal Process. 32(2), 213–220 (1984) 20. Tsai, R.Y., Huang, T.S., Zhu, W.-L.: Estimation of three-dimensional motion parameters of a rigid planar patch II: Singular value decomposition. IEEE Trans. Acoustics Speech Signal Process. 30(4), 525–544 (1982)
Pursuing Atomic Video Words by Information Projection Youdong Zhao1 , Haifeng Gong2 , and Yunde Jia1 1
School of Computer Science, Beijing Institute of Technology, Beijing 100081, China 2 GRASP Lab., University of Pennsylvania, Philadelphia, PA 19104, USA {zyd458,jiayunde}@bit.edu.cn,
[email protected]
Abstract. In this paper, we study mathematical models of atomic visual patterns from natural videos and establish a generative visual vocabulary for video representation. Empirically, we employ small video patches (e.g., 15×15×5, called video “bricks”) in natural videos as basic analysis unit. There are a variety of brick subspaces (or atomic video words) of varying dimensions in the high dimensional brick space. The structures of the words are characterized by both appearance and motion dynamics. Here, we categorize the words into two pure types: structural video words (SVWs) and textural video words (TVWs). A common generative model is introduced to model these two type video words in a unified form. The representation power of a word is measured by its information gain, based on which words are pursued one by one via a novel pursuit algorithm, and finally a holistic video vocabulary is built up. Experimental results show the potential power of our framework for video representation.
1
Introduction
There is a long history of research on texton or image primitive [1–3]. Image patches have played a significant role in recognition. In recent literature, video patches have attracted more and more attentions [4, 5]. However, they mainly focus on human action analysis by designing discriminative features. There lacks a systematic analysis and holistic view for the distribution of atomic motion patterns under a unified modeling framework. And it was never clear in vision what are the atomic video elements. In this paper, we study mathematical models of atomic visual patterns from natural videos and establish a generative visual vocabulary for video coding and representation. Empirically, we employ small video patches (e.g., 15 × 15 × 5, called video “bricks”) in natural videos as basic analysis unit. Fig. 1 shows some examples. From a mathematical perspective, these bricks embedded in the high (1125) dimensional space, form a variety of clusters of varying dimensions and structures. We adopt the notion “atomic video words” for these clusters which can be used as building blocks for video coding and representation. Our objective is two-fold: (i) Analyzing the space distribution of video bricks and dividing into 2 types under a common generative model; (ii) Studying a pursuit algorithm for R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 254–267, 2011. Springer-Verlag Berlin Heidelberg 2011
Pursuing Atomic Video Words by Information Projection sketch
flat
texture
Edge
Bar
N/A
N/A
water
grass/ leaves
crowded
trackable
motion
N/A
lighting
N/A
intrackable
still
appearance motion
255
Fig. 1. The types of appearance and motions of video bricks
selecting the 2 type video words step by step in the brick space and building up a visual vocabulary for video representation and modeling. These small video bricks are less affected by compositionality and thus remain “pure”. They are characterized by two factors: appearance and motion. Fig. 1 illustrates the catalogue of video bricks. According to appearance they are divided into three types: (i) flat patch; (ii) structural primitives, for object boundaries[1, 2]; and (iii) texture, for complex visual patterns which are often represented by MRF models [6]. Motion also divide them into three types: (i) still; (ii) trackable motion, such as moving attributed points [7], bars/edges [8], or articulated body [9] and (iii) intrackable motion, such as crowded flying birds and water flowing, which are often called dynamic texture [10, 11]. Another factor characterizing bricks is lighting changes, whose effect is low dimensional [12], e.g., an image patch under vehicle lights in surveillance video [13]. According to above categories of bricks, we categorize subspaces in the brick space into two types: structural video words (SVWs) and textural video words (TVWs). A SVW is spanned by one (or few) large base vectors, e.g. to represent a moving primitive, and a TVW is constrained by certain statistical properties– marginal histograms shared by all bricks in the subspace. As shown in Fig.2, bricks from a video sequence may fall in a small number of subspaces. That is, a video can be represented by some subspaces certainly. We introduce a common generative model for the two type words, which are pursued by information projection. The representation power of a word is measured by information gain, based on which words are pursued one by one, and finally a holistic video vocabulary is built up. This vocabulary could be applied for many video analysis tasks. Extensive experimental evaluations show the promising applications of the proposed framework. Our work is related to the work in [3], which maps the natural image patches into two type manifolds in terms of appearance. In our work, appearance and motion dynamics of spatiotemporal video patches are considered simultaneously.
256
Y. Zhao, H. Gong, and Y. Jia
Structural video brick
Brick Space
Textural video brick
Fig. 2. Map bricks from a video onto subspaces in the brick space: structural video words (SVWs) and textural video words (TVWs)
New modeling methods are introduced to capture the words efficiently. Our work is important for the rapidly increasing video analysis tasks. The proposed work is also related to dynamic texture [10], distributed oriented energy representation [14], and mixture of dynamic textures [15]. While they success in some video tasks, they all only employ one type model and do not account for the important structure information explicitly. We adopt two complementary models taking the informative structure part into account explicitly. The remainder of this paper is organized as follows: in Section 2, the formulation of the atomic video words is presented. After that we introduce how to model the two type words and pursue the video words in Section 3. The experimental results are presented in Section 4. We conclude in Section 5.
2
Formulation of the Atomic Video Word
We fix the size of brick, denoted by B, as 15 × 15 pixels in space, and 5 frames (about 200ms) in time, so that each brick contains a pure, not mixed, appearance or motion. Definition 1. A structural video word (SVW) is a subspace spanned by one (or few) basis functions, n αi Bi + }, (1) Ωsvw = {B : B = i=1
where the basis functions Bi are selected (pursued) from a spatio-temporal primitive bank Δsvw = {Bi }N i=1 , αi is the reconstruction coefficients and is the residual. The spatio-temporal primitive bank Δsvw is moving Gabor filters. We use large scale 2D Gabor filters at 8 orientations with size of 13 × 13 pixels. They move perpendicular to their orientations with speeds at 0, ±1, ±2, ±3, ±4, ±5, and 8 pixels. So we obtain N = 8 × 12 = 96 spatio-temporal basis functions.
Pursuing Atomic Video Words by Information Projection
257
Definition 2. A textural video word (TVW) is a subspace of brick that share a set of common constraints, Ωtvw = {B : Hi (B ∗ Fi ) = Hi∗ + , i = 1, 2, · · · , l}, (2) where Fi is selected from a filter bank Δtvw = {Fi }N i=1 , (· ∗ ·) denotes the convolution operator, Hi (B ∗ Fi ) is histogram of the filter responses over all pixels in a brick B, Hi∗ is mean histogram of the cluster, and is the fluctuation. We use small scale 2D Gabor filters at 8 orientations with size 7 × 7 in pixel. They move perpendicular to their orientations with speeds at 0, ±1, ±2, ±3, and 5 pixels. So we again obtain N = 8 × 8 = 64 spatio-temporal Gabor filters. In addition, the responses of filters are normalized against neighborhood for comparability. For each word Ω k ∈ {Ωtvw , Ωsvw } , there is an underlying distribution fk (B) to describe it. The objective of modeling is to learn a probability model pk (B) to approach the fk (B), which can be realized by minimizing the Kullback-Leibler divergence, fk (B) KL(fk pk ) = Efk log . (3) pk (B) Here, we give the common generative model for the two types of words: n 1 exp − λi ri (B) qk (B) , pk (B|λ, Δ) = Z(λ) i=1
(4)
n
where Δ = {Bi or Fi }i=1 are selected basis set, λ = (λ1 , · · · , λn ) are the corresponding parameters, Z (λ) is the partition function, and qk (B) is reference distribution. If the word is from Ωsvw , the selected basis set Δ = {Bi }ni=1 ⊂ Δsvw , ri (B) = max{Sigmoid(|B ∗ Bi |2 )} (i.e., ri (B) is the maximum of responses modulated by Sigmoid transform in B), and qk (B) usually is the distribution estimated offline from natural scene bricks. This model is similar to active basis model [16], with the difference in spatiotemporal filters. We can simply measure how well a brick B belongs to Ω k by Scoresvw = i ri (B). If the word is a TVW, the selected filter set Δ = {Fi }li=1 ⊂ Δtvw , ri (B) = Hi (B ∗ Fi ) − Hi∗ 2 , Hi∗ is the mean histogram pooled over the subspace, and qk (B) usually is the uniform distribution. How B belongs to Ω k could also be simply measured well a brick tvw ∗ 2 = i Hi − Hi . by Score
3
Atomic Video Words Pursuit
Based on the above definitions, in this section, we will show how to pursue the SVWs and TVWs in a common generative framework.
258
3.1
Y. Zhao, H. Gong, and Y. Jia
Pursuing the Model of a Word by Information Projection
Given a set of bricks sampled from a pure subspace (or word) Ω^k, its probability model defined in Eq. (4) can be iteratively pursued in the following two steps:

(I) Learning the parameter λ_+. As a new feature B_+ or F_+ is introduced, the corresponding parameter λ_+ is computed by minimizing the KL divergence in Eq. (3). For notational simplicity, we denote the current model as p and the augmented model as p_+, as illustrated in Fig. 3. Then the optimized augmented model is

    p_+^* = arg min_{p_+} KL(f || p_+),    (5)

subject to the statistical constraint

    E_{p_+}[r_+] = E_f[r_+] = r̄_+ = (1/|Ω^k|) Σ_{i=1}^{|Ω^k|} r_+(B_i),    (6)

where r_+ is an abbreviation for r_+(B), the response of feature B_+ or F_+ on B. Solving the above constrained optimization, we obtain the augmented model

    p_+(B) = (1/Z(λ_+)) p(B) exp{−λ_+ r_+(B)},    (7)

where λ_+ is calculated so that the new constraint (Eq. (6)) is satisfied and Z(λ_+) = E_p[exp{−λ_+ r_+}]. If we keep B_+ or F_+ independent of the previously selected features, we have E_p[exp{−λ_+ r_+}] = E_q[exp{−λ_+ r_+}], where q is the reference distribution (mentioned in Section 2). Then, due to the log-linear form of Eq. (7), λ_+ can be solved by a look-up table of λ_+ versus E_{p_+}[r_+]. Z(λ_+) can also be computed in a similar way.

Fig. 3. Sketch map for the feature pursuit process, showing the constrained set Ω_+ = {p : E_p[r_+(B)] = E_f[r_+(B)]} containing p_+, and the decomposition KL(f||p) = KL(f||p_+) + KL(p_+||p).
(II) Selecting a new feature B_+ or F_+. Among all the candidate features Δ_svw = {B_i} or Δ_tvw = {F_i}, the best parameter λ of each feature is computed as in Step (I) above, and, as shown in Fig. 3, the best feature B_+ or F_+ is chosen so that it maximizes the information gain KL(p_+ || p):

    B_+^* or F_+^* = arg max KL(p_+ || p) = arg max E_{p_+}[log (p_+ / p)] ≜ gain(B_+ or F_+).    (8)
Because the response of a new feature, r_+, is a one-dimensional constraint and the features are pursued one by one, the information gain can be computed as E_{p_+}[log (p_+ / p)] = −λ_+ E_{p_+}[r_+] − log Z(λ_+). For SVWs, λ is negative, so the above information gain suggests selecting the feature B_+ with the largest mean response E_{p_+}[r_+]. For TVWs, λ is positive, and the suggested feature F_+ is the one achieving the smallest sample variance E_{p_+}[||H_+(B ∗ F_+) − H_+^*||²]. This learning process is called information projection. Each time, it projects the model to a new constrained space specified by Eq. (6) to augment the model.

In practice, for computational efficiency with TVWs, instead of pursuing features one by one, we simplify the modeling problem by using all the features in Δ_tvw to form one long histogram feature (called the implicit representation) for each brick. The distance from each brick to the mean histogram of the cluster is then defined as

    r^tvw(B) = Σ_{i=1}^{|Δ_tvw|} ||H_i(B, F_i) − H_i^*||² = ||H(B) − H^*||²,    (9)

where H(B) = (H_1^t, ..., H_{|Δ_tvw|}^t)^t and H^* is the mean histogram. Then the new probability model for a TVW is

    p_k(B) = (1/Z(λ)) q(B) exp{−λ r^tvw(B)} = (1/Z(λ)) q(H) exp{−λ ||H − H^*||²} = p_k(H).    (10)

We simply assume that p_k(H) is Gaussian and that q(H) is uniform. p_k(H) is a multivariate Gaussian, but we assume that its variance-covariance matrix is of such low rank that we can essentially treat it as a univariate Gaussian along its major principal axis with variance σ². So in this case, λ = 1/(2σ²) and Z(λ) = sqrt(2πσ²), where σ² = (1/|Ω^k_tvw|) Σ_{i=1}^{|Ω^k_tvw|} ||H(B_i) − H^*||². This approximation is computationally efficient and reasonable, as shown in Section 4.1.
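A minimal sketch of the two parameter-estimation steps above, under the stated simplifications; the reweighting used here to emulate the look-up table for λ_+ and all function names are our assumptions, not the authors' code.

import numpy as np

def fit_tvw_word(hists):
    # Low-rank Gaussian approximation for a TVW: hists is an (M, D) array of
    # long histograms H(B) for the M bricks assigned to the word.
    h_star = hists.mean(axis=0)                           # mean histogram H^*
    sigma2 = np.mean(np.sum((hists - h_star) ** 2, axis=1))
    lam = 1.0 / (2.0 * sigma2)                            # lambda = 1/(2 sigma^2)
    log_z = 0.5 * np.log(2.0 * np.pi * sigma2)            # Z = sqrt(2 pi sigma^2)
    return h_star, lam, log_z

def solve_lambda(r_under_q, r_bar, lam_grid):
    # For an SVW feature: choose lambda_+ on a 1D grid so that the model
    # expectation E_{p+}[r_+], computed by reweighting samples of r_+ drawn
    # under the reference distribution q, matches the mean response of Eq. (6).
    best_lam, best_gap = lam_grid[0], np.inf
    for lam in lam_grid:
        w = np.exp(-lam * r_under_q)
        e_r = np.sum(w * r_under_q) / np.sum(w)
        gap = abs(e_r - r_bar)
        if gap < best_gap:
            best_lam, best_gap = lam, gap
    return best_lam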
3.2 Pursuing Atomic Video Words by Hybrid Modeling
Given millions of bricks sampled from a large number of natural scene videos, there are various TVWs and SVWs embedded in the high-dimensional brick space. In terms of the above model definitions, we employ an EM-like algorithm to learn the mixture of TVWs and the mixture of SVWs over the whole sample set, respectively, initialized by a hierarchical k-means clustering scheme [17]. For TVWs, the learning process is called implicit modeling, and the captured mixture of TVWs is denoted as Ω_tvw = {Ω_tvw^1, Ω_tvw^2, ..., Ω_tvw^N}. Explicit modeling denotes the learning procedure for the mixture of SVWs, Ω_svw = {Ω_svw^1, Ω_svw^2, ..., Ω_svw^M}. Ω_tvw and Ω_svw are two candidate sets of atomic video words, and there are heavy overlaps in space between them. Our goal is to learn a video vocabulary Ω = {Ω_1, Ω_2, ..., Ω_K} containing various atomic words with little overlap and covering the whole brick space. This procedure is called hybrid modeling, which is discussed in the following.
Let f(B) be the underlying true distribution governing the natural video brick ensemble Ω_nat,

    Ω_nat = ({Ω_svw^k}_{k=1}^{K_svw}, {Ω_tvw^k}_{k=1}^{K_tvw}, Ω_0) = {Ω_0, Ω_1, ..., Ω_K},    (11)

where K_svw + K_tvw = K and Ω_0 is the remaining uniform distribution. The probability model for Ω_nat is a mixture distribution:

    p(B | Λ, Δ) = Σ_{k=0}^{K} w_k p_k(B | Λ, Δ),    (12)
where w_k and p_k(B | Λ, Δ) are the weight and probability model (discussed above) of the k-th word, Λ is the parameter set, and Δ = {Δ_svw, Δ_tvw}. We now learn the mixture model p(B | Λ, Δ) in Eq. (12) to approach f(B) by minimizing the Kullback-Leibler divergence:

    KL(f || p) = E_f[log (f(B) / p(B))] = −E_f[log p(B)] + Const.    (13)

This minimization amounts to maximizing E_f[log p(B)], which can be decomposed as

    arg max_p E_f[log p(B)] = arg max_p E_f[log Σ_{k=0}^{K} w_k p_k(B)]
                            ≥ arg max_p Σ_{k=0}^{K} w_k E_f[log p_k(B)]
                            ≈ arg max_p Σ_{k=0}^{K} w_k E_{Ω_k}[log p_k(B)],    (14)

where w_k is approximated by the word frequency, and the approximation holds because the overlap between subspaces is negligible. The mean score E_{Ω_k}[log p_k(B)] over Ω_k can be computed by

    E_{Ω_k}[log p_k(B)] = E_{p_k}[log (p_k(B) / q(B)) + log q(B)] = E_{p_k}[log (p_k(B) / q(B))] + Const.    (15)

In order to avoid overfitting while minimizing KL(f || p), which is equivalent to maximum likelihood estimation, we introduce a model-complexity constraint −Σ_k α w_k log w_k in Eq. (14), where α is a positive parameter (e.g., α = 0.1). So the information gain of an SVW or TVW is

    l_k = w_k E_{p_k}[log (p_k(B) / q(B))] − α w_k log w_k,    (16)

where

    E_{p_k}[log (p_k(B) / q(B))] = Σ_{i=1}^{n} (−λ_i E_{p_k}[r_i] − log Z(λ_i)).    (17)
When the number of words is small and the frequency of a word is very high, this introduced constraint term −α w_k log w_k is a monotonically decreasing function of w_k and encourages splitting the word. On the contrary, when the frequency is low, it becomes a monotonically increasing function and favors merging this word with others.
Algorithm 1. Video words pursuit by hybrid modeling
Input: Ω_svw = {Ω_svw^1, Ω_svw^2, ..., Ω_svw^M} and Ω_tvw = {Ω_tvw^1, Ω_tvw^2, ..., Ω_tvw^N}
Output: Ω = {Ω_1, Ω_2, ..., Ω_K}
1  Initialize Ω = ∅, K ← 0;
2  repeat
3      Select the Ω_svw^s or Ω_tvw^s with the highest gain g_max calculated by Eqs. (16) and (17);
4      for each Ω^k ∈ {Ω_svw, Ω_tvw} do
5          Remove copies of B_i from all clusters if B_i ∈ Ω^s;
6          Recalculate the model of clusters overlapping with Ω^s in the two manners and select the one with higher gain;
7  until g_max < t, or the sample set is ∅, or Ω_tvw and Ω_svw = ∅;
The video words pursuit algorithm is summarized in Algorithm 1. This algorithm essentially ranks all the words in Ω_tvw and Ω_svw by their information gains, and removes overlaps along the way by favoring the word that achieves the greater information gain. This pursuit process may be viewed as seeking the words with the largest frequency and smallest entropy at each step. Through this stage, we obtain a generic video dictionary for video coding and modeling. For example, we could use it for video representation, e.g., as a high-dimensional word frequency vector, or for segmentation of videos composed of various appearances and motions, which will be shown in the next experiment section. It is worth noting that the whole process of our framework is unsupervised and scalable to larger video datasets.
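The sketch below mirrors the greedy structure of Algorithm 1; it omits the re-fitting of overlapping clusters in the two manners and treats the information gain as a callable, and the class and function names are hypothetical.

import numpy as np

class CandidateWord:
    # Minimal container for a candidate TVW or SVW: the ids of the bricks it
    # currently owns and a callable returning its information gain (Eq. (16)).
    def __init__(self, bricks, gain_fn):
        self.bricks = set(bricks)
        self.gain_fn = gain_fn

    def gain(self):
        return self.gain_fn(self.bricks)

def pursue_video_words(candidates, threshold):
    # Repeatedly pick the candidate with the largest gain, remove its bricks
    # from the remaining candidates, and stop when the best gain drops below t.
    vocabulary, remaining = [], list(candidates)
    while remaining:
        gains = [w.gain() for w in remaining]
        best = int(np.argmax(gains))
        if gains[best] < threshold:
            break
        chosen = remaining.pop(best)
        vocabulary.append(chosen)
        for w in remaining:
            w.bricks -= chosen.bricks          # remove overlapping bricks
        remaining = [w for w in remaining if w.bricks]
    return vocabulary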
4 Experiments
To evaluate the proposed method, we conduct three types of experiments: (i) Dynamic texture segmentation. We show the encouraging representation power of our implicit representation, which also indicates that the simplification and Gaussian assumption in Section 3.1 are reasonable. (ii) Generic video vocabulary learning. By learning a generic video vocabulary using the hybrid modeling, we gain some insight into the distribution of natural video bricks and show the representation power of the two types of words through an embedding experiment. (iii) Video scene recognition. Inspired by the above promising representation, a real video scene recognition task is conducted, which supplies a numerical comparison among related methods and shows the superiority of our method over others.

4.1 Dynamic Texture Segmentation by Implicit Representation
We first compare our implicit representation with three others, i.e., traditional optical flow, pixel filter response and pixel intensity, on two complementary synthetic sequences, as shown in Fig. 4. The implicit representation of a brick is a 64-dimensional L1-normalized histogram of filter responses pooled over the
Fig. 4. Segmentation by changing dynamics and texture orientation (columns: Frames, Our Results, Optical Flow, Filter Response, Pixel Intensity; reported r values per method: 0.9362, 0.7480, 0.5480, 0.5304 for row 1 and 0.9385, 0.7186, 0.5029, 0.5007 for row 2). In row 1, the two dynamic textures are identical in appearance, but differ in the dynamics (speeded up by a factor of 2). In row 2, segmentation of two dynamic textures which differ only in spatial orientation (rotated square area by 90 degrees) but share the same dynamics and general appearance.

Fig. 5. Four mean histograms corresponding to the segmentation results of our implicit representation in Fig. 4 (panels: (a) Changing dynamics, (b) Texture orientation; axes: Features vs. Normalized Responses; legend: Background, Foreground). The foreground indicates the square region in the center. It is apparent that while the background and foreground share many common statistics, they are distinguished by the dynamic speeds or texture orientation.
small brick volume¹. Each dimension corresponds to a filter mentioned in Section 2. The pixel filter response means we utilize the 64 filter responses at each pixel to characterize the pixel directly. The commonly used mean-shift algorithm [18] is adopted as the segmentation method; there is one mean-shift bandwidth parameter h that needs to be tuned. In Fig. 4, while the same segmentation scheme is employed, only our implicit representation succeeds in separating the two sequences. Fig. 5 supplies an intuitive explanation of how the proposed implicit representation is capable of dynamic texture segmentation. We also compare our implicit representation with the state-of-the-art methods² on a public dataset [15], as shown in Fig. 6. The results of DTM [15] on these examples require knowledge of the number of textures as well as hand contour initialization. The LDT [19] is a graphical model that has significantly larger computational
¹ Note that one could also divide the brick volume into several spatiotemporal cells, compute a histogram in each, and get a longer description.
² The compared results are available at http://www.svcl.ucsd.edu/projects/layerdytex/synthdb/
Fig. 6. Comparison with the state-of-the-art dynamic texture segmentation methods (columns: Frames, Our Results, LDT, DTM, GPCA; reported r values for texture_010 and texture_011: 0.8899, 0.7454, 0.8604, 0.5060 and 0.5442, 0.5028, 0.5001, 0.5002).

Fig. 7. Test on two real challenging scenes (panels (a)–(g); bar chart axes: Features vs. Normalized Responses).
requirements. While these state-of-the-art methods fail to separate the foreground and background in the two dynamic texture examples, the proposed method exhibits successful segmentation results.

The implicit representation is also evaluated on two real scenes in Fig. 7. Scene 1 contains rich dynamic textures with different spatial orientations, moving speeds and motion directions. Though there is no segmentation ground truth for this scene, the result (in Fig. 7(b)) is reasonable and meaningful. In scene 2, a human is crossing a fast-flowing river via a cableway. The human is separated from the fast-flowing river successfully in Fig. 7(d). For comparison, we apply the hybrid modeling framework to this scene, and the results are illustrated in Fig. 7(e). The whole brick space of this sequence is modeled by two TVWs (fast-moving water and human body) and five SVWs (cable, human legs, etc.), as shown in Fig. 7(f) and (g). It is apparent that the structural information lost by the above mean-shift scheme is captured and modeled well in the hybrid framework. From this result, it seems that the learnt hybrid vocabulary is a semantically better representation for the video scene and may be competent for some high-level visual tasks, e.g., video scene recognition.

4.2 Video Representation by Hybrid Vocabulary
In this experiment, 200 scene video sequences are collected (samples shown in Fig. 1). All these sequences are compressed by the Xvid coder at 25 fps, resized to 310 × 190, and last at least 3 seconds. We use 100 videos for video
Fig. 8. Word types in the dictionary. Red indicates TVWs and blue indicates SVWs
Fig. 9. Three video sequences and the information gains of the top 30 words in each of them (legend: TVP, SVP).
vocabulary learning. For each sequence, 60 frames are selected and converted to gray, from which 138,225 bricks are extracted. In total, nearly 1.4 × 10^7 bricks are collected and used. Through the hybrid modeling, a video dictionary composed of 1983 words is obtained. Fig. 8 shows the word types in the dictionary, with red indicating the textural video words and blue the structural video words. The TVWs are mainly distributed in the high information gain region, and the SVWs are the converse; in the middle region the numbers of the two types of words are comparable. Among the first 100 words, there are only four structural video words, which are a static edge, a bar, and two moving edges. The word with the highest information gain is a cluster in which the bricks are nearly flat. The distribution of information gains of these words is monotonically decreasing whereas the frequency is not, which indicates that some clusters are distributed more sparsely than others.

Once we obtain a ranked video dictionary, a video sequence can be represented in the fashion of Bag-of-Words. Owing to the two complementary types of words, we call the representation using our vocabulary Bag-of-2-Words (Bo2Ws). As in document analysis, the high-frequency words in a document often imply the document type. Here, we analyze this phenomenon in videos using the video dictionary. Fig. 9 shows three video sequences and the corresponding statistics of the top 30 words appearing in each of them. In the crowded video (1st column), there are mainly TVWs and several SVWs appearing on the edges. The SVWs are rich in the third, simple video. In the middle, complex scene (2nd column), the two types of bricks appear in turn. The top 30 words in each video suffice to distinguish the videos from each other. Here, we only take the two types of words into account in general and do not further consider more details within each word, e.g., texture type, structure type or motion direction.

In order to illustrate the representation power of the video dictionary intuitively, we design a visualization experiment using the Isomap algorithm [20].
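The Bag-of-2-Words vector can be formed as in the following sketch, assuming each brick of the video has already been assigned to its nearest word in the ranked dictionary; the function name and the normalization are our choices, and the dictionary size of 1983 is taken from this experiment.

import numpy as np

def bag_of_2_words(brick_word_ids, vocab_size=1983):
    # Quantize one video as an L1-normalized word-frequency vector (Bo2Ws).
    hist = np.bincount(np.asarray(brick_word_ids), minlength=vocab_size)
    hist = hist.astype(float)
    return hist / max(hist.sum(), 1.0)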
Fig. 10. Isomap visualization of the distribution of the test videos using our vocabulary (both axes scaled by 10^4; the digit near each point is the number of the corresponding video).
We select 164 video sequences for testing. Some of them share the same scenes as training videos but at a different time interval. Each testing video is quantized as a frequency vector of 1,983 dimensions. All 164 videos are projected into a 2-dimensional space by the Isomap algorithm (Fig. 10). The digit near each point is the number of the corresponding video. Samples of 9 video scenes are shown, which correspond to eight points (75, 34, 130, 155, 71, 54, 133, and 52) along the curve and the 136th point, respectively. The video complexity increases along the curve, and most of the videos are distributed in the middle region. The 136th video is a Marathon scene with a static camera, and the translational motion of the crowded people is simpler than the random motion (e.g., in the 52nd video). This agrees with the fact that most of the testing videos are of middle complexity and few videos are of extremely low or high complexity.

4.3 Video Scene Recognition
In order to numerically study the power of the vocabulary learnt by the proposed method, we conduct a video scene recognition task on the public Hollywood Scene Dataset established by Marszalek et al. [21], who were the first to evaluate on it. Their scene classifiers are based on a bag-of-words video representation and classification with SVMs. Average precision (AP), which approximates the area under a recall-precision curve, is employed as the performance measure. They use static (SIFT) and dynamic (HOG/HOF) features simultaneously for video representation and show that both static and dynamic features are important for recognition of scenes. Following the evaluation framework in [21], we compare four different video vocabularies learnt by simple K-means clustering (as a baseline), implicit modeling, explicit modeling and hybrid modeling. Specifically, we first extract spatiotemporal interest points randomly. Then we learn video vocabularies by the above four methods. For K-means clustering, we use the implicit
Fig. 11. Results of video scene recognition on the Hollywood Scene dataset (y-axis: Average precision (AP); scene classes: EXT-house, EXT-road, INT-bed., INT-car, INT-hotel, INT-kit., INT-liv., INT-office, INT-rest., INT-shop; methods: Kmeans, Implicit-Modeling, Explicit-Modeling, Mixed Modeling, Marszalek's Method).

Table 1. Mean Average Precisions of Five Methods over Ten Scenes

Methods   K-means   Im-modeling   Ex-modeling   Hy-modeling   Marszalek et al.'s
Mean AP   0.336     0.358         0.321         0.415         0.351
representation as the descriptions of bricks. Finally, SVM classifiers with a χ²-kernel are used to classify samples using a one-against-rest strategy. The results are shown in Fig. 11 and Table 1. Note that due to the simplification in Section 3.1, implicit modeling is similar to the K-means clustering method, and it is not surprising that they achieve similar results. For most of the scenes, implicit modeling using only TVWs obtains better results than explicit modeling, except for "INT-Car" and "INT-LivingRoom". This may be because in these two scenes there are more structural parts and the SVWs are more distinctive. The hybrid modeling, which combines the two types of words in a common generative framework, achieves improved performance and is superior to Marszalek et al.'s method, which also combines different types of features. Our results further confirm that SVWs and TVWs are two complementary atomic video patterns and are both important for video representation. This is consistent with the observations of Marszalek et al.'s work. Note that the different types of features are combined by a unified generative model in our method, unlike Marszalek et al.'s method, which fuses different features in an SVM framework using a multi-channel Gaussian kernel.
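For completeness, a sketch of the χ²-kernel one-against-rest classification used in this comparison is given below; it assumes scikit-learn is available and only outlines the classification step (feature extraction, the train/test protocol of [21], and the AP computation are omitted), so it should not be read as the authors' exact pipeline.

import numpy as np
from sklearn.svm import SVC

def chi2_kernel(X, Y, gamma=1.0, eps=1e-10):
    # Exponentiated chi-square kernel between Bo2Ws frequency vectors.
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        num = (x[None, :] - Y) ** 2
        den = x[None, :] + Y + eps
        K[i] = np.exp(-gamma * 0.5 * np.sum(num / den, axis=1))
    return K

def train_one_against_rest(X_train, y_train):
    # One SVM per scene class, trained on the precomputed chi-square kernel.
    K = chi2_kernel(X_train, X_train)
    classifiers = {}
    for c in np.unique(y_train):
        clf = SVC(kernel="precomputed")
        clf.fit(K, (y_train == c).astype(int))
        classifiers[c] = clf
    return classifiers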
5 Summary
In this paper, we have presented a unified framework for characterizing various video bricks to generate a "panorama map" of words in the brick space. These words are "weighted" (by information gain) and pursued under an information projection framework. A video dictionary containing two complementary types of words is obtained for video coding and modeling. We show some applications of the video word representation and achieve promising results compared with state-of-the-art methods. These video words are atomic structures from which higher-level video patterns can be constructed through composition. In future work, we will study how words are composed to form larger motion patterns, such as human actions, over larger image patches and longer time sequences.
Acknowledgement. This work was done at Lotus Hill Research Institute and was supported by the Natural Science Foundation of China (NO. 90920009 and NO. 60875005) and the Chinese High-Tech Program (NO. 2009AA01Z323).
References
1. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996)
2. Zhu, S.C., Guo, C.E., Wang, Y.Z., Xu, Z.J.: What are textons? IJCV (2005)
3. Shi, K., Zhu, S.C.: Mapping natural image patches by explicit and implicit manifolds. In: CVPR (2007)
4. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS (2005)
5. Shechtman, E., Irani, M.: Space-time behavior-based correlation. PAMI (2007)
6. Zhu, S.C., Wu, Y.N., Mumford, D.: Filters, random-fields and maximum-entropy (FRAME): Towards a unified theory for texture modeling. IJCV 27, 107–126 (1998)
7. Veenman, C., Reinders, M., Backer, E.: Resolving motion correspondence for densely moving points. PAMI (2001)
8. Olshausen, B.A.: Learning sparse, overcomplete representations of time-varying natural images. In: ICIP (2003)
9. Wu, Y., Hua, G., Yu, T.: Tracking articulated body by dynamic Markov network. In: ICCV (2003)
10. Soatto, S., Doretto, G., Wu, Y.: Dynamic textures. In: ICCV (2001)
11. Wang, Y., Zhu, S.C.: Modeling textured motion: Particle, wave and sketch. In: ICCV (2003)
12. Belhumeur, P., Kriegman, D.: What is the set of images of an object under all possible illumination conditions? Int. Journal of Computer Vision 28, 245–260 (1998)
13. Zhao, Y.D., Gong, H., Lin, L., Jia, Y.: Spatio-temporal patches for night background modeling by subspace learning. In: ICPR (2008)
14. Derpanis, K.G., Wildes, R.P.: Early spatiotemporal grouping with a distributed oriented energy representation. In: CVPR (2009)
15. Chan, A.B., Vasconcelos, N.: Modeling, clustering, and segmenting video with mixtures of dynamic textures. PAMI 30, 909–926 (2008)
16. Wu, Y.N., Si, Z., Fleming, C., Zhu, S.C.: Deformable template as active basis. In: ICCV (2007)
17. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR (2006)
18. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. PAMI (2002)
19. Chan, A.B., Vasconcelos, N.: Layered dynamic textures. PAMI 31, 1862–1879 (2009)
20. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science (2000)
21. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR (2009)
A Direct Method for Estimating Planar Projective Transform

Yu-Tseh Chi¹, Jeffrey Ho¹, and Ming-Hsuan Yang²

¹ Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, U.S.A. {ychi,joh}@cise.ufl.edu
² Electrical Engineering and Computer Science, University of California at Merced, Merced, CA 95344, U.S.A. [email protected]
Abstract. Estimating a planar projective transform (homography) from a pair of images is a classical problem in computer vision. In this paper, we propose a novel algorithm for directly registering two point sets in R² under a projective transform without using intensity values. In this very general context, there are no easily established correspondences that can be used to estimate the projective transform, and most of the existing techniques become either inadequate or inappropriate. While the planar projective transforms form an eight-dimensional Lie group, we show that for registering 2D point sets, the search space for the homographies can be effectively reduced to a three-dimensional space. To further improve the running time without significantly reducing the accuracy of the registration, we propose a matching cost function constructed using local polynomial moments of the point sets and a coarse-to-fine approach. The resulting registration algorithm has linear time complexity with respect to the number of input points. We have validated the algorithm using point sets collected from real images. Preliminary experimental results are encouraging, and they show that the proposed method is both efficient and accurate.
1 Introduction
The importance of 2D projective transforms (planar homographies) in computer vision stems from the well-known fact that different images of a planar scene are related by planar projective transforms. For vision applications that deal with planar objects, examples of which include recognizing hand drawings and traffic signs, estimating the projective transform between two images is required for normalizing the image variation due to different cameras. Accordingly, there are now well-established techniques for automatically recovering the planar homography between images [1]. For these feature-based methods, the idea is to register two sets of interest points P = {p_1, ..., p_k} and Q = {q_1, ..., q_k} extracted from the two images using projective transforms. Intensity values (local intensity patterns) provide important clues on establishing correspondences across images,
Fig. 1. Four examples for which a direct application of SIFT [2] fails to produce correct correspondences. The first two images are hand drawings on a whiteboard, and the others are images of street signs.
and the Direct Linear Transform (DLT) requires only four correspondences in order to compute a homography. Implicitly, the algorithms define a discrete search space S that contains a large number of projective transforms (from all possible correspondences made across the images); the image intensity pattern is used to prune away most parts of S that are irrelevant, and the algorithms exhaustively test each remaining homography according to some matching cost function to obtain the optimal homography.

In this paper, we study the 2D projective registration problem in its barest form: using only point sets without intensity values. While DLT requires only four correct correspondences, there are cases where it is difficult to correctly determine these correspondences automatically. Figure 1 shows four pairs of such images that cannot be registered satisfactorily using existing methods. The lack of salient local intensity patterns often confuses algorithms that match local feature descriptors. Furthermore, local geometric structures such as curvatures are not reliable features for computing correspondences since they can vary wildly under projective transforms. Nevertheless, we can still solve the problem by sampling a collection of points from each image and registering the point sets using projective transforms¹. A set of four points (or more) can be chosen, and a homography search space S can be defined as consisting of the homographies generated by all correspondences for these four points. For each homography in S, its registration quality can be evaluated by a matching cost function such as

    Γ(H; P, Q) = Σ_{j=1}^{l} min_{i=1,...,k} |q_j − H(p_i)|².    (1)
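For reference, the cost of Equation 1 can be evaluated as in the following sketch (the names are illustrative); its quadratic behaviour in the number of points is what motivates the cheaper moment-based cost introduced later.

import numpy as np

def apply_homography(H, pts):
    # Map (k, 2) points by the 3x3 homography H, as in Eq. (3).
    ph = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))])
    q = ph @ H.T
    return q[:, :2] / q[:, 2:3]

def gamma_cost(H, P, Q):
    # Eq. (1): for each q_j, the squared distance to its nearest H(p_i);
    # all pairwise distances are formed, hence the quadratic complexity.
    HP = apply_homography(H, P)
    Q = np.asarray(Q, dtype=float)
    d2 = np.sum((Q[:, None, :] - HP[None, :, :]) ** 2, axis=2)
    return float(np.sum(d2.min(axis=1)))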
For point sets containing 500 points, S contains roughly 65 billion homographies. While this is by no means a small number, it is still within the capability of modern hardware to exhaustively search S for the optimal homography. However, the quartic time complexity makes this approach unattractive: if 5000 points were sampled instead, the search space would increase enormously, to the point of rendering the algorithm impotent, even though the two problems do not differ in any essential way other than the number of points (they have the same images). The approach advocated in this paper is to abandon the usual paradigm of defining the search space S using a small number of correspondences, which is often impossible for us anyway. Instead, the search space S should be defined entirely
¹ Or we can register the binary image obtained from the segmented image as in [3].
by the point sets in some global way. The main contribution of this paper is to show that, under a fairly general assumption on the variability of the homographies, we can define an effective three-dimensional (discrete) search space S for projective registration of two point sets in R². The size of S is independent of the sizes of the input point sets, and in many cases, it is possible to exhaustively search for the optimal homography in S. Therefore, the time complexity of the algorithm depends on the complexity of evaluating the matching cost function. For Γ defined in Equation 1, the complexity is quadratic since it requires the computation of pairwise distances. However, by using a less expensive matching cost function constructed from local polynomial moments of the point sets, we show that it is possible to have an accurate projective registration algorithm with linear time complexity.

The projective registration method proposed in this paper is direct in its truest sense: the algorithm takes in two point sets, which can be readily generated from images using low-level image processing, and outputs a projective transform that matches them. No other extra mid- or high-level processing, such as contour detection or local feature extraction, is necessary. Since the algorithm exhaustively evaluates every homography in the search space S, it can provide a reasonable guarantee on the quality of the registration. We conclude this introduction with a summary of the three main contributions of this paper:

1. We show that the search space S for planar projective matching can be effectively reduced to a three-dimensional discrete search space, and the resulting algorithm has a quadratic O(n²) time complexity in the size of the input point sets, which is a substantial improvement over the existing quartic complexity.
2. We show that the time complexity of the matching algorithm can be further reduced to O(n) by using a matching cost function constructed from local polynomial moments.
3. We develop and implement a coarse-to-fine registration approach.

We validate the proposed method using simple point sets and point sets collected from real images, and encouraging preliminary results suggest the method's potential for future applications.
2 Related Works
Estimating homography from pairs of images has been studied quite extensively in the literature. In particular, [1] contains a chapter on feature-based methods using the Direct Linear Transform (DLT) and RANSAC [4]. The feature-based algorithms typically extract a large number of interest points from the images, and tentative correspondences are computed across images by matching descriptors associated with the interest points. A large number of hypothetical homographies are generated, each from a small number of tentative correspondences, and evaluated with some matching cost function. The homography that provides the best registration result is used as the initial guess in a non-linear optimization algorithm that forms the second, refinement step.
While the feature-based methods have become part of the standard repertoire of computer vision algorithms, they are clearly inadequate for handling point sets without the associated intensity values, since the tentative correspondences cannot be established as easily without local intensity patterns. It seems that not much work has been done on matching points directly using projective transforms. Perhaps the difficulty of the problem is best illustrated by the non-existence result for finite projective invariants shown in [5]. However, in [6], a set of infinite projective invariants has been proposed, and some of its implications have recently been investigated in the work of [3, 7] on registering binary images. The method proposed in [3] is very general and works with binary images obtained from segmented images. The algorithm matches the values of a collection of integrals, which can then be turned into a collection of independent equations. The non-linear minimization in [3] is solved using the Levenberg-Marquardt algorithm, which in general, unfortunately, does not guarantee convergence to the global minimum. Finally, we remark that the affine registration of point sets has been studied quite thoroughly in the literature, e.g., [8–17], and many of these algorithms have found important applications in vision, graphics and other areas. In addition, non-rigid point-set registration has also received a fair amount of attention, particularly in the medical imaging community, e.g., [18, 19].
3 2D Projective Registration
Let P = {p_1, ..., p_k} and Q = {q_1, ..., q_l} denote the two point sets for which we are seeking a 2D planar homography H,

    H = [ h11  h12  h13
          h21  h22  h23
          h31  h32   1  ],    (2)
that will align them. The entry h33 is set to one; this corresponds to the assumption that the planar homography does not map the origin to the line at infinity. Assume for the moment that the two point sets have an equal number of points, k = l, and the correspondences under H are exact: q_i = H(p_i) for 1 ≤ i ≤ k. Given H as above, the formulas for computing the transformation q = H(p) (with p = [p_x, p_y], q = [q_x, q_y]) are

    q_x = (h11 p_x + h12 p_y + h13) / (h31 p_x + h32 p_y + 1),
    q_y = (h21 p_x + h22 p_y + h23) / (h31 p_x + h32 p_y + 1).    (3)
For a fixed pair (h31, h32), the above formulas become

    q_i = (1/α_i) A p_i + (1/α_i) t,    (4)

where

    A = [ h11  h12 ;  h21  h22 ],    t = [ h13 ;  h23 ],
and the point-dependent constant α_i equals the denominator in Equation 3. When h31 = h32 = 0, α_i = 1 for all i and Equation 4 reduces to an affine transform. For any other pair (h31, h32), Equation 4 defines a transform between the modified point set P̃ = {p_1/α_1, ..., p_k/α_k} and Q that appears superficially similar to an affine transform, except for the non-uniform translation t/α_i. The pair (h31, h32), the non-affine components of H, will play an important role in the following discussion.

The basic question we strive to answer in this paper is: Suppose the ranges of the two non-affine components are known, σ1 ≤ h31 ≤ σ2 and σ3 ≤ h32 ≤ σ4 for some σ1, σ2, σ3, σ4. How should this information be utilized in 2D projective registration? Our main observation, which motivates our approach, is that for a fixed pair (h31, h32), registration using Equation 4 can be effectively reduced to a search problem on a compact and closed one-dimensional subset O_{h31,h32} of GL(2, R). This is a straightforward generalization of a well-known result for affine transforms described below. Armed with this result, our method is very easy to describe: Given the range of the two non-affine components and some specified precision requirement for the homography matrix, we can form a 2D grid H of values for (h31, h32). For each (h31, h32) ∈ H, we can also discretize its corresponding one-dimensional set O_{h31,h32}, and this provides us with a three-dimensional discrete search space S ≡ {s_ijk} such that each s_ijk ∈ S corresponds to a 2D projective transform. Our algorithm simply does an exhaustive search over S to locate the optimal homography.
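The resulting procedure can be summarized by the loop below, a sketch only: build_fn stands for the construction of the homography from (h31, h32) and the single angle parameterizing O_{h31,h32} (the "unwrapping" described in Sections 3.1 and 3.2), and the default grid and angle steps are assumptions.

import numpy as np

def exhaustive_search(P, Q, build_fn, cost_fn,
                      h_range=(-5.0, 5.0), dh=0.1, dtheta_deg=1.0):
    # Exhaustive search over the three-dimensional discrete space S:
    # a 2D grid over (h31, h32) and, for each pair, a discretized angle theta.
    # build_fn(P, Q, h31, h32, theta) must return the 3x3 homography and
    # cost_fn(H, P, Q) its matching cost.
    best_H, best_cost = None, np.inf
    grid = np.arange(h_range[0], h_range[1] + 1e-9, dh)
    thetas = np.deg2rad(np.arange(0.0, 360.0, dtheta_deg))
    for h31 in grid:
        for h32 in grid:
            for theta in thetas:
                H = build_fn(P, Q, h31, h32, theta)
                c = cost_fn(H, P, Q)
                if c < best_cost:
                    best_H, best_cost = H, c
    return best_H, best_cost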
3.1 Special Case of Affine Transforms
For an affine registration, q_i = A p_i + t, the translation component t satisfies the relation m_Q = A m_P + t, where m_Q, m_P are the centers of mass of Q and P, respectively. Once the points are centered,

    p_i → p_i − m_P,    q_i → q_i − m_Q,

we have only the linear component q_i = A p_i. The affine transform must match the two covariance matrices

    S_P = (1/k) Σ_{i=1}^{k} p_i p_i^t,    S_Q = (1/k) Σ_{i=1}^{k} q_i q_i^t,

and S_Q = A S_P A^t. Since both S_P and S_Q are positive-definite, their square roots exist, and the equation above implies that

    I_{2×2} = (S_Q^{-1/2} A S_P^{1/2})(S_Q^{-1/2} A S_P^{1/2})^t.

That is, the matrix S_Q^{-1/2} A S_P^{1/2} is an orthogonal matrix, i.e., a rotation. In particular, under the coordinate transforms p_i → S_P^{-1/2} p_i and q_i → S_Q^{-1/2} q_i, the transformed point sets are related by an orthogonal transform. Note that the transformed point sets now have the identity matrix as their covariance matrices.
The significance of this result in the context of affine registration is that after the two-step normalization that turns each point set into a point set with zero mean and unit variance, the affine registration is reduced to a rigid registration: finding an orthogonal transform (rotation) in O(2), which is a one-dimensional subset (and subgroup) of GL(2, R). Essentially, this normalization step requires that the affine transform exactly match the linear and quadratic moments of the two point sets [7], and this constraint reduces the original 4D problem to a 1D problem, which is considerably easier to solve. Once the rotation has been determined, we can "unwrap" it to get the full affine transform. This result can be generalized directly to the projective case, and the main complication comes from the non-uniform translation in Equation 4. The basic idea is still the same in that the transform specified in Equation 4 should match the linear and quadratic moments of the two point sets. The result of this more general reduction will still be a one-dimensional subset of GL(2, R), but not a subgroup. Nevertheless, it can be effectively parameterized using trigonometric functions, and hence efficiently discretized as well.
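A minimal sketch of this two-step normalization (zero mean, identity covariance); scipy's matrix square root is assumed, and the returned (W, m) would be used to unwrap a rotation found later into the full affine transform.

import numpy as np
from scipy.linalg import sqrtm

def whiten(points):
    # Center the points and map them by S^{-1/2} so that the result has zero
    # mean and identity covariance.
    pts = np.asarray(points, dtype=float)
    m = pts.mean(axis=0)
    X = pts - m
    S = (X.T @ X) / len(X)
    W = np.linalg.inv(sqrtm(S).real)
    return X @ W.T, W, m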
3.2 1D Reduction
The main result of this subsection is the following proposition.

Proposition 1. Suppose P, Q are two sets of points in R² related exactly by a projective transform. There exist coordinate transforms

    p_i → p̃_i,    q_i → q̃_i,    (5)

and a vector t, a matrix E, a diagonal matrix D and constants β_1, ..., β_k determined entirely by P, Q such that the new sets of transformed points P̃, Q̃ are related by the following transform

    q̃_i = A p̃_i + β_i t,    (6)

where the matrix A satisfies the equation

    A D A^t + A E + E^t A^t + t t^t = I_{2×2}.    (7)

Proof: We will prove this proposition in a somewhat leisurely way by exhibiting all the steps involved in the coordinate transforms p_i → p̃_i, q_i → q̃_i. Suppose the unknown projective transform is given by Equation 4. We can make the transform p_i → (1/α_i) p_i to absorb the constant 1/α_i:

    q_i = A p_i + (1/α_i) t.    (8)

Taking the average of both sides gives m_Q = A m_P + α t, where α = (1/k) Σ_{i=1}^{k} α_i^{-1}. Combining the two equations above yields q_i = A p_i + β_i (m_Q − A m_P), where β_i = 1/(α α_i).
Making another transform p_i → p_i − β_i m_P, we have q_i = A p_i + β_i m_Q. Let S_P, S_Q denote the covariance matrices; the transforms p_i → S_P^{-1/2} p_i and q_i → S_Q^{-1/2} q_i turn the above equation into

    q_i = S_Q^{-1/2} A S_P^{1/2} p_i + β_i S_Q^{-1/2} m_Q ≡ A p_i + β_i t,

where we have let A = S_Q^{-1/2} A S_P^{1/2} and t = S_Q^{-1/2} m_Q. Since now Σ_{i=1}^{k} q_i q_i^t is the identity, we have

    I_{2×2} = A S A^t + A E + E^t A^t + t t^t,

where S = Σ_{i=1}^{k} p_i p_i^t and E = Σ_{i=1}^{k} β_i p_i t^t. Furthermore, since S is symmetric and positive semi-definite, it has an eigen-decomposition S = U D U^t, where D is a diagonal matrix with non-negative entries and U is an orthogonal matrix. If we define A ≡ A U, which is equivalent to the coordinate transform p_i → U^t p_i, we finally have

    A D A^t + A E + E^t A^t + t t^t = I_{2×2},

where E ≡ U^t E.

For each pair (h31, h32), the set of matrices A satisfying Equation 7 is the set O_{h31,h32} mentioned earlier. The following proposition shows that it is a one-dimensional subset of GL(2, R). As is clear from the above derivation, the matrices E, D, the vector t and the constants β_i can all be immediately computed from the input points given (h31, h32). Once the transformed point sets have been registered using Equation 6 with A satisfying Equation 7, it is straightforward to unwrap the above steps to recover the homography matrix for the original point sets, exactly as in the affine case. What is left is an effective parameterization of the matrices A that satisfy Equation 7, which is given by the following proposition.
Proposition 2. The components of the matrix A = [ a b ; c d ] are determined by the matrices E, D, T = t t^t and the angles θ, θ̄:

    a = (√(rP) cos θ − E)/r,  b = (√(sP) sin θ − F)/s,  c = (√(rQ) cos θ̄ − G)/r,  d = (√(sQ) sin θ̄ − H)/s,    (9)

where E = [ E G ; F H ], T = [ R T ; T S ], D = [ r 0 ; 0 s ], P = 1 − R + E²/r + F²/s, and Q = 1 − S + G²/r + H²/s. The two angles θ, θ̄ satisfy cos(θ − θ̄) = (GE/r + FH/s − T) / √(PQ).
Proof: The proof is straightforward. Expanding the matrix equation (Equation 7), we have

    1 = r a² + s b² + 2aE + 2bF + R,    (10)
    1 = r c² + s d² + 2cG + 2dH + S,    (11)
    0 = r a c + s b d + cE + dF + aG + bH + T.    (12)

Completing the squares in the first two equations gives

    (√r a + E/√r)² + (√s b + F/√s)² = 1 − R + E²/r + F²/s ≡ P,
    (√r c + G/√r)² + (√s d + H/√s)² = 1 − S + G²/r + H²/s ≡ Q.    (13)

This gives the four equations in Equation 9. Equation 12 gives the relation between θ and θ̄: Substituting the four equations in Equation 9 into Equation 12 and collecting terms, we have

    √(PQ) cos θ cos θ̄ + √(PQ) sin θ sin θ̄ − GE/r − FH/s + T = 0,

which implies that

    cos(θ − θ̄) = cos θ cos θ̄ + sin θ sin θ̄ = (GE/r + FH/s − T) / √(PQ).    (14)

The two propositions together provide us with an efficient way to define the three-dimensional search space S mentioned earlier. For a given fixed S, the complexity of the algorithm depends on the matching cost function. For example, Equation 1 gives a quadratic time complexity algorithm since it requires the computation of pairwise distances. To further reduce the time complexity in our current setup, a computationally less expensive matching cost function is required.
3.3 Matching Cost Function Using Local Moments
Inspired by the recent works on using moments for registration [7, 20], we propose a simple and efficient matching cost function constructed directly from local polynomial moments. The (square) region containing the target point set Q is divided into four subregions R_1, R_2, R_3, R_4. We compute a vector descriptor D^n_Q(R_i) for each region R_i using the averaged values of the monomials of degree n evaluated at the points of Q lying in R_i. For example, using the two linear monomials, the descriptor is D^1_Q(R_i) = (1/N_i)[Σ_j x_j, Σ_j y_j], where the sum is over all points q_j ∈ R_i and N_i is the number of points in R_i. The two components are the linear moments, and for quadratic moments, we have D^2_Q(R_i) = (1/N_i)[Σ_j x_j², Σ_j x_j y_j, Σ_j y_j²]. In general, for degree-n moments, the corresponding descriptor vector has n + 1 components, D^n_Q(R_i) = (1/N_i)[Σ_j x_j^n, ..., Σ_j x_j^k y_j^{n−k}, ..., Σ_j y_j^n]. The descriptors can be defined for any point set, and in particular, we can define a matching cost function based on comparing these four local moments,

    Γ^n(H; P, Q) = Σ_{i=1}^{4} |D^n_Q(R_i) − D^n_{H(P)}(R_i)|²,

where H(P) denotes the image of P under H. For example, with linear moments, the matching algorithm will try to match the center of mass of the points lying in each region. Unfortunately, the linear moments, while very easy to compute, do not encode much geometric information, and matching them provides the weakest constraint for the registration. However, moments of higher degree, such as quadratic and cubic moments, offer several possibilities, and matching them in the least-squares sense as above provides an informative and inexpensive way to compare the point
sets locally. We remark that, because the moments defined above are averaged values, they are generally insensitive to changes in the number of points. Finally, it is clear that, without computing pairwise distances, the matching cost function Γ^n has linear time complexity.
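A sketch of the descriptor and the resulting cost is given below; how points of H(P) falling outside the square containing Q are assigned to the four subregions is a simplification of ours (they are binned by the sign of their offset from the square's center), and the function names are illustrative.

import numpy as np

def moment_descriptor(points, square, degree=2):
    # Degree-n local moment descriptor: split the square region into four
    # quadrants R_1..R_4 and average the monomials x^(n-k) y^k in each.
    pts = np.asarray(points, dtype=float)
    (x0, y0), (x1, y1) = square
    mx, my = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
    desc = []
    for right in (False, True):
        for top in (False, True):
            mask = ((pts[:, 0] >= mx) == right) & ((pts[:, 1] >= my) == top)
            sub = pts[mask]
            if len(sub) == 0:
                desc.append(np.zeros(degree + 1))
                continue
            x, y = sub[:, 0], sub[:, 1]
            desc.append(np.array([np.mean(x ** (degree - k) * y ** k)
                                  for k in range(degree + 1)]))
    return np.concatenate(desc)

def gamma_moments(Q, HP, degree=2):
    # Compare the four local descriptors of Q and of the transformed set H(P),
    # both computed over the square containing Q; linear in the point count.
    Q = np.asarray(Q, dtype=float)
    square = (Q.min(axis=0), Q.max(axis=0))
    return float(np.sum((moment_descriptor(Q, square, degree)
                         - moment_descriptor(HP, square, degree)) ** 2))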
3.4 Determine the Range for h31, h32
Ideally we would like to determine the range for h31, h32 automatically from the point sets. Unfortunately, we are not aware of any method that can accomplish this. However, since our method is geared towards vision applications, we propose to estimate the ranges empirically using two different methods. First, we gather many pairs of images, manually select correspondences, and compute the homographies; the range for h31, h32 can be determined directly from these empirical data. For the second method, we use the formula for homographies developed in [1] (page 327). For two cameras with camera matrices C1 = K[ I | 0 ] and C2 = K'[ R | t ], and a plane π with homogeneous coordinates [n^t d]^t, the homography matrix is given by the formula H = K'(R − t n^t / d)K^{-1}, where d is the distance between the plane π and the first camera. We can sample with respect to different matrices K, K' using different focal lengths and also different rotations. A further simplification can be made by assuming that d >> |t|, so that the homography matrix is approximated as H ≈ K'RK^{-1}; this gives the infinite homography between the two cameras [1]. In our simulation with 100,000 samples, we randomly sample over a range of focal lengths (from 200 to 1000) and rotation matrices R, while keeping the cameras' skew at zero and principal points at the origin. The simulation shows that less than 0.1% of the h31, h32 values have absolute values greater than one. This contrasts sharply with the results for other components such as h11, about 50% of which have magnitude greater than one. The simulation shows that for applications that require registration of point sets using homographies that are close to infinite homographies, the search range for h31, h32 can reasonably be set to be between −5 and 5.
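The simulation can be reproduced in outline as below; the exact sampling of the rotations used by the authors is not specified, so a uniformly random rotation (via QR of a Gaussian matrix) is assumed here, and the fraction of large |h31|, |h32| values can then be inspected directly.

import numpy as np

def sample_h31_h32(n=100000, f_range=(200.0, 1000.0), seed=0):
    # Sample two focal lengths and a rotation, form the infinite homography
    # H ~ K' R K^{-1} (zero skew, principal point at the origin), normalize so
    # that h33 = 1, and record (h31, h32).
    rng = np.random.default_rng(seed)
    out = np.empty((n, 2))
    for i in range(n):
        f1, f2 = rng.uniform(f_range[0], f_range[1], size=2)
        K = np.diag([f1, f1, 1.0])
        Kp = np.diag([f2, f2, 1.0])
        R, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # random orthogonal
        if np.linalg.det(R) < 0:
            R[:, 0] *= -1.0
        H = Kp @ R @ np.linalg.inv(K)
        H = H / H[2, 2]
        out[i] = H[2, 0], H[2, 1]
    return out

# e.g. h = sample_h31_h32(10000); np.mean(np.abs(h).max(axis=1) > 1.0)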
3.5 Refinement
Although the 1D reduction method proposed in Section 3.2 makes exhaustive homography estimation possible, it consumes a significant amount of time to obtain a very precise homography estimate. Therefore, we propose to use looser intervals of h31, h32 to obtain an approximate estimate first. The exhaustive homography estimation in this phase yields a new point set P' = H(P) which is reasonably close to the target point set Q. In the refinement phase, we take advantage of this geometric constraint and try to correspond a point q in Q only to points from P' in its local neighborhood. To build the correspondences, we first compute the normalized orientation histograms of points in Q and P'. These normalized histograms are used as local shape descriptors to compute the similarity between a point in Q and its local neighbors in P'. A normalized orientation histogram is computed as
follows. For a point p in a point set P, we apply Principal Component Analysis to its local neighborhood point set N (circles in Figure 2), which is a subset of P. The principal component vector c of the local neighborhood surrounding p gives us the major direction (green lines in Figure 2) in which the points are spread. We then build the orientation histogram based on the angle between the vector c and the vector p_j, where p_j is the vector formed by the point p and the j-th point in N. In our experiments, we used six bins for the orientation histogram. Building the histogram based on the angle between the vector p_j and the principal component, instead of the x-axis, provides a rotationally invariant measurement between points in Q and P'. As shown in Figure 2, straight lines do not provide discriminative information for estimating the homography. Points with a histogram entropy value e less than 0.8 are omitted to speed up the correspondence building process. The entropy is calculated as e = −Σ_{i=1}^{6} h(i) log(h(i)), where h is the normalized histogram.

After phase 1, the point set Q and P' = H(P) should be reasonably close. Therefore, the histogram of a point q in Q is only compared to those of its local neighboring points p_i from P'. The similarity of q and p_i is measured using the L2 distance between their histograms. A point p_i is said to correspond to q if i = arg min_i ||h_q − h_{p_i}||, where h_q and h_{p_i} are the histograms of q and p_i,
respectively. After obtaining a collection of correspondences, we utilize RANSAC to estimate the refined homography Hr. The matching cost function in the RANSAC process is the Hausdorff distance [21] between Q and P'.
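The descriptor and the entropy test can be sketched as follows; binning unsigned angles in [0, π] and the small numerical constants are our assumptions.

import numpy as np

def orientation_histogram(p, neighbors, bins=6):
    # Angles between the principal component c of the local neighborhood and
    # the vectors from p to each neighbor p_j, collected into six bins.
    V = np.asarray(neighbors, dtype=float) - np.asarray(p, dtype=float)
    _, vecs = np.linalg.eigh(V.T @ V)
    c = vecs[:, -1]                                      # principal direction
    cosang = (V @ c) / (np.linalg.norm(V, axis=1) + 1e-10)
    ang = np.arccos(np.clip(cosang, -1.0, 1.0))
    h, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi))
    h = h.astype(float)
    return h / max(h.sum(), 1.0)

def histogram_entropy(h, eps=1e-10):
    # e = -sum_i h(i) log h(i); points with e < 0.8 are skipped.
    return float(-np.sum(h * np.log(h + eps)))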
Fig. 2. The normalized orientation histogram of a point is built based on the angles between the vectors p_j and its principal component c (green lines). The two bar charts are the normalized orientation histograms for points A and B, respectively. Points with a low normalized histogram entropy value, point B for example, are omitted in the process of building correspondences.
4 Experiments and Results
We applied our proposed method to point sets collected from two different types of images:
Fig. 3. The original pairs of images are shown in the first two columns. Column 3 displays the comparison of the point sets of the two images. Column 4 shows the comparison of the transformed point set and the points of the images in the second column. Column 5 shows the blending of the transformed image of the first column and the image in the second column.
Planar Shapes: In the first set of experiments, we drew several shapes on a typical classroom whiteboard. A pair of images was taken for each shape with different camera viewpoints (and focal lengths as well). An adaptive thresholding method [22] is applied to segment the images, and the point sets are then randomly sampled from the foreground pixels of the resulting binary images. Three other pairs of images were taken from street scenes and are shown in the first three rows of Figure 4. The Canny edge detector is applied to compute edges, and point sets were then randomly sampled from the resulting edges.

Complex scene: Instead of gathering point sets from the contours of planar shapes or simple hand-drawn shapes, we also applied SIFT [2] to the graffiti images (Figure 3) to gather points. We adjusted the threshold values such that the numbers of SIFT points from the two images are similar. We then applied our algorithm to the gathered point sets to estimate the homography matrix.

We implemented the proposed algorithm in C++. We first search for the optimal h31 and h32 from −5 to 5 with Δd = 0.1 and Δθ = 1°. Once an optimal pair (h31, h32) has been found, we apply the refinement algorithm to the point sets Q and H(P). We define the local neighborhood to be a radius of 25 pixels (circles in Figure 2) around a point, both for calculating the orientation histogram and for building correspondences between Q and H(P).

4.1 Results
Our experiments were conducted on a computer with an Intel T7300 2.0 GHz Core 2 Duo processor. The average time taken for estimating the homography in phase 1 is 33.51 seconds. The fourth column of Figure 4 and Figure 3 shows the comparison of the point sets Q and H(P). The first row of Table 1 shows the errors (Hausdorff distance [21]) in pixels between Q and H(P) (with image size 800 × 600). The average time taken in phase 2, the refined homography estimation, is 43.20 seconds. The last column in Figure 4 shows the comparison of the point sets Q and Hr(H(P)), where Hr
Fig. 4. Results on the test point sets RCY, TRN, DSB, ARW, FOX, CAR, HUS, PCM, and FSH (one per row). The first two columns are the images from which the point sets are sampled. Column 3 shows the comparison of the point sets Q and P. Column 4 shows the results of the exhaustive homography estimation (Q and H(P)). Column 5 shows the results after the refinement step (Q and Hr(H(P))).
Table 1. The errors in pixels after phases 1 and 2 for the point sets listed in Figure 4

          RCY    TRN    DSB    ARW   FOX   CAR   HUS   PCM   FSH
phase 1   10.49  10.47  13.25  7.85  6.77  5.49  8.92  6.30  9.86
phase 2    4.26   5.25   6.93  1.30  1.72  2.76  3.00  1.60  2.29
is the refined homography estimate obtained in phase 2. These images clearly demonstrate that the proposed method in phase 2 is able to obtain a more accurate estimate. The results in the third row of Table 1 also suggest a significant improvement.
5 Conclusion and Future Work
We have proposed a novel registration algorithm for matching 2D point sets related by an unknown planar projective transform. From the two input point sets and the specified range for the two non-affine components of the homography matrix, the proposed algorithm defines a discrete three-dimensional space of homographies that can be searched exhaustively for the optimal homography transform. We also showed that a matching cost function can be constructed using local polynomial moments, and the resulting registration algorithm has linear time complexity with respect to the size of the point sets. To increase efficiency, we chose a slightly coarse search range for h in phase 1. The homography estimated in phase 1 was able to efficiently bring the two point sets reasonably close to each other. Utilizing the orientation histograms, we were able to identify feature points and to build the correspondences between Q and P'. The homography between these two point sets was then estimated using RANSAC. Preliminary experimental results are encouraging, and they have shown that the proposed algorithms are indeed capable of efficiently and accurately estimating the projective transform directly from the point sets without other extra inputs such as intensity values. We plan to further investigate the new direction we have proposed in this paper. In particular, we will investigate in more detail the sensitivity of the algorithm with respect to the sampling intervals (Δd, Δθ), as well as the possibility of a GPGPU implementation, since the code is highly parallelizable. However, the most pressing unresolved issue is the question of whether it is possible to determine, in some meaningful way, the range for the two non-affine components directly from the input point sets. Perhaps the infinite projective invariants proposed in [6] can provide some clues on how to solve this difficult problem.
References
1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003)
2. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Computer Vision 60, 91–110 (2004)
3. Nemeth, J., Domokos, C., Kato, Z.: Recovering planar homographies between 2D shapes. In: Proc. Int. Conf. on Computer Vision, pp. 889–892 (2009)
4. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395 (1981)
5. van Gool, L., Moons, T., Pauwels, E., Oosterlinck, A.: Vision and Lie's approach to invariance. Image and Vision Computing 13, 259–277 (1995)
6. Flusser, J., Suk, T.: Projective moment invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1364–1367 (2004)
7. Domokos, C., Kato, Z., Francos, J.M.: Parametric estimation of affine deformations of binary images. In: Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 889–892 (2008)
8. Baird, H.: Model-based image matching using location. MIT Press, Cambridge (1984)
9. Besl, P., Jain, R.: Three dimensional object recognition. ACM Computing Surveys 17, 75–145 (1985)
10. Grimson, E.: Object recognition by computer: The role of geometric constraints. MIT Press, Cambridge (1990)
11. Hinton, G., Williams, C., Revow, M.: Adaptive elastic models for hand-printed character recognition. In: Advances in Neural Information Systems, pp. 512–519 (1992)
12. Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 850–863 (1993)
13. Feldmar, J., Ayache, N.: Rigid, affine and locally affine registration of free-form surfaces. Int. J. Computer Vision 18, 99–119 (1996)
14. Metaxas, D., Koh, E., Badler, N.: Multi-level shape representation using global deformation and locally adaptive finite elements. Int. J. Computer Vision 25, 49–61 (1997)
15. Hofmann, T., Buhmann, J.: Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 1–14 (1997)
16. Cross, A., Hancock, E.: Graph matching with a dual-step EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1236–1253 (1998)
17. Gold, S., Rangarajan, A., Liu, C., Pappu, S., Mjolsness, E.: New algorithms for 2D and 3D point matching: pose estimation and correspondence. Pattern Recognition 31, 1019–1031 (1998)
18. Chui, H., Rangarajan, A.: A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding 89, 114–141 (2003)
19. Cao, Y., Miller, M., Winslow, R., Younes, L.: Large deformation diffeomorphic metric mapping of vector fields. IEEE Trans. on Med. Img. 24, 1216–1230 (2005)
20. Ho, J., Peter, A., Rangarajan, A., Yang, M.H.: An algebraic approach to affine registration of point sets. In: Proc. Int. Conf. on Computer Vision, pp. 1335–1340 (2009)
21. Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 850–863 (1993)
22. Ridler, T., Calvard, S.: Picture thresholding using an iterative selection method. SMC 8, 629–632 (1978)
Spatial-Temporal Motion Compensation Based Video Super Resolution Yaozu An, Yao Lu, and Ziye Yan Beijing Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China {bitanyz,vis yl,yanziye}@bit.edu.cn
Abstract. Due to the arbitrary motion patterns in practical video, annoying artifacts caused by registration error often appear in the super resolution outcome. This paper proposes a spatial-temporal motion compensation based super resolution fusion method (STMC) for video, applied after explicit motion estimation between a few neighboring frames. We first register the neighboring low resolution frames to proper positions in the high resolution frame, and then use the registered low resolution information as non-local redundancy to compensate the surrounding positions which have no or only a few registered low resolution pixels. Experimental results indicate that the proposed method can effectively reduce the artifacts caused by motion estimation error, with obvious performance improvement in both PSNR and visual effect. Keywords: Super resolution, Motion compensation, Non-Local, Video fusion.
1 Introduction
Super resolution aims to fuse a higher resolution image from several low-resolution images. It has been well studied in the past decades and its applications can be found in a broad range of image and video processing tasks such as aerial photography, medical imaging, video surveillance, etc. [1]. Video super resolution can be considered as an application and extension of super resolution; it aims at enhancing the quality of videos with better optical resolution. Because the typical resolution of modern display devices (large LCD-panel displays) is often much higher than that of videos, video super resolution has wide application prospects in HDTV display, internet video broadcasting and so on. Generally speaking, a typical dynamic video super resolution method involves three independent processes: motion estimation (registration), fusion and restoration. The accuracy of the motion estimation plays an important role in the reconstruction process. Once the estimated motion between frames is inaccurate, disturbing artifacts caused by the registration error often appear in the reconstructed outcome of classic super resolution when compared to the given measurements. For this reason, classic super resolution assumes
that the motion of pixels is consistent between frames, such as global translational motion or affine motion. Only under these simplified assumptions can highly accurate general motion estimation, known as optical flow, be achieved in the reconstruction process, making a usable super resolved image available. Due to the arbitrary motion patterns in practical video, general content videos are not likely to be handled well under the above simplified assumptions. In order to find accurate motion between video frames, many motion estimation approaches have been proposed in the literature. Assuming that there is independent object motion in video frames and that the pixels in a local area have a similar displacement, block matching based methods have been used to determine sub-pixel displacements between video frames in super resolution [2][4]. Schultz et al. propose a MAP-HMRF (Huber-Markov Random Field) based video reconstruction framework with a hierarchical sub-pixel motion estimation method determining the motion vectors between frames [2]. Jiang et al. adopt a quality measure based on the correlation coefficient between frames to revise the optical flow estimation errors and model the PSF via an elliptical weighted area filter [3]. Liu et al. propose a unified reconstruction framework for noisy video that achieves simultaneous denoising and super resolution [5]. Hung et al. propose a translational motion compensation model via frequency classification for video super-resolution systems and then fuse the high resolution frame [6]. Omer et al. first segment the reference frame into different regions using watershed segmentation, then adaptively weight each region according to the registration error and simultaneously estimate the regularization parameter for each region [7]. References [8][9] propose a high accuracy TV-L1 optical flow based on a quadratic relaxation scheme to estimate the relative motion between video frames, implement the complete algorithm in real time on the GPU using the CUDA framework, and reach high performance. Reference [10] extends the above TV-L1 optical flow method to the classical motion based super resolution framework. Reference [12] proposes a non-local iterative back projection algorithm for image enlargement by iteratively grouping the low resolution pixels and back-projecting the error. Nevertheless, searching for similar examples in the whole image sequence is infeasible in practice. In order to ensure the quality of the algorithm, a proper size of pixel neighborhoods (block size) and a proper searching range need to be selected empirically. In this paper, we propose a spatial-temporal motion compensation based super resolution fusion method (STMC) for video, using the information of a few low resolution frames surrounding the reference frame after the explicit motion is obtained. We first estimate the motion vectors from the forward and backward low resolution frames to the reference low resolution frame, then register the low resolution frames to proper positions in the high resolution frame. There may be many undesirable artifacts in the fused high resolution frame caused by the arbitrary motion patterns in practical video, such as newly entered pixels from behind objects or pixels lost from the surrounding frames to the reference frame. In order to resolve these problems, we use the non-local redundancy pixels from the registered frames around the ghost pixels to improve
the image fusion quality. The structure of the paper is organized as follows. In Section 2, the observation model and the cost function are briefly presented; in Section 3, we introduce the spatial-temporal motion compensation based video super resolution fusion (STMC) framework using a few frames; experimental results and discussion are presented in Section 4, and conclusions are drawn in Section 5.
2 Problem Description
Super resolution has been widely studied in the past decades and a universal and popular observation model has been used to represent the relation between the low and high resolution images. Assume that there are p observed low resolution video frames, each of size N1 × N2, and that yk represents the degraded frame at discrete time instant k. The deterministic high resolution frame Xt at discrete time instant t is of size LN1 × LN2, where L is the down-sampling factor. All vectors are ordered lexicographically and the linear image degradation model can be represented as [1]:

y_k = D B F_{(k,t)} X_t + n_k,    k, t = 1, 2, ..., p,    (1)

where n_k represents random zero-mean Gaussian noise with variance \sigma_k^2. The LN1N2 × LN1N2 matrices B and F_{(k,t)} model the linear space-invariant blurring operator and the warping operator from the t-th high resolution image to the k-th high resolution image. The N1N2 × LN1N2 matrix D models the decimation operator from the high resolution image to the measured low resolution image. We assume the operators D and B are identical across the video sequence. Resolving the deterministic high resolution frame in Eq. (1) is an inverse problem. The Maximum-Likelihood (ML) framework is a basic and important method to formulate this inverse problem and to find an optimized solution. Hence the inverse problem in Eq. (1) can be formulated as:

\hat{X}_t = \arg\min_{X_t} \left\{ \sum_{k=1}^{p} \| y_k - D B F_{(k,t)} X_t \|^2 \right\},    k, t = 1, 2, ..., p,    (2)

where \| y_k - D B F_{(k,t)} X_t \|^2 is the residual term of the k-th low resolution channel, which enforces faithfulness of the solution to the t-th high resolution frame.
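As a concrete illustration of the observation model in Eq. (1), the following sketch synthesises a low-resolution frame from a high-resolution one. It uses the blur, decimation and noise settings reported later in the experiments (3×3 uniform blur, 1:4 decimation, Gaussian noise with standard deviation 2) and, for simplicity, takes the warp F_(k,t) to be the identity; these simplifications are assumptions of the sketch, not part of the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def degrade(X_hr, L=4, noise_sigma=2.0, rng=None):
    """Synthesise a low-resolution frame y_k = D B X_t + n_k (Eq. (1)),
    with the warp F_(k,t) taken as identity for simplicity.

    X_hr : 2D float array, high-resolution frame.
    L    : decimation factor (D), 1:L along each axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    blurred = uniform_filter(X_hr, size=3)                 # B: 3x3 uniform blur
    decimated = blurred[::L, ::L]                          # D: keep every L-th pixel
    noisy = decimated + rng.normal(0.0, noise_sigma, decimated.shape)  # n_k
    return noisy
```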
3 Proposed Algorithm
Generally speaking, a typical dynamic video super resolution method involves three independent processes: motion estimation (registration), fusion and restoration. We first separately estimate the motion between the frames using the high accuracy TV-L1 optic flow algorithm introduced in [8] and then we use the spatial-temporal motion compensation based super resolution fusion (STMC) method to reconstruct a high resolution frame from the low resolution frames as the initial estimation for the super resolution process. There are no or a few
registered low resolution pixels in the desired high resolution image when not enough low resolution images are used in the super resolution process. Due to the arbitrary motion in video, some low resolution pixels may also be registered to wrong locations. In order to overcome ghost artifacts in the fusion step, encouraged by the very successful Non-Local-Means denoising method, we use non-local pixels from the registered low resolution frames to compensate a high resolution pixel. In other words, we use the registered low resolution pixels from the neighboring frames as non-local redundancy to compensate their surrounding pixels in the high resolution image.
3.1 Registration by TV-L1 Optic Flow
The major difficulty in applying the above super resolution approach is that the motion between frames F_{(k,t)} is generally unknown. Before fusing the low resolution frames, we first estimate the motion between frames using the high accuracy TV-L1 optic flow algorithm introduced in [8], and then we use the motion matrices to fuse the initial high resolution frame. The TV-L1 optical flow algorithm, based on a quadratic relaxation scheme, estimates the relative motion between video frames and can be implemented in real time on the GPU using the CUDA framework, reaching high performance. The details of the TV-L1 optic flow algorithm can be found in reference [8]. Fig. 1 shows the motion estimation result of the optical flow method between two neighboring frames in low resolution. We can see that the estimated motion is rough in the low resolution frames, which carry insufficient information, especially at the edges of the moving objects. Considering the arbitrariness of the motion in general video and the roughness of motion estimation in low resolution, we model the fusion of the registered low resolution video frames in Fig. 2.
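A minimal sketch of this registration step is shown below, assuming an OpenCV build that exposes the contributed TV-L1 implementation (cv2.optflow.createOptFlow_DualTVL1; in older OpenCV releases the factory lives in the main cv2 module). The algorithm of [8] is not reimplemented here.

```python
import cv2

def tvl1_flow(prev_gray, next_gray):
    """Dense TV-L1 optical flow between two 8-bit grayscale frames.

    Assumes an opencv-contrib-python build exposing the cv2.optflow module.
    Returns an (H, W, 2) float32 array of per-pixel (dx, dy) displacements.
    """
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()
    flow = tvl1.calc(prev_gray, next_gray, None)
    return flow

# Example: flow from a neighboring frame k to the reference frame t
# flow_kt = tvl1_flow(frame_k_gray, frame_t_gray)
```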
Fig. 1. The motion between two neighboring low resolution frames is estimated by the TV-L1 optic flow algorithm
Fig. 2. Surrounding low resolution frames are fused in high resolution. The red points are low resolution pixels from the reference frame with accurate registration. The pixels from neighboring frames (the green and blue points) are distributed inconsistently, caused by the arbitrary motion patterns. The black points, which do not have enough registered pixels, need to be compensated using the non-local information.
We can see that the motion of the low resolution pixels is inconsistent and their distribution in the high resolution frame is not uniform. In some locations, there are no registered pixels in the surroundings (the black points). In that case, artifacts will inevitably appear in the reconstructed high resolution image. As described in [11][12], natural images often contain repeated patterns and structures. In order to improve the quality of the results, we can employ the non-local registered pixels around such locations to compensate the pixels that lack sufficient information.
3.2 Spatial-Temporal Motion Compensation Based Super Resolution Fusion Framework (STMC)
After the explicit motion estimation between the video frames, the high resolution video frame fusion method is introduced in this section using the spatial-temporal motion compensation from the registered frames. At first, we focus on the Maximum-Likelihood (ML) function in the fusion step and rewrite it into a pixel-wise high resolution fusion equation. The derivative of the fusion function, which is a linear sum of terms, is given by

\frac{\partial f(\hat{X}_t)}{\partial X_t} = \sum_{k=1}^{p} F_{(k,t)}^T B^T D^T (y_k - D B F_{(k,t)} X_t),    k, t = 1, 2, ..., p.    (3)

We simplify the above equation and obtain a ratio of sums of diagonal matrices that expresses the fusion intuitively:

\hat{X}_t = \frac{\sum_{k=1}^{p} F_{(k,t)}^T B^T D^T y_k}{\sum_{k=1}^{p} F_{(k,t)}^T B^T D^T D B F_{(k,t)}},    k, t = 1, 2, ..., p.    (4)
In the above equation, the term \sum_{k=1}^{p} F_{(k,t)}^T B^T D^T y_k represents the sum of the pixels in high resolution obtained by interpolating the k-th low resolution frame using zero-padding and putting it back into the proper location estimated by the motion matrix. The term \sum_{k=1}^{p} F_{(k,t)}^T B^T D^T D B F_{(k,t)} counts the number of low resolution pixels grouped into the proper positions in high resolution. In order to get an intuitive pixel-wise fusion equation, denote X(i, j) = U_k(Lm + dx, Ln + dy), where U_k is the interpolation of the k-th low resolution frame y_k by zero-filling, L is the scaling factor in super resolution, and (dx, dy) is the motion estimate of the pixel y_k(m, n). Hereafter, we can further simplify the fusion equation to a pixel-wise version as

\hat{X}_t(i, j) = \frac{\sum_{k=1}^{p} \sum y_k(m, n)}{\sum_{k=1}^{p} \sum I_k(i, j)},    k, t = 1, 2, ..., p,    (5)

where the relation between locations is (i = Lm + dx, j = Ln + dy), and \sum_{k=1}^{p} \sum I_k(i, j) is the number of pixels registered to the position (i, j) from the low resolution frames. Because of the arbitrary motion in general videos, accurate sub-pixel motion estimation is difficult and computationally expensive, and registration error arises inevitably. In addition to the inconsistent distribution in high resolution of the low resolution pixels from the neighboring frames, the distortion artifacts inevitably affect the quality of the super resolved result. This problem can be alleviated, at high computational cost, when enough low resolution frames are used in fusion. As shown in Fig. 2, in some positions of the high resolution fusion frame there are no registered pixels around, or only a few. We are encouraged by the methods in references [11][12], which generalize the very successful Non-Local-Means (NLM) [11] denoising method to perform super resolution by grouping pixels in a non-local area and incorporating the non-local redundancy. To improve the situation where not enough registered low resolution pixels surround some positions, the non-local registered low resolution pixels can be grouped to improve the frame quality. We use a block matching method to compute the distance between the patches around the compared pixels as the measure of similarity. Let Y_k(Lm, Ln) be the initial interpolation (e.g. Bicubic, Lanczos2) of the k-th low resolution frame in the video and (m, n) be the location in the corresponding low resolution frame. X denotes the initial interpolation of the reference frame and P_Z(i, j) is the patch around pixel (i, j). Let P_k(m, n) represent the patch around pixel Y_k(Lm, Ln) corresponding to the low resolution pixel y_k(m, n). So the pixel-wise fusion function can be rewritten as

\hat{X}_t(i, j) = \frac{\sum_{k=1}^{p} \sum w_k(m, n) y_k(m, n)}{\sum_{k=1}^{p} \sum w_k(m, n)},    k, t = 1, 2, ..., p,    (6)

where (Lm + dx, Ln + dy) ∈ N(i, j) means that the low resolution pixel y_k(m, n) can be used to estimate the high resolution pixel X_t(i, j), and N(i, j) is the neighborhood or searching window around the pixel (i, j). As in the non-local means based denoising and super resolution methods, we compute the weight w_k(m, n) using the exponential function
w_k(m, n) = \exp\left(-\frac{\| P_k(m, n) - P_Z(i, j) \|^2}{2\sigma^2}\right),    k, t = 1, 2, ..., p.    (7)

In the following step, the gradient descent optimization algorithm can be used to reconstruct the high-resolution frame by minimizing the cost function in Eq. (2). The expected high resolution frame X_t can be updated iteratively using IBP (Iterative Back-Projection) or NLIBP (Non-local IBP) [12], beginning with an initial estimate fused by the spatial-temporal motion compensation based super resolution method (STMC).
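The fusion of Eqs. (5)-(7) can be sketched as below. The sketch folds the registered pixel and its L×L non-local compensation into a single weighted accumulation, uses a crude zero-order upscaling in place of bicubic/Lanczos interpolation for the patch comparison, and assumes the flow is given in low-resolution pixel units; the patch size and σ are illustrative, so this is a simplified illustration rather than the authors' exact implementation.

```python
import numpy as np

def stmc_fuse(lr_frames, flows, L=4, patch=7, sigma=10.0):
    """Simplified STMC fusion following Eqs. (5)-(7).

    lr_frames : list of p (H, W) float arrays, reference frame first.
    flows     : list of p (H, W, 2) flow fields (dx, dy) mapping each LR frame
                to the reference LR frame (zeros for the reference itself).
    L         : upscaling factor; the compensation area is taken as L x L.
    """
    H, W = lr_frames[0].shape
    Hh, Wh = H * L, W * L
    half, r = patch // 2, L // 2
    num = np.zeros((Hh, Wh))            # numerator of Eq. (6)
    den = np.zeros((Hh, Wh))            # sum of weights w_k

    up_ref = np.kron(lr_frames[0], np.ones((L, L)))    # crude interpolation of X
    pad_ref = np.pad(up_ref, half, mode='edge')

    for y_lr, flow in zip(lr_frames, flows):
        up_k = np.kron(y_lr, np.ones((L, L)))          # Y_k, zero-order upscale
        pad_k = np.pad(up_k, half, mode='edge')
        for m in range(H):
            for n in range(W):
                dx, dy = flow[m, n]
                i = int(round(L * (m + dy)))           # registered HR row
                j = int(round(L * (n + dx)))           # registered HR column
                if not (0 <= i < Hh and 0 <= j < Wh):
                    continue
                Pk = pad_k[L * m:L * m + patch, L * n:L * n + patch]   # P_k(m, n)
                for ii in range(max(i - r, 0), min(i + r + 1, Hh)):
                    for jj in range(max(j - r, 0), min(j + r + 1, Wh)):
                        Pz = pad_ref[ii:ii + patch, jj:jj + patch]     # P_Z(i, j)
                        w = np.exp(-np.sum((Pk - Pz) ** 2) / (2 * sigma ** 2))
                        num[ii, jj] += w * y_lr[m, n]
                        den[ii, jj] += w
    den[den == 0] = 1.0
    return num / den
```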
4 Experimental Results and Discussion
In this section, we demonstrate the potential of the proposed spatial-temporal motion compensation based algorithm by presenting the results for video sequences with general motion patterns. The original YUV video sequences for testing are obtained from the website http://trace.eas.asu.edu/yuv/index.html. In order to make the experiment more representative, we choose three videos ("Bus", "Foreman" and "Mother-daughter") with different scenes. The performance of the proposed algorithms is quantitatively evaluated by measuring the PSNR (Peak Signal-to-Noise Ratio). It is defined by

PSNR = 10 \log_{10} \frac{255^2 \times M}{\| \hat{X} - X \|^2},    (8)

where M is the total number of pixels in the high-resolution image, and \hat{X} and X are the estimated high resolution image and the original image respectively.

Table 1. The average PSNR of the videos

PSNR          Bus×4    Foreman×4    Mother-daughter×4
Bicubic       19.24    25.67        27.40
NLIBP [12]    20.93    29.41        30.14
STMC          21.20    29.91        30.36
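For reference, a minimal helper matching Eq. (8), assuming pixel values in the 0-255 range:

```python
import numpy as np

def psnr(x_hat, x):
    """PSNR between estimated and original HR frames, following Eq. (8)."""
    mse = np.mean((np.asarray(x_hat, float) - np.asarray(x, float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```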
In order to further evaluate our proposed spatial temporal motion compensation super resolution fusion method (STMC), we choose the bicubic interpolation method and NLIBP in [12] for comparison. The search area in NLIBP[12] is 21 × 21, and the patch size is 7 × 7. In the simulation experiment, the test sequences from videos “Bus”, “Foreman” and “Mother-daughter” are first blurred using a 3×3 uniform mask, decimated by 1:4 each axis and then contaminated by additive white zero-mean Gaussian noise with standard deviation 2. We choose the first 10 frames from “Bus”, “Foreman” and “Mother-daughter” to generate the degraded sequence, and we just use 5 frames (the forward two frames, the reference, and the backward two frames) to reconstruct the reference frame in the proposed algorithm. In our experiments, motion compensation area is L × L
Fig. 3. The super resolved result of the video “Bus”. From left to right: the 3rd frame, the 5th frame and the 7th frame. From top to bottom: The low resolution, The high resolution, Bicubic, NLIBP [12], Our proposed STMC.
(resampling factor), and the patch size is also 7×7. Fig. 3, Fig. 4 and Fig. 5 present the obtained results for the videos "Bus", "Foreman" and "Mother-daughter". Table 1 presents the average PSNR of the video super resolution results. From the comparison, when only a few low resolution frames are used for super resolution, the proposed method outperforms bicubic interpolation by 3-4 dB in PSNR. Compared with NLIBP [12], the proposed STMC method, which fuses 5 neighboring frames, can obtain clearer results with higher PSNR. In the experiment, using the registered information, the STMC method
Fig. 4. The super resolved result of the video “Foreman”. From left to right: the 3rd frame, the 5th frame and the 7th frame. From top to bottom: The low resolution, The high resolution, Bicubic, NLIBP [12], Our proposed STMC.
Fig. 5. The super resolved result of the video “Mother-daughter”. From left to right: the 3rd frame, the 5th frame and the 7th frame. From top to bottom: The low resolution, The high resolution, Bicubic, NLIBP [12], Our proposed STMC.
can compensate for the registration error effectively, so the results are better than NLIBP not only in visual effect but also in PSNR.
5 Conclusion
In this paper, we have presented a spatial-temporal motion compensation based super resolution fusion algorithm (STMC) to reconstruct the practical video with arbitrary motion patterns. Before super resolution, we estimate the explicit
motion between video frames by the high accuracy TV-L1 optic flow algorithm. We then register the surrounding low resolution frames to their proper locations in high resolution, and use the registered information as non-local redundancy to improve the pixels in high resolution that lack enough registered low resolution pixels. The experiments show that the proposed method can effectively reduce the artifacts in the super resolved result. Acknowledgement. This paper is partially supported by the Beijing Key Discipline Construction Program.
References
1. Park, S.C., Park, M.K., Kang, M.G.: Super-resolution image reconstruction: A technical overview. IEEE Signal Processing Magazine 20, 21–36 (2003)
2. Schultz, R.R., Stevenson, R.L.: Extraction of high resolution frames from video sequences. IEEE Transactions on Signal Processing 5, 996–1011 (1996)
3. Jiang, Z., Wong, T., Bao, H.: Practical super-resolution from dynamic video sequences. In: CVPR 2003, vol. 2, pp. 549–554 (2003)
4. Simonyan, K., Grishin, S., Vatolin, D., Popov, D.: Fast video super-resolution via classification. In: ICIP 2008, pp. 349–352 (2008)
5. Liu, F., Wang, J., Zhu, S., Gleicher, M., Gong, Y.: Noisy video super-resolution. In: ACM Multimedia 2008, pp. 713–716 (2008)
6. Hung, K.W., Siu, W.C.: New motion compensation model via frequency classification for fast video super-resolution. In: ICIP 2009, pp. 1193–1196 (2009)
7. Omer, O.A., Tanaka, T.: Region-based weighted-norm approach to video super-resolution with adaptive regularization. In: ICASSP 2009, pp. 833–836 (2009)
8. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
9. Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow. In: Statistical and Geometrical Approaches to Visual Motion Analysis, pp. 23–45 (2008)
10. Mitzel, D., Pock, T., Schoenemann, T., Cremers, D.: Video super resolution using duality based TV-L1 optical flow. In: Denzler, J., Notni, G., Süße, H. (eds.) Pattern Recognition. LNCS, vol. 5748, pp. 432–441. Springer, Heidelberg (2009)
11. Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: CVPR 2005, vol. 2, pp. 60–65 (2005)
12. Dong, W., Zhang, L., Shi, G., Wu, X.: Nonlocal back-projection for adaptive image enlargement. In: ICIP 2009, pp. 349–352 (2009)
Learning Rare Behaviours Jian Li, Timothy M. Hospedales, Shaogang Gong, and Tao Xiang School of EECS, Queen Mary University of London Mile End Road, London, E1 4NS, UK {jianli,tmh,sgg,txiang}@dcs.qmul.ac.uk
Abstract. We present a novel approach to detect and classify rare behaviours which are visually subtle and occur sparsely in the presence of overwhelming typical behaviours. We treat this as a weakly supervised classification problem and propose a novel topic model: Multi-Class Delta Latent Dirichlet Allocation which learns to model rare behaviours from a few weakly labelled videos as well as typical behaviours from uninteresting videos by collaboratively sharing features among all classes of footage. The learned model is able to accurately classify unseen data. We further explore a novel method for detecting unknown rare behaviours in unseen data by synthesising new plausible topics to hypothesise any potential behavioural conflicts. Extensive validation using both simulated and real-world CCTV video data demonstrates the superior performance of the proposed framework compared to conventional unsupervised detection and supervised classification approaches.
1 Introduction
Detecting rare behaviours in videos of dynamic scenes is important for video behaviour analysis. Existing studies usually treat this problem either as an outlier detection problem, flagging any behaviour which is badly explained by a model representing typical behaviours [1–5]; or as a supervised classification problem when examples of rare behaviours are known and available for training an explicit model of rare versus typical behaviours [6, 7]. For both, the real challenge in model building is that visual evidence for rare behaviours is subtle and under-represented, which makes them neither distinctive enough to be detected among many concurrently occurring typical behaviours, nor repeated enough to be modelled precisely. The examples shown in Fig. 1 illustrate the difficulties of this problem. Two rare single-object behaviours (rare turns) are highlighted by red arrows, surrounded and overwhelmed by typical behaviours (yellow arrows) of dozens of other objects. In addition, both these rare turns are short in duration, with few frames containing the crucial motion features that distinguish them from typical behaviours. Moreover, as rare behaviours, very few samples are available, making it difficult to model them reliably. All these issues cause both existing supervised and unsupervised detection models to fail. In this work, we propose a novel approach to solve the problem of rare behaviour modelling and detection by treating it as a weakly supervised classification problem. In particular, we wish to exploit weak labelling of video footage as
Fig. 1. Rare behaviours (red) may have weak visual evidence and usually co-occur with other typical behaviours (yellow)
additional information about any rare behaviours. For example, whilst the majority of daily CCTV recordings may be labelled as ‘uninteresting’ by default, a few clips may also be labelled as being interesting, containing rare occurrences of some interesting behaviours such as ‘U-turn’, ‘Jay-Walking’ and ‘tailgating’ but without exact information about where and what are the triggering rare behaviours. The challenge is how to quantify and utilise such weak labelling information, which is mostly qualitative (or binary), in order to overcome underrepresentation of those rare behaviours in model learning. To that end, given a set of such weakly labelled video footage, we introduce a novel Multi-Class Delta Latent Dirichlet Allocation (MC-ΔLDA) model, which uses weak supervision to jointly learn a model for both typical and rare behaviours with a partially shared representation. Online behaviour classification is performed by inference in a learned MC-ΔLDA model, which requires accurately computing marginal likelihoods. To this end, we formulate an importance sampler to estimate the marginal likelihood using a variational mean field approximation to the optimal proposal. Our framework exploits the benefits of both robust generative topic modelling and supervised classification. Importantly, this is without imposing the cost of and assuming consistency in fully supervised labelling. The proposed model permits us to learn to detect and classify different types of rare behaviours, even when they are visually subtle and buried in cluttering typical behaviours. In contrast to supervised discriminative classifiers, our method retains the generative model potential to detect completely new and unseen rare behaviours as outliers due to their low likelihood – but with the already identified caveats of subtle and buried rare behaviours being challenging to detect as outliers. As a second contribution, we show how our framework can alternatively exploit very general prior knowledge about a given scene to predict unseen rare behaviours, and therefore detect them within the same MC-ΔLDA classification framework. 1.1
Related Work
Rare behaviour detection is typically formulated as outlier detection against a learned generative model of typical behaviours. For example, previous efforts have learned a dynamic Bayesian network (DBN) [1] or Gaussian mixture model (GMM) [2] from unlabelled video data, and threshold the likelihood of test data under these models to detect rare behaviour. These models however suffer from lack of robustness to noise and clutter in crowded scenes. Recently a number
of studies on modelling complex scenes have instead exploited statistical topic models such as Latent Dirichlet Allocation (LDA) [8] to cluster low level features into topics – or activities in video – for robust modelling of behaviour [3–5]. Topic models aim to automatically discover meaningful activities via co-occurrence of visual words. They can deal with simultaneous interaction of multiple objects or activities. Despite these advantages, all unsupervised topic models [3–5] for rare behaviour detection must compute how well new examples can be explained by the learned typical behaviour model. This exposes their key limitation: unsupervised topic models are only sensitive to rare behaviours which are visually very distinct from the majority. That is, visually subtle rare behaviours (i.e. sharing substantial visual words of typical behaviours) are usually overwhelmed (explained away) by the more obvious and common behaviours, and therefore cannot be detected. Moreover, all outlier detection based approaches [1–5] have no mechanism to classify different types of rare behaviours. Alternatively, supervised classification methods have also been applied to classify behaviours given sufficient and unbiased labelled training data [6, 7]. However, they are intrinsically limited in modelling rare behaviours due to underrepresented rare data, making it difficult to build a good decision boundary. More importantly, even in a video clip with rare behaviours, the vast majority of observations are typical. This means that for a supervised classifier, the majority of video features in the clip are irrelevant without very specific, expensive and hard-wired supervision. In practice, weakly-supervised and in particular multiinstance learning (MIL) [9, 10] is preferred. Different from existing MIL methods which are based on discriminative models such as SVM [10] or boosting [9], our MC-ΔLDA is a generative model which retains the robustness of generative topic modelling whilst exploiting weak supervision (cheap and scalable) for both rare and typical behaviour classification in a single framework. MC-ΔLDA intrinsically differs from Supervised Topic Models (sLDA) [11] in that it uses weak supervision to control the existence of a particular topic in a video clip instead of controlling how all topics are mixed, as in sLDA. More importantly, MC-ΔLDA is a classifier whereas sLDA was used for regression and generalising it to classification is non-trivial. Our algorithm is able to learn joint models of both typical and rare behaviours even if they co-exist. MC-ΔLDA is a generalisation and completion of the ΔLDA model proposed in [12]. ΔLDA was used for understanding code bugs in computer programs, but without an inference framework for the labels of unseen documents, ΔLDA cannot be used for classification. We extend ΔLDA (two-class only) by generalising it to MCΔLDA, which can jointly learn multi-class rare behaviours. Moreover, we complete the model by formulating an efficient and accurate inference algorithm to enable MC-ΔLDA to classify known rare behaviours in unseen data. We further explore MC-ΔLDA for detecting completely unknown rare behaviours through synthesising topics. Learning to classify by synthesis has been studied before for learning outlier detectors using classification models [13, 14]. However, such models synthesise without constraint, so predicted instances can be invalid and also risk double-counting evidence. Our MC-ΔLDA differs significantly in that it
explores constrained synthesis under prior contextual knowledge in order to generate hypotheses that can best explain unknown but plausible rare behaviours. We extensively evaluate MC-ΔLDA using video data captured from two busy traffic scenes, and show its superiority for both classification and detection of known and unknown rare behaviours compared to existing supervised and unsupervised models.
2 Classify and Detect Rare Behaviours by MC-ΔLDA
2.1 Video Clip Representation
Training a topic model requires a bag of words representation of video. To create visual words, we first segment a scene into Na × Nb equal size cells. Within each cell, we compute optical flow vectors for each pixel and the average vector for the cell. We quantise the cell motion into one of Nm directions. This results in a codebook of size Nw = Na × Nb × Nm. To create visual documents, we temporally segment a video into Nd non-overlapping clips, and a document consists of the set of quantised words in that clip. In this work, a corpus consists of Nd documents D = \{x_j\}_{j=1}^{N_d}, each of which is a bag of N_j words x_j = \{x_{j,i}\}_{i=1}^{N_j}.
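A small sketch of this representation step is given below: it converts per-frame dense flow fields for one clip into the Nw = Na × Nb × Nm bag-of-words count vector. The flow fields are assumed to be computed elsewhere, and the magnitude threshold for ignoring near-static cells is an illustrative choice not specified in the paper.

```python
import numpy as np

def clip_to_words(flows, Na=72, Nb=48, Nm=4, mag_thresh=0.5):
    """Convert a clip's dense flow fields into a bag-of-words count vector.

    flows : list of (H, W, 2) per-frame optical flow arrays for the clip.
    Returns a length Na * Nb * Nm count vector.
    """
    counts = np.zeros(Na * Nb * Nm, dtype=int)
    for flow in flows:
        H, W, _ = flow.shape
        ch, cw = H // Nb, W // Na                     # cell height / width
        for b in range(Nb):
            for a in range(Na):
                cell = flow[b * ch:(b + 1) * ch, a * cw:(a + 1) * cw]
                v = cell.reshape(-1, 2).mean(axis=0)  # average flow vector
                if np.hypot(v[0], v[1]) < mag_thresh:
                    continue                          # skip near-static cells
                ang = np.arctan2(v[1], v[0]) % (2 * np.pi)
                d = int(ang // (2 * np.pi / Nm)) % Nm # quantised direction bin
                counts[(b * Na + a) * Nm + d] += 1
    return counts
```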
2.2 Multi-Class Delta Latent Dirichlet Allocation (MC-ΔLDA)
Standard topic models such as LDA [8] (Fig. 2 (a)) model each instance (clip) as being derived from a bag of topics drawn from a fixed (and usually uniform) set of proportions. In MC-ΔLDA, we wish to constrain the topic proportions non-uniformly and on a per-clip basis. This will enable some topics to be shared (perpetually ongoing typical behaviours) and some to be uniquely associated with particular interesting classes, thereby representing the unique aspects of that behaviour. The new model structure is shown in Fig. 2 (b). Each clip x_j is assumed to be represented by a bag of words drawn from a mixture of N_t = \sum_{c=0}^{N_c} N_{t,c} possible topics, where N_{t,c} is the number of topics allocated to behaviour c. Let T_c be the N_{t,c}-element list of topics for behaviour c. Each clip is assumed to belong to a specific behaviour class c_j ∈ \{0, 1, ..., c, ..., N_c\}. c_j = 0 indicates that clip x_j contains only typical behaviours; and c_j = 1, ..., c, ..., N_c indicates that clip x_j contains both typical behaviours and also a type c rare behaviour
Fig. 2. LDA vs. MC-ΔLDA model structures: (a) LDA; (b) MC-ΔLDA.
(in unknown proportions and at unknown locations). This is enforced by defining a class-specific prior over topics \alpha^c = \{\alpha^c_t\}_{t=1}^{N_t}, where elements t ∉ T_0 ∪ T_c are constrained to be zero. In this study we set all non-zero elements of \alpha^c to 0.5, and β is learned with Gibbs-expectation maximization [15]. The generative process of MC-ΔLDA is summarized as follows:
1. Draw a Dirichlet word-topic distribution \phi_t ∼ Dir(β) for every topic t;
2. For each document j:
(a) Draw a class label c_j ∼ Multi(ε);
(b) Given label c_j, draw a constrained topic distribution \theta_j ∼ Dir(\alpha^{c_j});
(c) Draw a topic y_{j,i} for each word i from the multinomial y_{j,i} ∼ Multi(\theta_j);
(d) Sample a word x_{j,i} according to x_{j,i} ∼ Multi(\phi_{y_{j,i}}).
During the training phase, we assume weak supervision is provided in the form of labels c_j, and the goal is to determine the topics for each clip (i.e. \theta_j and y_j) and their visual representation (i.e. Φ), including ongoing typical behaviours and unique rare behaviours. In the testing phase, the class label of an unseen document x^* is not available and our objective is to infer its class, p(c|x^*).
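The generative process above can be simulated directly; the following sketch does so with illustrative dimensions, where class-specific priors with zero entries switch off the topics outside T_0 ∪ T_c.

```python
import numpy as np

def sample_corpus(Phi, alphas, eps, n_docs=100, doc_len=200, rng=None):
    """Sample documents from the MC-ΔLDA generative process.

    Phi    : (Nt, Nw) word distributions, one row per topic (phi_t).
    alphas : (Nc+1, Nt) class-specific Dirichlet priors; zero entries switch
             off the topics not in T_0 ∪ T_c for that class.
    eps    : (Nc+1,) class prior.
    """
    rng = np.random.default_rng() if rng is None else rng
    docs, labels = [], []
    for _ in range(n_docs):
        c = rng.choice(len(eps), p=eps)                      # class label c_j
        alpha = alphas[c]
        active = alpha > 0                                   # topics allowed for c
        theta = np.zeros(len(alpha))
        theta[active] = rng.dirichlet(alpha[active])         # constrained theta_j
        y = rng.choice(len(alpha), size=doc_len, p=theta)    # topic per word
        x = np.array([rng.choice(Phi.shape[1], p=Phi[t]) for t in y])
        docs.append(x)
        labels.append(c)
    return docs, labels
```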
2.3 Learning: Modelling Rare Behaviours with Weak Supervision
The full joint probability of a document j in MC-ΔLDA is

p(x_j, y_j, \theta_j, Φ, c | α, β, ε) = \left[ \prod_i p(x_{j,i} | y_{j,i}, Φ) \, p(y_{j,i} | \theta_j) \right] p(\theta_j | α, c) \, p(c | ε) \, p(Φ | β).    (1)
As for standard LDA, exact learning in our model is intractable. However, a collapsed Gibbs sampler can be derived to sample the topic posterior p(y | x, c, α, β) (now additionally conditioned on the current class c), leading to the update

p(y_{j,i} | y_{j,-i}, x, c, α, β) ∝ \frac{n^{-i}_{x,y} + β}{\sum_x (n^{-i}_{x,y} + β)} \cdot \frac{n^{-i}_{y,d} + \alpha^{c_j}_y}{\sum_y (n^{-i}_{y,d} + \alpha^{c_j}_y)}.    (2)

Here y_{j,-i} indicates all topics except the token i; n^{-i}_{x,y} indicates the count of topic y being assigned to word x, excluding the current item i; and n^{-i}_{y,d} indicates the count of topic y occurring in the current document d. These counts are also used to point estimate the Dirichlet parameters θ and Φ by their mean, e.g.

\hat{Φ}_{x,y} = \frac{n_{x,y} + β}{\sum_x (n_{x,y} + β)}.    (3)
Note that in contrast to standard LDA, the class-constrained priors \alpha^{c_j} in Eq. (2) enforce that the topic posterior for y_{j,i} (for all words i in clip j) is zero except for topics permitted by the current class c_j (T_0 ∪ T_{c_j}). Topics T_{c=0} will be well constrained by the abundant typical data. However, all clips of some interesting class c > 0 may use unique extra topics T_c in their representation. These will therefore come to represent the unique aspects of interesting class c.
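A compact sketch of this learning step is shown below: one sweep of the collapsed Gibbs update of Eq. (2) with class-constrained priors, plus the point estimate of Eq. (3). Array shapes and the count bookkeeping are illustrative, and the hyperparameter estimation of β [15] is omitted.

```python
import numpy as np

def gibbs_pass(docs, labels, y, nxy, nyd, alphas, beta, rng):
    """One sweep of the collapsed Gibbs sampler (Eq. (2)).

    docs   : list of word-index arrays; labels : class c_j per document.
    y      : list of current topic assignments, same shapes as docs.
    nxy    : (Nw, Nt) word-topic counts;  nyd : (Ndocs, Nt) doc-topic counts.
    alphas : (Nc+1, Nt) class-constrained priors; beta : scalar smoothing.
    """
    Nw, Nt = nxy.shape
    for j, (x, c) in enumerate(zip(docs, labels)):
        alpha = alphas[c]
        for i, w in enumerate(x):
            t_old = y[j][i]
            nxy[w, t_old] -= 1                      # remove current token
            nyd[j, t_old] -= 1
            # Eq. (2): word term x document term; zero for disallowed topics
            p = (nxy[w] + beta) / (nxy.sum(axis=0) + Nw * beta)
            p = p * (nyd[j] + alpha) * (alpha > 0)
            p = p / p.sum()
            t_new = rng.choice(Nt, p=p)
            y[j][i] = t_new
            nxy[w, t_new] += 1                      # add token back
            nyd[j, t_new] += 1

def estimate_phi(nxy, beta):
    """Point estimate of Phi (Eq. (3)); column y holds the word distribution of topic y."""
    return (nxy + beta) / (nxy.sum(axis=0, keepdims=True) + nxy.shape[0] * beta)
```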
2.4 Inference: Classification of Known Rare Behaviours
The learning problem formulated above exploited labelled behaviours c. For online classification of new/unseen test data x^*, the goal is to estimate c = \arg\max_c p(c | x^*, D, α, β, ε), where

p(c | x^*, D, α, β, ε) ∝ p(x^* | D, c, α, β) \, p(c | ε),    (4)

p(x^* | D, c, α, β, ε) = \int \sum_y p(x^* | y, Φ) \, p(y | θ) \, p(Φ | D) \, p(θ | α, c) \, dθ \, dΦ.    (5)

The challenge is then to accurately estimate the marginal likelihood in Eq. (5). Firstly, we approximate p(Φ | D) = δ_{\hat{Φ}}(Φ), taking Φ to be fixed at the single point estimate \hat{Φ} from the final Gibbs sample during learning (Eq. (3)). We point out that reliably computing the marginal likelihood in topic models is an active research area [16, 17] due to the intractable sum over correlated y in Eq. (5). We take the view of [16, 17] and define an importance sampling approximation

p(x^* | c) ≈ \frac{1}{S} \sum_s \frac{p(x^*, y_s | c)}{q(y_s | c)},    y_s ∼ q(y | c),    (6)

where we drop conditioning on the parameters for clarity. Different choices of proposal q(y | c) induce different estimation algorithms. The optimal proposal q_o(y | c) is proportional to p(x^*, y | c), which is unknown. We can however apply the variational mean field approximation q_{mf}(y | c) = \prod_i q_i(y_i | c), with minimal Kullback-Leibler divergence to the optimal proposal, by iterating

q_i(y_i | c) ∝ \left( \alpha^c_{y_i} + \sum_{l \neq i} q_l(y_l | c) \right) \hat{Φ}_{x_i, y_i}.    (7)
Note that this importance sampling proposal (Eq. (7)) is much faster and more accurate than the standard approach [16, 18] of using posterior Gibbs samples in Eq. (6), which results in the unstable harmonic mean approximation to the likelihood – an estimator which has huge variance in theory and in practice [16]. This is important for us because classification speed and accuracy depend on the speed and accuracy of computing the marginal likelihood (Eq. (5)). In summary, to classify a new clip, we use the importance sampling approach defined in Eqs. (6) and (7) to compute the marginal likelihood (Eq. (5)) for each behaviour c (i.e. typical 0, rare 1, 2, ...) and hence the class posterior for that clip (Eq. (4)). This works because of the switching structure of the model. If a clip contains some known rare behaviour, only the sampler for that category can allocate the corresponding rare topics and obtain a high likelihood, while samplers for other categories will only be able to explain typical activities in the clip, thereby obtaining lower likelihood.
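The classification rule of Eqs. (4)-(7) can be sketched as follows: build the mean-field proposal of Eq. (7), draw S samples from it, and form the importance-sampling estimate of Eq. (6), with p(y | α^c) evaluated as a Dirichlet-multinomial over the allowed topics. The iteration count and sample size are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_marginal(x, Phi, alpha, n_iter=20, S=100, rng=None):
    """Importance-sampling estimate of log p(x*|c), Eqs. (5)-(7) (illustrative).

    x : word indices of the clip; Phi : (Nw, Nt) word-topic estimates;
    alpha : (Nt,) class prior with zeros for disallowed topics.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    N, Nt = len(x), len(alpha)
    active = alpha > 0
    # Mean-field proposal, Eq. (7)
    q = np.tile(active / active.sum(), (N, 1))
    for _ in range(n_iter):
        for i, w in enumerate(x):
            other = q.sum(axis=0) - q[i]            # expected counts from other tokens
            qi = (alpha + other) * Phi[w] * active
            q[i] = qi / qi.sum()
    # Importance sampling, Eq. (6), in log space
    a = alpha[active]
    logw = np.empty(S)
    for s in range(S):
        ys = np.array([rng.choice(Nt, p=q[i]) for i in range(N)])
        counts = np.bincount(ys, minlength=Nt)[active]
        log_py = (gammaln(a.sum()) - gammaln(a.sum() + N)
                  + np.sum(gammaln(a + counts) - gammaln(a)))  # Dirichlet-multinomial p(y|alpha)
        log_px_y = np.sum(np.log(Phi[x, ys]))                  # p(x*|y, Phi)
        log_q = np.sum(np.log(q[np.arange(N), ys]))            # proposal density
        logw[s] = log_px_y + log_py - log_q
    return logsumexp(logw) - np.log(S)

# Classification, Eq. (4):  c_hat = argmax_c [ log_marginal(x, Phi, alphas[c]) + log(eps[c]) ]
```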
2.5 Synthesising Topics: Detection of Unknown Rare Behaviours
So far, the proposed MC-ΔLDA model is designed for classifying a new unseen clip as one of the known rare behaviour (or typical) classes. In practice, we
also want to detect other types of rarely occurring behaviours that could interest users, but for which no examples were available during training. A standard strategy for detecting unknown rare behaviours is to detect statistical deviation from the learned model of existing behaviours. Using MC-ΔLDA, this corresponds to computing and thresholding the marginal likelihood of the entire MC-ΔLDA model, p(x^*) = \sum_{c=0}^{N_c} p(x^* | c) p(c), where p(x^* | c) is computed as in Eq. (5). However, this equates to the standard approach of unsupervised behaviour detection and suffers from the same drawbacks as unsupervised topic models. To address this problem, we convert the problem of unknown rare behaviour detection to a MC-ΔLDA classification problem by utilising prior knowledge. Our solution lies in employing existing learned topics to synthesise new topics which are likely to represent interesting (and plausible) unknown rare behaviours in a specific scene, and inserting them as new categories in MC-ΔLDA. To that end, we exploit prior knowledge, e.g. analysis of how known rare topics relate to typical topics. In particular, we explore prior domain knowledge about the spatio-temporal conflict of multiple existing topics to synthesise new topics of unknown behaviours. Although these behaviours do not cover every possible case in the real world, they are usually of particular interest but can easily be missed by standard approaches based on outlier detection [1, 3–5]. The synthesising procedure is as follows:
1. Detecting conflict regions: For an existing topic \phi_t, identify the spatial locations of its dominant words to generate a binary matrix M(\phi_t) in the image domain. Compute and threshold the matrix M = \sum_{t=1}^{N_t} M(\phi_t) and connect components to identify potential conflict regions \{R_{t'}\}, in each of which N_r existing topics spatially overlap. Each region R_{t'} is associated with a map M_{t'} in the image domain indicating the locations of the region.
2. Synthesising new topics: Given a map M_{t'}, compute \phi_{t'} = \sum_{t=1}^{N_r} \phi_t over the N_r existing topics \{\phi_t\} in Φ that share space with the map M_{t'}. Identify the dominant words in \phi_{t'} and normalise their probabilities. Set the values of the non-dominant words to h / \#\phi_{t'}, where h is the mean probability of dominant words in all existing topics Φ and \#\phi_{t'} is the number of non-dominant words in \phi_{t'}. Normalise \phi_{t'} to obtain a multinomial distribution.
Given the synthesised additional topics that represent plausible conflicts among existing topics, we employ the MC-ΔLDA framework for classifying unseen behaviours. This is performed as in Section 2.4 after concatenating the synthesised topics Φ_e = \{\phi_{t'}\} with the existing topics Φ and adding another class label c to represent all unknown behaviours.
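The two-step synthesis procedure can be sketched as below. The dominance threshold, the minimum number of overlapping topics defining a conflict region, and the word-to-cell mapping are all illustrative assumptions; connected components are found with scipy.ndimage.label.

```python
import numpy as np
from scipy.ndimage import label

def synthesise_topics(Phi, word_to_cell, grid_shape, dom_thresh=0.01, min_overlap=2):
    """Synthesise conflict topics from existing ones (Sec. 2.5, illustrative).

    Phi          : (Nt, Nw) topic-word distributions (rows are topics).
    word_to_cell : (Nw,) flat cell index of each visual word in the scene grid.
    grid_shape   : (rows, cols) of the cell grid.
    """
    Nt, Nw = Phi.shape
    ncells = grid_shape[0] * grid_shape[1]

    # Step 1: binary dominance maps M(phi_t), accumulated and thresholded
    maps = np.zeros((Nt, ncells), dtype=bool)
    for t in range(Nt):
        maps[t, word_to_cell[Phi[t] > dom_thresh]] = True
    overlap = maps.sum(axis=0).reshape(grid_shape)
    regions, n_regions = label(overlap >= min_overlap)      # connected components

    h = Phi[Phi > dom_thresh].mean()                         # mean dominant probability
    new_topics = []
    for r in range(1, n_regions + 1):
        region_cells = np.flatnonzero((regions == r).ravel())
        in_region = np.isin(word_to_cell, region_cells)
        sharing = [t for t in range(Nt) if maps[t, region_cells].any()]
        # Step 2: sum overlapping topics, keep dominant words, flatten the rest
        phi_new = Phi[sharing].sum(axis=0) * in_region
        dominant = phi_new > dom_thresh
        phi_new[~dominant] = h / max(int((~dominant).sum()), 1)
        new_topics.append(phi_new / phi_new.sum())
    return new_topics
```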
3 Experiments
3.1 Learning Rare Topics in Simulated Data
To clearly understand our model’s performance, we first evaluate our method using simulated data where ground truth is known (Fig. 3). We defined 11 topics
Fig. 3. Learning MC-ΔLDA to classify simulated data
(2D structural patterns), of which 8 (bar patterns) were typical and 3 (star patterns) were rare topics (Fig. 3(a)). For training, we generated (1) 500 documents (images) of only typical topics and (2) 10 documents for each rare category by sampling from the corresponding αc s (Fig. 3(b)). Note that purely typical documents are visually very similar to those also containing rare topics, as both are dominated by the same typical topics which also share words with the rare topics. Topics Φ learned by our model using this dataset (Fig. 3(b)) are shown in Fig. 3(c). Clearly, despite the extreme conditions of few rare documents (2%), and dominant typical topics within each rare document, MC-ΔLDA is capable of learning correctly all the rare patterns in the training data. We then compared the performance of MC-ΔLDA against that of standard unsupervised LDA [18] for detecting rare topics in additional new testing documents generated separately from the training data. Standard LDA was trained using only typical documents and tested as an outlier detector. By varying the MC-ΔLDA priors p(c|ε), the performance of both models can be compared using Receiver Operating Characteristic (ROC) curves and Area Under ROC values (in brackets), as shown in Fig. 3 (d). These results show a significant advantage of MC-ΔLDA over unsupervised LDA for detecting rare topics. 3.2
Learning Rare Behaviours in Public Space CCTV Videos
We validated the proposed framework using two road traffic video datasets: the MIT dataset [3] (30Hz, 720 × 480 pixels, 1.5 hour) and the QMUL dataset [4] (25Hz, 360 × 288 pixels, 1 hour). Both scenes featured a large number of objects exhibiting complex behaviours concurrently (see Fig. 4 for typical behaviours). To obtain the visual word codebook, the videos were spatially quantised into 72 × 48 cells and 72 × 57 cells respectively, and motion directions within each cell quantised into 4 orientations (up, down, left, right) resulting in codebooks of 13824 and 16416 words respectively. Each dataset was temporally segmented into non-overlapping short video clips of 300 frames each. In each dataset, a number of clips containing two types of rare behaviours were manually identified and labelled into rare categories. The numbers of samples for each behaviour category are detailed in Table. 1. In the MIT dataset, these
Fig. 4. Typical and rare behaviours in the experimental datasets. (a) MIT Typical, (b) Rare 1: left-turn, (c) Rare 2: right-turn; (d) QMUL Typical, (e) Rare 1: U-turn, (f) Rare 2: collision.

Table 1. Number of clips used in the experiments

              MIT dataset                              QMUL dataset
              Typical (400)  Rare 1 (26)  Rare 2 (28)  Typical (200)  Rare 1 (12)  Rare 2 (5)
Training      200            10           10           100            4            2
Testing       200            16           18           100            8            3
are vehicles turning left from below and turning right from the side (Fig. 4 (b) and (c), red arrows). In the QMUL dataset (Fig. 4 (e) and (f), red arrows), the rare behaviours are U-turns at the centre of the scene, and short-cut driving pattern (dangerous) in which vehicles drive into the junction before completion of normal driving patterns (yellow and green arrows), risking a collision. Learning to Model Known Rare Behaviours. We first employed MCΔLDA to learn rare behaviours in the MIT dataset. Here we used 20 topics to represent typical behaviours and 1 topic for each type of rare behaviour. Fig. 5 (a) and (b) illustrate the dominant words in the learned topics for rare behaviours ‘left-turn’ and ‘right-turn’. The learned topics accurately describe the known rare behaviours illustrated by the examples in Fig. 4 (b) and (c). This is despite (1) variations in the spatial and temporal positioning of the rare activities within each training clip, i.e. the rare class generalisation is good; and (2) overwhelming co-occurring typical behaviours and heavily shared visual words between typical and rare behaviours (Fig. 5 (c)), i.e. the model was able to learn rare behaviours despite very biased training data. Discovering rare topics in the QMUL dataset is harder because fewer training samples are available (e.g. 2 for Rare 2). Also the objects were highly occluded and motion patterns were frequently broken. For example, to complete a U-turn, vehicles usually drive to the central area and wait for a break in oncoming
Fig. 5. Learned MC-ΔLDA topics in the MIT dataset for the rare behaviours of interest, (a) Rare behaviour 1 and (b) Rare behaviour 2, and example topics of typical behaviours (c). Colour indicates motion direction: red →, green ↑, blue ←, yellow ↓.
Fig. 6. Learned MC-ΔLDA topics in the QMUL dataset for the rare behaviours of interest ((a) and (b)) and example topics of typical behaviours (c). Colour indicates motion direction: red →, green ↑, blue ←, yellow ↓.
traffic before continuing. Employing one topic for each rare behaviour is therefore insufficient, and we applied two topics for each in this dataset. The results are shown in Fig. 6. Interestingly, rare topics are composed mostly of words from existing typical topics, but co-occurring in an unusual and unique combination. These uniquely composed rare topics will facilitate rare clip detection, because the correct rare model (c > 0) will be able to explain the rare behaviour with the use of a single rare behaviour (up to two topics), whereas the normal model (c = 0) will have to use many normal topics to explain the rare behaviour, thus obtaining lower likelihood (Eq. (5)).
Fig. 7. ROC curves for identifying rare behaviours in unseen data: (a) MIT rare 1, (b) MIT rare 2, (c) QMUL rare 1, (d) QMUL rare 2.
Detecting and Classifying Known Rare Behaviours. We compared the performance of our MC-ΔLDA model on classifying known rare behaviours in unseen data with that of an alternative LDA classifier [8] (LDA-C), in which independent LDA models were learned using clips from each class of training data (see Table 1). For LDA-C, we used 8 topics for each class of clips making the total number of learned topics (24 topics) nearly the same as our MC-ΔLDA model. Classification was performed using the class posterior obtained from the individual LDA model likelihoods and corresponding priors. We first compared the performance of separating clips in one specific class from the others (i.e. detection). We varied the prior p(c|ε) of the target class from 0 to 1 to produce ROC curves and Area Under ROC values. We randomly selected training and testing datasets in three folds and averaged the ROC curves. The results shown in Fig. 7 illustrate clearly the superior performance of MC-ΔLDA over LDA-C. While ROC curves are useful for evaluating detection performance, the simultaneous classification performance among all classes is not measured, which is important as perfect classification accuracy of one specific class could be at the expense of significant mis-classification among other classes. We thus also evaluated full classification performance, using confusion matrices. Note that setting different values of class prior p(c|ε) determines classification performance for both models. In practice, low false alarm rate is critical for an automated rare behaviour detection system and it is a common criterion for setting parameters. We therefore set the priors to values that give a 10% false alarm rate for both models, and the classification performance is compared in Fig. 8. It can be seen clearly that given a low false alarm rate, our MC-ΔLDA yields a superior classification rate for the rare behaviours, with many fewer rare clips mistakenly classified as typical, leading to much higher true positive rate (77% vs. 23% and 73% vs. 28% for MIT and QMUL datasets respectively). The inaccurate classification using LDA-C is the result of learning separate LDA models from different classes of corpa without sharing visual features. As a result, topics for the same type of behaviours would be represented differently (see the last two columns in Fig. 9). This duplicated representation of topics introduced noise to the classification. In contrast, this source of noise was removed in our MC-ΔLDA model, which jointly learned the exactly same topic representation for typical behaviours across all clips. Furthermore, as shown in Fig. 9, among the 8 learned topics there are topics that correspond to rare behaviour.
Fig. 8. Classification comparison given a 10% false alarm rate: (a) MIT dataset, (b) QMUL dataset.

Fig. 9. Example topics learned using LDA-C on the MIT dataset. Topics in the first column correspond to rare behaviours and the others correspond to typical behaviours.
However, a key difference between LDA-C and our MC-ΔLDA is that the former is not a multi-instance learning method – it solely learns a bag/clip classifier without automatically learning exactly which topics correspond to rare behaviours and thus should be relied upon for detection. In contrast, association between rare topics and rare behaviours was automatically learned using MC-ΔLDA and therefore this resulted in superior classification performance. We also compared the performance of MC-ΔLDA with unsupervised LDA for behaviour detection by treating clips with any type of rare behaviour as outliers. To do so, an unsupervised LDA model with 20 topics was learned using only typical behaviour clips and the normalised likelihood of an unseen clip was used for detection. The results are shown in Fig. 10. Again MC-ΔLDA significantly outperformed LDA in both datasets. Unsupervised LDA was better at detecting rare behaviours in the QMUL scene than in the MIT scene. This is because that, although cluttered, the rare behaviours in the QMUL dataset were caused by irregular co-occurrences of visual words from multiple objects which is exactly what LDA is designed for. The MIT scene is harder for LDA because visual features of rare behaviours were very subtle and overwhelmed by surrounding typical topics. In contrast, MC-ΔLDA reliably learned a precise description of all rare behaviours in both scenes making them easily detectable.
Fig. 10. Comparing MC-ΔLDA with unsupervised LDA for rare behaviour detection: (a) MIT dataset, (b) QMUL dataset.
Fig. 11. Conflict regions and synthesised topics: (a) conflict regions, (b) synthesised topic 1, (c) synthesised topic 2, (d) synthesised topic 3.
Fig. 12. Comparison for detecting unknown behaviours in unseen data: (a) examples of unknown behaviours, (b) ROC curves for detection.
Detecting Unknown Rare Behaviours. Finally, we conducted an experiment to illustrate the effectiveness of the MC-ΔLDA model on detecting unknown rare behaviours by synthesising topics using prior contextual knowledge. Using the method described in Sec. 2.5, three conflict regions in the MIT scene with the potential for collision were automatically identified (Fig. 11 (a)). The three synthesised topics generated by our model are shown in Fig. 11 (b)-(d) respectively. Interestingly, these synthesised topics are in fact highly plausible, showing the potential collisions either between vehicles and pedestrians or among vehicles. We used 11 clips containing different types of unknown interesting behaviours, including 'U-turn', 'Jay-walking' and 'Vehicle-pedestrian near collision', to evaluate the performance (see Fig. 12 (a)). MC-ΔLDA with synthesised topics was compared against two detection methods: (1) the normalised marginal likelihood of
MC-ΔLDA over all three known classes; (2) unsupervised LDA using the normalised likelihood. Our results in Fig. 12 (b) show significant improvement by the proposed new method. Indeed, using MC-ΔLDA likelihoods for detection bears exactly the same drawback as unsupervised LDA – being less sensitive to subtle behaviours without distinctive visual features. In contrast, MC-ΔLDA with synthesised topics incorporating prior contextual knowledge about risky/interesting potential local behaviour conflicts is more sensitive to less obvious unknown rare behaviours.
4 Conclusion
We introduced Multi-Class Delta Latent Dirichlet Allocation (MC-ΔLDA), a novel approach for learning to detect and classify rare behaviours in video with weak supervision. By constraining a topic model prior depending on a video clip class, we are able to learn explicit models of rare behaviours jointly with “background” typical behaviours. Subsequently, by using importance sampling to integrate out the topics of a new clip under this prior, we can accurately determine the behaviour class of an unseen clip. This is O(SN Nc ) for S samples, N words and Nc behaviour classes, which was real-time for the MIT dataset. Standard unsupervised outlier detection methods [3–5] often fail to detect rare behaviours because each clip contains mostly irrelevant typical features. In contrast to those approaches, the proposed method has two further advantages: the ability to classify rare behaviours into different classes, and the ability to locate the offending sub-regions of the scene. With conventional supervised classification, lack of rare class data prevents good model learning. Our approach, which can be considered as a supervised multi-instance learning generalisation of LDA, mitigates these problems effectively by performing joint localisation and detection during learning. The data are used efficiently, as typical behaviour features are shared between all classes, and the few rare class parameters in the model focus on learning the unique features of the associated rare classes. Compared to LDA-C, this sharing of typical features improves performance by eliminating a source of classification noise (different typical representation), and ensuring each rare behaviour is well modeled (via the constrained prior), even in the extreme case of only a single rare class training example. Important areas for future research include generalising MC-ΔLDA to online adaptive learning for adapting to changing data statistics, and active learning where clip labels can be incrementally and intelligently queried to discover new behaviour classes of interest and learn to classify them. Finally, we proposed a novel although partial solution to the issue of detecting new/unknown rare behaviours, using very general prior knowledge about the scene to synthesise new topics that predict new behaviours (Sec. 2.5). A major area of future research is to formulate a more general model for transfer learning of existing rare behaviour topics to predict new and unknown rare behaviour topics.
References 1. Xiang, T., Gong, S.: Video behavior profiling for anomaly detection. PAMI 30, 893–908 (2008) 2. Saleemi, I., Shafique, K., Shah, M.: Probabilistic modeling of scene dynamics for applications in visual surveillance. PAMI 31, 1472–1485 (2009) 3. Wang, X., Ma, X., Grimson, E.: Unsupervised activity perception by hierarchical bayesian models. PAMI 31, 539–555 (2009) 4. Li, J., Gong, S., Xiang, T.: Global behaviour inference using probabilistic latent semantic analysis. In: BMVC (2008) 5. Hospedales, T., Gong, S., Xiang, T.: A markov clustering topic model for behaviour mining in video. In: ICCV (2009) 6. Xiang, T., Gong, S.: Beyond tracking: Modelling activity and understanding behaviour. IJCV 61, 21–51 (2006) 7. Robertson, N., Reid, I.: A general method for human activity recognition in video. Computer Vision and Image Understanding 104, 232–248 (2006) 8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003) 9. Viola, P., Platt, J., Zhang, C.: Multiple instance boosting for object detection. In: NIPS (2005) 10. Nguyen, M.H., Torresani, L., de la Torre, F., Rother, C.: Weakly supervised discriminative localization and classication: a joint learning process. In: ICCV (2009) 11. Blei, D., McAuliffe, J.: Supervised topic models. In: NIPS, vol. 21 (2007) 12. Andrzejewski, D., Mulhern, A., Liblit, B., Zhu, X.: Statistical debugging using latent topic models. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladeniˇc, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 6–17. Springer, Heidelberg (2007) 13. Markou, M., Singh, S.: A neural network-based novelty detector for image sequence analysis. PAMI 28, 1664–1677 (2006) 14. Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: KDD, pp. 504–509 (2006) 15. Minka, T.P.: Estimating a dirichlet distribution. Technical report, Microsoft (2000) 16. Wallach, H., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for topic models. In: ICML (2009) 17. Buntine, W.: Estimating likelihoods for topic models. In: Zhou, Z.-H., Washio, T. (eds.) ACML 2009. LNCS, vol. 5828, pp. 51–64. Springer, Heidelberg (2009) 18. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101, 5228–5235 (2004)
Character Energy and Link Energy-Based Text Extraction in Scene Images Jing Zhang and Rangachar Kasturi Department of Computer Science and Engineering, University of South Florida {jzhang2,R1K}@cse.usf.edu
Abstract. Extracting text objects from scene images is a challenging problem. In this paper, by investigating the properties of single characters and text objects, we propose a new text extraction approach for scene images. First, character energy is computed based on the similarity of stroke edges to detect candidate character regions; then link energy is calculated based on the spatial relationship and similarity between neighboring candidate character regions to group characters and eliminate false positives. We applied the approach to the ICDAR 2003 dataset. The experimental results demonstrate the validity of our method.
1 Introduction
Text objects occurring in images contain much semantic information related to the content of the images. Therefore, if this text information can be extracted efficiently, we can overcome the semantic gap [1] and provide more reliable content-based access to the image data. Although many approaches have been proposed in the literature, text extraction is still a challenging problem due to varying text sizes and orientations, complex backgrounds, varying lighting, and distortions. In recent years, many supervised text extraction approaches using Support Vector Machines (SVM), Boosting algorithms, Artificial Neural Networks (ANN), classifier fusion techniques, and so on have been proposed to separate text from other objects. However, the quality and quantity of training samples utilized in these supervised methods may affect their performance significantly. Typically more than five thousand training images are needed to train the classifiers, and the insufficiency of training samples covering different lighting conditions, distortions, and languages may still limit their applications. In this paper, we propose a new unsupervised text extraction approach based on character energy computed using the similarity of stroke edges and link energy computed using the spatial relationship and similarity between neighboring candidate characters. The advantages of the proposed method are: (1) text detection is based on the inherent properties of the stroke edges of characters, hence it is robust to the size, color, and orientation of text and can discriminate text from other objects efficiently; (2) based on the similarity between every two neighboring candidate characters, the properties of text objects are well explored and can be utilized to group characters into text objects and remove false positives.
The rest of the paper is organized as follows: Section 2 includes a review of the related work. Section 3 describes and analyzes the proposed text detection approach. Experimental results are given in Section 4. Finally, we draw conclusions in Section 5.
2 Related Work
Many papers on text extraction have appeared in the recent literature. Chen et al. [2], Jung et al. [3], and Zhang et al. [4] present comprehensive surveys of text extraction methods. Generally, unsupervised text extraction approaches work in a bottom-up way by separating an image into small regions and then grouping character regions into text regions. The different region properties of text and background are utilized to extract text objects. Shivakumara et al. [5] propose an algorithm to detect video text in low and high contrast images, which are classified by analyzing the edge difference between Sobel and Canny edge detectors. After computing edge and texture features, low-contrast and high-contrast thresholds are used to extract text objects from low and high contrast images separately. Liu et al. [6] use an intensity histogram based filter and an inner distance based shape filter to extract text blocks and eliminate false positives whose intensity histograms are similar to those of their adjoining areas, as well as components coming from the same object. Subramanian et al. [7] use character strokes to extract text objects. Three line scans, defined as sets of pixels along horizontal lines of an intensity image, are combined to locate flux regions and flux points, which show significant changes of intensity. Character strokes are detected using spatial and color-space constraints to find the regions with small variation in color. In the extraction step, Lee et al. [8] use a texture feature based text localizer to find dominant text and background colors. After that, by taking each dominant color as a cluster center, a K-means algorithm is adopted to generate candidate character regions. In the verification step, an MRF model combined with a co-linearity weight is used to separate text and non-text regions. Based on Itti's visual attention model, Sun et al. [9] first compute a Visual Attention (VA) map to get the region of interest, then a Sobel edge detector and connected component analysis are used to extract candidate character regions. Finally, the VA map is used as a filter to identify text regions. Our previous work [10] uses the idea of the histogram of oriented gradients to capture the similarity of stroke edges. The range of gradient orientation is divided into 4 bins and the histograms of the 4 bins are compared to detect candidate character regions. Graph spectrum is used to group and verify the text regions. Compared with [10], the proposed approach computes not only the length similarity of stroke edges but also curvature and stroke width, and therefore achieves better performance. Recently, while this paper was under review, we came across a paper [11] by Epshtein et al., which uses a stroke width transform based method to detect text. Both [11] and our approach use the gradient vector of an edge point
to find a corresponding edge point, and the distance between the two points is considered as a stroke width. However, closed boundaries are not detected in [11], hence only the consistency of stroke width is used to extract candidate character regions, and the corresponding points must be calculated twice in order to capture bright text on a dark background and vice-versa. Our approach is based on closed boundaries; besides stroke width, an additional feature, the average difference of gradient directions of a closed boundary, is used for text detection, and corresponding points are found in one step.
3 Proposed Approach
The proposed text extraction consists of two steps. First, based on the calculated properties of characters, we compute character energies to localize candidate character regions. Second, by assuming that text objects always contain more than one character, we compute link energies between every two neighboring candidate character regions to group characters and remove false positives.

3.1 Character Energy Based Text Detection
Extracting good text features which can discriminate text from other objects is a critical and challenging problem for text detection. Some commonly used features include color, contrast, aspect ratio, edge intensity, stroke straightness, stroke width, histogram of oriented gradients, and so forth. In this section, by exploring the properties of characters, we compute the proposed character energy based on the similarity of stroke edges. The basic idea of the proposed approach rests on two observations: (1) the boundary of a character is typically closed, because variations in the intensity of characters are small and the contrast between characters and background is typically high; (2) the edges of a character typically appear as pairs, that is, the boundary of a character can be considered as a union of two edge sets, S1 and S2, and the edges in S1 have similar lengths, orientations, and curvatures to the edges in S2. We take Fig. 1 as an example to illustrate the two observations. The four characters in Fig. 1-a have closed boundaries, and all boundaries are composed of two edge sets, shown in blue and red in Fig. 1-b. We can see that the red edges and blue edges have similar properties. Taking advantage of the above two observations, we are able to detect candidate character regions in an image. First, the closed boundaries are extracted
Fig. 1. The edges of strokes shown in blue and red colors
by a zero-crossing based edge detector. The unclosed boundaries and the closed boundaries that are too big or too small are considered as non-text boundaries and erased from the edge map. The regions within the closed boundaries (excluding holes) are labeled as 8-connected objects. Second, for each labeled object, we calculate the gradient vector for every boundary point, based on which the boundary can be divided into two sets. Assume that the gradient angle of a boundary point P is θ: if 0 ≤ θ < π, P belongs to the set S1; if π ≤ θ < 2π, P belongs to the set S2. In Fig. 2-a and 2-d, the blue arrows show the gradient vectors of the character ‘A’ and the first ‘C’ in Fig. 1-a. The boundary points in S1 and S2 are indicated by black asterisks and red circles respectively. Fig. 2-b and 2-e are close-ups of the regions bounded by green boxes in Fig. 2-a and 2-d. Third, we find the corresponding point for each boundary point. Given an arbitrary point x in S1 (S2) and its gradient vector Vx, we are able to find a point y in S2 (S1) along the direction of Vx using a straight-line equation. We call y the corresponding point of x. One pair of corresponding points (x and y) is shown in Fig. 2-b. Note that the corresponding point of y may or may not be x. In Fig. 2-e, the corresponding point of x is y, but the corresponding point of y is x′. All corresponding points of ‘A’ and ‘C’ are connected by red lines in Fig. 2-c and 2-f. As mentioned before, the stroke edges of a character in sets S1 and S2 have similar length, orientation, and curvature, based on which we can conclude that: (1) S1 and S2 of a character should have similar numbers of boundary points, e.g., in Fig. 2, the number of black asterisks is close to the number of red circles.
Fig. 2. Gradient vectors and corresponding points
(2) The gradients of a point and its corresponding point should have approximately opposite directions, e.g., in Fig. 2, the directions of the blue arrows of most corresponding pairs are approximately opposite. (3) The distances between the points and their corresponding points have similar values, because the change of the stroke width is usually small, e.g., in Fig. 2, most red lines have similar lengths. Consequently, we can define the three features shown below. The first two features measure the similarity of stroke edges and are used to compute the character energy of an object. The third feature is used to measure the similarity between two objects in Section 3.2.

(1) Averaged Difference of Gradient Directions (Gangle): Let N denote the number of boundary points of an object, P^(i) the i-th point with corresponding point corr(P^(i)), and θ_g^(i) and corr(θ_g^(i)) the gradient directions of P^(i) and corr(P^(i)). The difference of gradient directions at P^(i) is defined as:

$$G_{angle}^{(i)} = \Bigl|\pi - \bigl|\theta_g^{(i)} - \mathrm{corr}(\theta_g^{(i)})\bigr|\Bigr| \quad (1)$$

Clearly, when the gradient direction of P^(i) is opposite to the gradient direction of corr(P^(i)), G_angle^(i) = 0. Accordingly, Gangle is defined as:

$$G_{angle} = \frac{1}{N}\sum_{i=1}^{N} G_{angle}^{(i)} \quad (2)$$
We can see that Gangle measures the averaged difference of gradient directions over all pairs of corresponding points of an object. A smaller Gangle means a higher similarity of stroke edges.

(2) Fraction of Corresponding Pairs (Fcorr): Let α be a pre-defined angle threshold. Fcorr is defined as:

$$F_{corr} = \frac{1}{N}\sum_{i=1}^{N} h\bigl(G_{angle}^{(i)}, \alpha\bigr) \quad (3)$$

where

$$h\bigl(G_{angle}^{(i)}, \alpha\bigr) = \begin{cases} 1 & \text{if } G_{angle}^{(i)} \le \alpha \\ 0 & \text{otherwise} \end{cases} \quad (4)$$
We can see that Fcorr is the fraction of corresponding pairs whose gradient differences are less than α among all corresponding pairs. A bigger Fcorr means a higher similarity of stroke edges.

(3) Vector of Stroke Width (Vwidth): Let d^(i) represent the distance between P^(i) and corr(P^(i)). Each d^(i) can be considered as a possible stroke width. Let D = {d^(1), ..., d^(N)} and hist(D) be the histogram of D. For the characters of a text object, there is typically a dominating stroke width. Therefore they should
Fig. 3. The histograms of stroke widths of ‘ACCV’ in Fig. 1-a
have similar hist(D) with one sharp peak, as seen in Fig. 3, which shows hist(D) of ‘ACCV’ in Fig. 1-a. Let hist(D, j) be the value of the histogram of D at the j-th (1 ≤ j ≤ J) bin, where J is set to the maximum thickness of a character stroke. We define:

$$W_m = \operatorname*{argmax}_{1 \le j \le J} \; hist(D, j) \quad (5)$$

$$C_m = \frac{hist(D, W_m)}{N} \quad (6)$$

We can see that Wm is the bin where hist(D) reaches its maximum, hence it can be considered as the dominating stroke width; Cm is the fraction of the number of edge pixels having the dominating width to the total number of edge pixels. A more robust estimate can be obtained by also using the two neighboring bins:

$$C_m \leftarrow w_1\, C_{m-1} + w_2\, C_m + w_3\, C_{m+1} \quad (7)$$

where w1, w2, and w3 are the weights and C_{m±1} = hist(D, W_m ± 1)/N. We define the vector of stroke width as:

$$V_{width} = [W_m, \; C_m] \quad (8)$$
Because the stroke width is obtained by computing the distances of all pairs of corresponding points, it is much more accurate than the stroke width calculated by line scans, and it is robust to size and rotation. Table 1 lists the values of Gangle, Fcorr, and Vwidth (w1 = 0.75, w2 = 1, and w3 = 0.75) of ‘ACCV’ in Fig. 1-a. As expected, the four characters have small Gangle values, high Fcorr values, and similar Vwidth values.

Table 1. Three features of ‘ACCV’ in Fig. 1-a

          ‘A’           ‘C’           ‘C’           ‘V’
Gangle    14.6635       8.1241        8.2085        11.3858
Fcorr     0.6828        0.9102        0.8857        0.7990
Vwidth    [8, 0.6565]   [8, 0.8756]   [8, 0.8859]   [8, 0.8192]
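For illustration, the three features above can be computed from the corresponding-point pairs roughly as follows. This is a minimal NumPy sketch, not the authors' code: it assumes gradient directions in radians, example values for the angle threshold α and the maximum stroke thickness J, and that the corresponding points have already been found by following each boundary point's gradient direction.

```python
import numpy as np

def stroke_edge_features(theta, theta_corr, dists, alpha=np.deg2rad(30), J=20,
                         weights=(0.75, 1.0, 0.75)):
    """theta, theta_corr: gradient directions (radians) of each boundary point and of
    its corresponding point; dists: distances d(i). alpha, J are assumed values."""
    # (1) averaged difference of gradient directions, Eqs. (1)-(2)
    g = np.abs(np.pi - np.abs(theta - theta_corr))
    G_angle = g.mean()
    # (2) fraction of pairs with nearly opposite gradients, Eqs. (3)-(4)
    F_corr = np.mean(g <= alpha)
    # (3) stroke-width vector, Eqs. (5)-(8): dominating bin of the distance histogram
    hist, _ = np.histogram(dists, bins=np.arange(1, J + 2))
    W_m = int(np.argmax(hist)) + 1
    C = hist / len(dists)
    w1, w2, w3 = weights
    C_m = (w1 * (C[W_m - 2] if W_m >= 2 else 0.0) + w2 * C[W_m - 1]
           + w3 * (C[W_m] if W_m < J else 0.0))
    return G_angle, F_corr, (W_m, C_m)

# toy check: nearly opposite gradients and stroke widths around 8 pixels
rng = np.random.default_rng(0)
theta = rng.uniform(0, np.pi, 400)
print(stroke_edge_features(theta, theta + np.pi + rng.normal(0, 0.1, 400),
                           rng.integers(7, 10, 400)))
```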
Fig. 4. Gangle and Fcorr of two example images ((a), (d): original images; (b), (e): Gangle; (c), (f): Fcorr)
To demonstrate the abilities of Gangle and Fcorr to discriminate text from other objects, two examples are shown in Fig. 4. Fig. 4-a and 4-d are original images from the ICDAR 2003 text locating dataset. Gangle and Fcorr of the two images are shown as intensity images in Fig. 4-b, 4-e, 4-c, and 4-f. For all characters, Gangle is less than 15° and Fcorr is bigger than 0.65. Based on Gangle and Fcorr, we define the character energy, EChar, as:

$$E_{Char} = \frac{1}{2}\left(\frac{\pi - G_{angle}}{\pi} + F_{corr}\right) \quad (9)$$
Therefore, 0 ≤ EChar ≤ 1, and when Gangle = 0 and Fcorr = 1.0, EChar reaches the maximum value 1. An object with a bigger EChar has a higher probability of being a character. Fig. 5 shows EChar and the thresholded images obtained by eliminating the objects whose EChar < T (T = 0.75). As illustrated in Fig. 4 and Fig. 5, the two proposed text features, Gangle and Fcorr, can discriminate text objects from other objects efficiently based on the similarity of stroke edges, and all characters in Fig. 5 have high EChar. Meanwhile, because the similarity of stroke edges is an inherent property of characters, the presented text detection method is robust to the font, size and orientation of characters.
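A minimal check of Eq. (9) and the thresholding step, assuming Gangle is expressed in radians (the values in Table 1 appear to be in degrees, so they are converted here):

```python
import numpy as np

def character_energy(G_angle, F_corr):
    """Eq. (9): combine the two stroke-edge similarity features into [0, 1]."""
    return ((np.pi - G_angle) / np.pi + F_corr) / 2.0

# a character-like region passes the T = 0.75 threshold, a clutter region does not
print(character_energy(np.deg2rad(14.7), 0.68) >= 0.75)   # True
print(character_energy(np.deg2rad(80.0), 0.30) >= 0.75)   # False
```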
Fig. 5. EChar and thresholded results ((a), (c): EChar; (b), (d): EChar ≥ 0.75)
3.2 Link Energy Based Text Grouping
The approach described in Section 3.1 uses the properties extracted from a single character to detect candidate character regions. Besides that, we know that the characters of a text object typically have similar properties, such as color, size, font, orientation, and the interval between characters, which can also be utilized for text detection. In this section, by assuming that a text object always contains more than one character, we investigate the spatial relationship and the property similarity between two neighboring regions to compute the link energy, which is used to group characters into text objects and remove false positives. First, we find the neighbors of a candidate character region. Let Ri and Rj be two candidate character regions with widths WRi and WRj, heights HRi and HRj, and centroids CRi and CRj. Rj is said to be a neighboring region of Ri if

$$dist(C_{R_i}, C_{R_j}) \le w_d \cdot \min\bigl(\max(W_{R_i}, H_{R_i}), \max(W_{R_j}, H_{R_j})\bigr) \quad (10)$$

where dist(CRi, CRj) is the distance between CRi and CRj, and wd is a weight threshold. Second, after obtaining the neighbors of every candidate region, we compute the link energy, ELink, based on the similarity between the two connected regions. The similarity is calculated based on the following features: color, width, height, aspect ratio, as well as the stroke width (Swidth) computed in Section 3.1. Therefore, ELink(i,j) between Ri and Rj is defined as:

$$E_{Link(i,j)} = \frac{1}{5}\sum_{n=1}^{5} w_n \cdot Simil_{(i,j)}^{feature(n)} \quad (11)$$

where Simil_{(i,j)}^{feature(n)} is the similarity of the n-th feature between Ri and Rj, feature = {color, width, height, aspect ratio, stroke width}, and wn is the weight. We may let w1 be a small value or 0 to detect text objects whose characters are in various colors.
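The neighbour test of Eq. (10) and the link energy of Eq. (11) might be sketched as follows. The per-feature similarity function is not spelled out in the text, so a simple min/max ratio in [0, 1] is assumed here, and the values of wd and the weights wn are illustrative only.

```python
import numpy as np

def are_neighbours(R_i, R_j, w_d=2.0):
    """Eq. (10): centroid distance small relative to the region sizes (w_d assumed)."""
    d = np.hypot(R_i["cx"] - R_j["cx"], R_i["cy"] - R_j["cy"])
    return d <= w_d * min(max(R_i["w"], R_i["h"]), max(R_j["w"], R_j["h"]))

def link_energy(R_i, R_j, weights=(0.5, 1.0, 1.0, 1.0, 1.0)):
    """Eq. (11): weighted mean of five feature similarities between two regions."""
    feats = ["colour", "w", "h", "aspect", "stroke_w"]
    sims = [min(R_i[f], R_j[f]) / max(R_i[f], R_j[f]) for f in feats]  # assumed similarity
    return np.mean([w * s for w, s in zip(weights, sims)])

# toy regions: two similar character candidates give a high link energy
A = {"cx": 10, "cy": 20, "w": 18, "h": 30, "colour": 200, "aspect": 0.60, "stroke_w": 8}
B = {"cx": 32, "cy": 21, "w": 20, "h": 31, "colour": 190, "aspect": 0.65, "stroke_w": 8}
print(are_neighbours(A, B), link_energy(A, B))
```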
We use circles to illustrate ELink and the spatial relationships of candidate character regions: (1) every two neighboring regions are connected by a circle; (2) the intensity of a circle indicates the value of ELink, and higher intensity means bigger ELink; (3) the size of a circle indicates the interval between two neighboring regions, and a bigger circle means a longer interval; (4) the color of a circle indicates the direction θ of the line connecting the centroids of the two neighboring regions: blue for −15° < θ < 15°, green for 15° ≤ θ < 75°, magenta for −75° < θ ≤ −15°, and red for 75° ≤ θ ≤ 90° or −90° ≤ θ ≤ −75°. Therefore, the regions that are connected by circles with the same color, similar sizes, and high intensities have a high probability of being text objects. Fig. 6-a and 6-d show ELink for two examples. Third, EChar computed in Section 3.1 and ELink are combined to obtain the text energy, EText:

$$E_{Text} = E_{Char} \times E_{Link} \quad (12)$$

EText denotes the probability of a candidate text region being a true positive. EText of the two images in Fig. 5 is illustrated in Fig. 6-b and 6-e. Compared with Fig. 6-a and 6-d, the circles of false positives become much lighter than the circles of text objects, which means that the text objects have higher text energies than
Fig. 6. Text extraction by using EText ((a), (d): ELink; (b), (e): EText; (c), (f): EText ≥ 0.5)
Fig. 7. The bounding boxes of characters and text objects
other objects. The thresholded EText images are shown in Fig. 6-c and 6-f with a threshold T = 0.5. We can see that all characters are detected and grouped by the connected circles; meanwhile, all false positives are removed successfully. Finally, the bounding boxes for each character and text object are generated, as shown in Fig. 7. The characters localized by EChar are bounded by green boxes and the text objects extracted by EText are bounded by blue boxes.
4 Experimental Results
To evaluate the performance of the proposed method, we applied it to the ICDAR 2003 text locating dataset. Fig. 8 shows some detection results and the corresponding text energies. The characters are bounded by green boxes and the text objects are bounded by blue boxes (for light backgrounds) or yellow boxes (for dark backgrounds). We can see that our method can detect most text objects successfully, including text with different colors (Fig. 8-a), different lighting (Fig. 8-d), complex background (Fig. 8-c), different sizes (Fig. 8-d), different stroke widths (Fig. 8-e), flexible surfaces (Fig. 8-f), and low contrast (Fig. 8-i and 8-j). The testing set of the ICDAR 2003 text locating competition dataset has 249 indoor and outdoor images with various resolutions.

Table 2. Detection results on the ICDAR 2003 dataset

Algorithm          Precision   Recall
Ashida [12]        0.55        0.46
HWDavid [12]       0.44        0.46
Wolf [12]          0.30        0.44
Liu [6]            0.66        0.46
Zhang [10]         0.67        0.46
Proposed Method    0.73        0.62
Fig. 8. Detection results on ICDAR Dataset 2003
The detection results of our method on this dataset are listed in Table 2 along with the best three algorithms reported in [12] and the algorithms proposed in [6] and [10]. We can see that the proposed text extraction approach has the highest precision rate of 73% and recall rate of 62%; it achieves better performance than the other listed algorithms. The limitations of the proposed method are illustrated in Fig. 9. (1) In Fig. 9-a, the seven vertical black bars are incorrectly marked as one text object,
Fig. 9. The limitations of the proposed approach
because they have high similarities of stroke edges and are almost the same in width, length, aspect ratio, and color. This limitation can be solved by employing the shape filter proposed in [6] to remove repeated objects. (2) The proposed approach fails to detect the closed boundaries of the text objects in Fig. 9-b due to the transparent background. More color-based and region-based information is needed to solve this limitation. (3) Because we assume that text objects always contain more than one character, the connected characters and single characters, as shown in Fig. 9-c and 9-d, are removed as false positives.
5 Conclusions
In this paper, by analyzing the properties of single characters and text objects, we propose a new text extraction approach based on character and link energies. Character energy measures the probability of an object being a character, based on the observation that the edges of a stroke typically have similar length, direction and curvature. The similarity of stroke edges is computed using two proposed features, the averaged difference of gradient directions and the fraction of corresponding pairs. Link energy measures the probability of neighboring candidate character regions forming a text object. It is computed based on the spatial relationship and the similarities of color, width, height, aspect ratio, and stroke width between every two neighboring regions. Text energy is the product of character energy and link energy and is used to detect text objects. We use circles to indicate detected text objects: the regions that are connected by circles with the same color, similar sizes, and high intensities are marked as text objects. We evaluated the proposed text detection method on the ICDAR 2003 text locating dataset. The detection results demonstrate that our approach performs better than the algorithms in [6], [10] and [12]. Our future work is to remove the limitations discussed in Section 4 and to apply the proposed approach to videos.
References
1. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. PAMI 22(12), 1349–1380 (2000)
2. Chen, D., Luettin, J., Shearer, K.: A survey of text detection and recognition in images and videos. Inst. Dalle Molle d'Intelligence Perceptive Research Report, 00-38 (2000)
3. Jung, K., Kim, K.I., Jain, A.K.: Text information extraction in images and video: a survey. Pattern Recognition, 977–997 (2004)
4. Zhang, J., Kasturi, R.: Extraction of Text Objects in Video Documents: Recent Progress. In: The Eighth International Workshop on DAS, pp. 5–17 (2008)
5. Shivakumara, P., Huang, W., Phan, T., Tan, C.L.: Accurate Video Text Detection Through Classification of Low and High Contrast Images. Pattern Recognition 43(6), 2165–2185 (2010)
6. Liu, Z., Sarkar, S.: Robust outdoor text detection using text intensity and shape features. In: ICPR (2008), doi:10.1109/ICPR.2008.4761432
7. Subramanian, K., Natarajan, P., Decerbo, M., Castanon, D.: Character-Stroke Detection for Text-Localization and Extraction. In: ICDAR, pp. 33–37 (2007)
8. Lee, S., Cho, M., Jung, K., Kim, J.H.: Scene Text Extraction with Edge Constraint and Text Collinearity. In: ICPR (2010)
9. Sun, Q., Lu, Y., Sun, S.: A Visual Attention based Approach to Text Extraction. In: ICPR (2010)
10. Zhang, J., Kasturi, R.: Text Detection using Edge Gradient and Graph Spectrum. In: ICPR (2010)
11. Epshtein, B., Ofek, E., Wexler, Y.: Detecting Text in Natural Scenes with Stroke Width Transform. In: CVPR (2010)
12. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: ICDAR 2003 Robust Reading Competitions. In: ICDAR (2003)
A Novel Representation of Palm-Print for Recognition G.S. Badrinath and Phalguni Gupta Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India-208016 {badri,pg}@cse.iitk.ac.in
Abstract. This paper proposes a novel palm-print feature extraction technique which is based on binarising the difference of Discrete Cosine Transform coefficients of overlapping circular strips. The binary features of the palm-print are matched using the Hamming distance. The system is evaluated using the PolyU database consisting of 7,752 images. A procedure to extract the palm-print for the PolyU dataset is proposed and is found to extract a larger area compared to the preprocessing technique in [1]. Variation in the brightness of the extracted palm-print is corrected and the contrast of its texture is enhanced. Compared to the systems in [1, 2], the proposed system achieves a higher Correct Recognition Rate (CRR) of 100% with a lower Equal Error Rate (EER) of 0.0073% at low computational cost.
1 Introduction
Biometric-based identification/verification systems have found wide application in commercial and law-enforcement settings. Systems based on fingerprint features are the most widely used, while those based on iris features are considered to be the most reliable [3]. The palm-print is the region between the wrist and the fingers and is a relatively new biometric trait. It has features like texture, wrinkles, principal lines, ridges and minutiae points that can be used for its representation [4]. Like other biometric traits used for identification/verification, the palm-print is also believed to have the critical properties of universality, uniqueness, permanence and collectability [5]. Using palm-prints over other biometric technologies has several advantages. Palm-prints have relatively stable and unique features. Collection of data is easy and non-intrusive, and the co-operation required from the user to collect data is low. Devices to collect the data are economical, and the trait provides high efficiency using low-resolution images. Systems based on palm-prints have high user acceptability [5]. Furthermore, a palm-print also serves as a reliable biometric trait because the print patterns are found to be distinct even in mono-zygotic twins [6]. Recently, palm-print based recognition has received wide attention from researchers, and efforts have been made to build palm-print based recognition systems using structural features of the palm-print like crease points [7], line features [8] [9], datum points [10] and local binary pattern histograms [11]. There also exist systems based on statistical features of the palm-print extracted using
Fig. 1. Block Diagram of the Steps Involved in Building the Proposed Palm-print based Recognition System
Fourier transforms [12], Independent Component Analysis [13], Karhunen-Loeve transforms [14], the DCT [15] [16], Wavelet transforms [17], the Stockwell Transform [18], Gabor filters [1], Fisherpalms [19], Speeded Up Robust Features [20], Zernike moments [21], Fusion Code [22], Competitive Code [23] and Ordinal Code [2]. Furthermore, there exist multi-modal biometric systems which combine the palm-print with other traits like fingerprint [14], hand geometry [24] and face [25] to improve the accuracy. Most of the work reported in the literature is evaluated using the Polytechnic University of Hong Kong (PolyU) database [26] and uses the preprocessing technique proposed in [1], which extracts a 128 × 128 palm-print. However, a few major issues remain open in this field, such as determining the information in the palm-print that distinguishes one class from the others, and reducing the feature extraction and matching complexity for practical applications, particularly for identification, where matching has to be performed against many users in the database. This paper presents an efficient palm-print based recognition technique. At the preprocessing stage it selects a larger area of interest for feature extraction, as compared to previously known methods. Variations of the DCT coefficients of overlapping circular strips are used to generate features from the palm-print, and these features are used for matching. The paper is organised as follows. The next section describes the Discrete Cosine Transform (DCT), which is used to extract features from the palm-print region. Section 3 discusses the proposed method to extract the region of interest (ROI), and the ROI is enhanced to obtain good quality features. Section 4 discusses the method used to extract features, while the matching strategy is presented in Section 5. The experimental results are analysed in Section 6. Conclusions are given in the last section.
2 The Discrete Cosine Transform
The Discrete Cosine Transform (DCT) [27] is a real-valued transform with very low computational complexity and has a variance distribution closely resembling that of the Karhunen-Loeve transform (KLT). It has a strong energy compaction
property and is widely used for image compression. Further, fast computation techniques for the DCT [28] have made it attractive for pattern recognition [29] [15], [16], [30]. It is found to produce promising results for face and iris recognition. Due to its lower computational complexity relative to the KLT, it is considered as a candidate technique in lieu of the KLT for face recognition [29]. There is no transform which could be considered optimal for palm-print feature extraction and recognition. However, these properties have motivated the use of the DCT as an effective and efficient technique for non-conventional feature extraction. There are several variants, but the most commonly used one operates on a real sequence A_n of length W to generate coefficients CT_k [27] as:

$$CT_k = s(k)\sqrt{\frac{2}{W}}\sum_{n=0}^{W-1} A_n \cos\!\left[\frac{\pi}{W}\left(n+\frac{1}{2}\right)k\right], \qquad k = 0, \ldots, W-1 \quad (1)$$

and

$$A_n = \sum_{k=0}^{W-1} s(k)\sqrt{\frac{2}{W}}\, CT_k \cos\!\left[\frac{\pi}{W}\left(n+\frac{1}{2}\right)k\right], \qquad n = 0, \ldots, W-1 \quad (2)$$

where s(k) = 1/√2 for k = 0 and s(k) = 1 for 1 ≤ k ≤ W − 1.
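Equation (1) is the familiar orthonormal DCT-II, so a direct implementation can be checked against a standard fast routine. The sketch below assumes that SciPy's scipy.fft.dct with norm='ortho' corresponds to the scaling in Eq. (1):

```python
import numpy as np
from scipy.fft import dct

def dct_coefficients(A):
    """Direct implementation of Eq. (1) for a 1-D signal A of length W."""
    W = len(A)
    n = np.arange(W)
    s = np.full(W, 1.0)
    s[0] = 1.0 / np.sqrt(2.0)
    return np.array([s[k] * np.sqrt(2.0 / W) *
                     np.sum(A * np.cos(np.pi / W * (n + 0.5) * k))
                     for k in range(W)])

A = np.random.default_rng(1).random(64)
# matches SciPy's orthonormal DCT-II, which can serve as the fast variant
assert np.allclose(dct_coefficients(A), dct(A, type=2, norm='ortho'))
```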
3 Extraction of Region of Interest and Enhancement
The process starts with preprocessing and extracting the central region of interest (ROI) from the acquired palm-print images. The contrast of the texture and principal lines of the extracted ROI, which play an important role in a palm-print based recognition system, is then improved. The enhanced ROI is segmented into overlapping circular strips, and features are generated from the difference of DCT coefficients of adjacent strips in the feature extraction module. The Hamming distance is used to match the feature vectors of live and enrolled palm-print images in the database. The block diagram of the proposed palm-print based recognition system using the variation of local DCT coefficients is given in Fig. 1. This paper uses the palm-print database of the Polytechnic University of Hong Kong (PolyU) [26], which consists of 7,752 images obtained from 193 users corresponding to 386 different palms. Around 20 samples per palm were collected in two sessions over a period of two months. The images were collected using a CCD-based device [1] at a spatial resolution of 75 dots per inch and 256 gray levels. The light source and camera focus were also changed to test the robustness of any technique. Images are captured with pegs employed to restrict rotation and translation. Though restricted, a small amount of rotation and translation is still found among the intra-class images. Fig. 2a shows a sample image from the database.
Fig. 2. (a) Sample Palm-print Image (b) Binarized Image (c) Palm-print Contour and Reference Points (d) Relevant Points and Region of Interest (e) Region of Interest in gray-scale palm-print image (f) Extracted Region of Interest
Since the background and the palm-print are of contrasting colour, the palm-print image is binarised using global thresholding. The binarised palm-print image is shown in Fig. 2b. The contour of the palm-print image is obtained by using a contour tracing algorithm [31] on the binarised palm-print image. Two valley points (V1, V2) between the fingers are detected on the contour of the hand image, as shown in Fig. 2c. The square area shown in Fig. 2d, with two of its adjacent corners coinciding with the mid-points M1 and M2 of the line segments V1−P1 and V2−P2 respectively, is considered as the ROI for feature extraction. The line segments V1−P1 and V2−P2 are each inclined at an angle of 50° to the line joining V1−V2. Empirically, it is observed that line segments V1−P1 and V2−P2 inclined at 50° extract a larger ROI while keeping the extracted ROI within the palm-print region. Fig. 2e shows the ROI in the gray-scale palm-print image, and the extracted region of interest in gray scale is shown in Fig. 2f. Since the extracted ROI is defined relative to the valley points between the fingers, which are stable for a user, the extracted ROI for the same user remains unchanged under different orientations of placement. Hence the proposed ROI extraction procedure makes the system robust to rotation. The extracted ROI/palm-print¹ varies in size from user to user, because the area of the palm is different for each user. Experimentally, it is found that the height of the extracted ROI ranges from 170 to 180 pixels. The extracted ROI is larger than the 128 × 128 palm-print
¹ Palm-print and ROI are used interchangeably.
Fig. 3. (a) Extracted Palm-print. (b) Estimated Coarse Reflection. (c) Uniform Brightness Palm-print Image. (d) Enhanced Palm-print Image.
extracted in [1]. The larger the extracted area, the more information about the user it carries, so there is the possibility of extracting more discriminative features from the extracted ROI. The proposed feature extraction technique needs the same number of features for matching, so the extracted ROI images are normalized to exactly the same size of 176 × 176 pixels. Due to the position of the light source and non-uniform reflection from the relatively curved surface of the palm, the extracted palm-print/ROI does not have uniform brightness. To obtain a well-distributed texture image, the following operations, as described in [21], are applied to the extracted palm-print. The palm-print is divided into 16 × 16 sub-blocks and the mean of each sub-block is computed, which provides an estimate of the background reflection. The estimated background reflection is expanded using bi-cubic interpolation to the original size of the palm-print image. Fig. 3b shows the estimated coarse reflection of Fig. 3a. The estimated reflection is subtracted from the original image to compensate for the variation of brightness across the image; as a result, a uniform-brightness or illumination-corrected palm-print image is obtained. The uniform-brightness image of Fig. 3a is shown in Fig. 3c. Further, the contrast of the illumination-corrected image is enhanced using histogram equalization on 16 × 16 blocks. The enhanced palm-print is shown in Fig. 3d.
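As an illustration of the enhancement step, the following OpenCV/NumPy sketch follows the description above — sub-block means as a coarse reflection estimate, bicubic expansion, subtraction, then block-wise histogram equalisation. It assumes 16 × 16-pixel sub-blocks on a 176 × 176 uint8 ROI; the paper does not fix these implementation details, so treat this as one plausible reading rather than the authors' code.

```python
import numpy as np
import cv2

def enhance_palmprint(roi, block=16):
    """Brightness correction and contrast enhancement of an extracted ROI."""
    h, w = roi.shape
    # coarse reflection estimate: mean of every 16x16 sub-block ...
    means = roi.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    # ... expanded back to full size with bicubic interpolation
    reflection = cv2.resize(means.astype(np.float32), (w, h), interpolation=cv2.INTER_CUBIC)
    # subtract the estimate to even out brightness, then rescale to 8 bits
    uniform = roi.astype(np.float32) - reflection
    uniform = cv2.normalize(uniform, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # block-wise histogram equalisation to boost texture contrast
    out = uniform.copy()
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = np.ascontiguousarray(uniform[y:y + block, x:x + block])
            out[y:y + block, x:x + block] = cv2.equalizeHist(tile)
    return out

roi = (np.random.default_rng(2).random((176, 176)) * 255).astype(np.uint8)  # stand-in ROI
enhanced = enhance_palmprint(roi)
```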
Fig. 4. Basic Structure of the Proposed Feature Extraction
Fig. 5. Overlapping Circular Strips and Other Parameters
Fig. 6. Illustrating the Steps Involved in Generating the Feature Vectors of Extracted and Normalised Palm-print (circular strip → average → Hanning window → DCT → DCT coefficient difference → feature vector)
4 Feature Extraction
In this section the proposed feature extraction technique is discussed in detail. The objective of any palm-print feature extraction technique is to obtain good inter-class separation in minimum time. Features are obtained from the extracted, normalised and enhanced palm-print for later verification or identification. In the proposed system, features are extracted using a non-conventional pattern analysis method: the local variation of DCT coefficients is used as a novel means to extract features from the palm-print. The extracted and enhanced palm-print is segmented into overlapping circular strips as described in [18]. A circular strip is the circular region between inner radius Ri and outer radius Ro, as shown in Fig. 4, and is the basic structure used to extract features of the palm-print in the proposed system. Fig. 5 shows the schematic diagram for segmenting the palm-print and the other parameters of the proposed palm-print feature extraction. The reason behind choosing a circular strip is that uniform weight is given to all directions (horizontal, vertical and diagonal). For example, if a rectangular structure is chosen, the weights of the horizontal and vertical directions are different; further, it increases the number of parameters (length, breadth, orientation of the rectangle) to be optimized.

Segmented Circular Strip Coding. This sub-section discusses the process of obtaining the binary feature vectors from the overlapping segmented circular strips. The steps involved in generating the feature vectors from the extracted circular strips are illustrated in Fig. 6. The segmented circular strip is averaged across its radial direction, as defined in Eq. 3, to obtain a 1D intensity signal A; this also reduces noise.
$$A(\theta) = \sum_{r=R_i}^{R_o} I(x_p, y_p), \qquad 0^\circ \le \theta \le 360^\circ \quad (3)$$

$$x_p = x_c + r \times \cos(\theta) \quad (4)$$

$$y_p = y_c + r \times \sin(\theta) \quad (5)$$
where I is the palm-print image, xc and yc are the center co-ordinates of the circular strip, and θ is the radial direction. The radial direction θ is uniformly varied to obtain the 1D intensity signal A = (A1, A2, ..., AW) of length W. In this case W = 2 × π × Ro is considered. The obtained 1D intensity signal A is windowed using a Hanning window and then subjected to the DCT to obtain coefficients CTk as defined in Eq. 1. The windowing is applied prior to the DCT in order to reduce spectral leakage. The magnitude differences of DCT coefficients from vertically adjacent strips are computed and the obtained results are then binarised using zero crossing. The resulting W-bit binary vector is considered as a sub-feature vector, and combining the sub-feature vectors from all the vertically adjacent strips forms the feature vector of the palm-print. These W-bit sub-feature vectors form the basis of the feature vector for the matching process.
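A compact sketch of this strip coding is given below. It is not the authors' implementation: the pixel sampling, the zero-crossing binarisation of the DCT magnitude difference, and the default parameters (Ri = 15, Ro = 20, vertical inter-strip distance 30, which are the tuned values reported in Section 6.1) are one plausible reading of the text.

```python
import numpy as np
from scipy.fft import dct

def strip_signal(img, xc, yc, Ri=15, Ro=20):
    """Radially averaged intensity of a circular strip, Eqs. (3)-(5)."""
    W = int(round(2 * np.pi * Ro))                      # W angular samples, as in the text
    theta = np.linspace(0.0, 2 * np.pi, W, endpoint=False)
    A = np.zeros(W)
    for r in range(Ri, Ro + 1):
        xp = np.clip(np.round(xc + r * np.cos(theta)).astype(int), 0, img.shape[1] - 1)
        yp = np.clip(np.round(yc + r * np.sin(theta)).astype(int), 0, img.shape[0] - 1)
        A += img[yp, xp]
    return A

def strip_code(img, xc, yc, dy=30, Ri=15, Ro=20):
    """W-bit sub-feature: sign of the DCT magnitude difference between two
    vertically adjacent strips (Hanning-windowed before the DCT)."""
    win = np.hanning(int(round(2 * np.pi * Ro)))
    c1 = dct(strip_signal(img, xc, yc, Ri, Ro) * win, norm='ortho')
    c2 = dct(strip_signal(img, xc, yc + dy, Ri, Ro) * win, norm='ortho')
    return (np.abs(c1) - np.abs(c2) > 0).astype(np.uint8)

img = np.random.default_rng(3).random((176, 176))       # stand-in for an enhanced ROI
sub_feature = strip_code(img, 60, 40)                    # one W-bit sub-feature vector
```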
5 Matching
The sub-feature vectors extracted from the test palm-print Test are matched with the sub-feature vectors of the enrolled palm-print Enrol stored in the database for verification or identification. The Hamming distance (HD) is used to measure the similarity as follows:

$$HD = \frac{\displaystyle\sum_{i=1}^{M}\sum_{j=1}^{W} Test_{i,j} \oplus Enrol_{i,j}}{M \times W} \quad (6)$$
where M is the total number of sub-features and W is the number of bits in a sub-feature.
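With the features stored as M × W binary arrays, Eq. (6) amounts to the fraction of differing bits, for example:

```python
import numpy as np

def hamming_distance(test_bits, enrol_bits):
    """Eq. (6): fraction of differing bits over all M sub-features of W bits each."""
    return np.count_nonzero(test_bits ^ enrol_bits) / test_bits.size

# toy check: identical codes give 0, complementary codes give 1
code = np.random.default_rng(4).integers(0, 2, size=(5, 126), dtype=np.uint8)
print(hamming_distance(code, code), hamming_distance(code, 1 - code))
```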
6 Experiments and Results
The proposed system is tested by measuring the Correct Recognition Rate (CRR) and the Equal Error Rate (EER) for identification and verification modes respectively. The CRR of the system is defined as follows:

$$CRR = \frac{N1}{N2} \times 100 \quad (7)$$
where N1 denotes the number of correct (non-false) recognitions of palm-print images and N2 is the total number of palm-print images in the testing set. At a given threshold, the probability of accepting an imposter, known as the False Acceptance Rate (FAR), and the probability of rejecting a genuine user, known as the
False Rejection Rate (FRR), are obtained. The Equal Error Rate (EER), which is defined as the point where FAR = FRR, is considered the important measure for verification. The database is divided into a training set and a testing set for evaluating the performance. Six images per palm are arbitrarily chosen for training and the remaining ten images are used for testing.

6.1 Performance and Tuning Parameters
This sub-section discusses the method of determining various parameters for the best performance of the system for both identification and verification. Since the system has several parameters, such as the outer radius Ro and inner radius Ri of the circular strip and the horizontal and vertical distances between the centers of adjacent circular strips, it is infeasible to test the system for all possible values of all four parameters. So the parameters are hierarchically tuned for optimal values and the best performance is obtained.

Inner Radius and Outer Radius of Circular Strip: Since the circular strip is the basic structure of the proposed system, the EER and CRR of the proposed system are first computed for non-overlapping circular strips (vertical and horizontal distance equal to the outer radius Ro) for various outer radii Ro and inner radii Ri of the strip. Experiments are conducted to obtain the CRR and EER.

Table 1. CRR and EER for tuned Outer Radius Ro and Inner Radius Ri of Circular Strip
(Rows of the original table: Outer Radius Ro = 15–25; columns: Inner Radius Ri = 11–20; each cell gives CRR (%) followed by EER (%). Each data line below corresponds to one Inner Radius Ri, from 11 to 20 in order, and lists its CRR/EER pairs for Outer Radius Ro increasing from max(15, Ri + 2) to 25.)
99.000 3.0225 99.667 3.0586 99.333 4.1126 99.000 3.1126 99.010 2.3784 99.667 2.3649 98.333 3.482 98.667 3.8874 99.333 3.3378 100.00 2.5901 99.333 2.7117
99.667 3.4865 99.667 2.4009 99.000 3.7207 100.00 3.0676 99.333 2.6802 99.000 2.0541 97.667 3.7027 98.00 4.2793 99.667 3.5721 99.333 3.1892 100.00 3.0135
99.667 3.8108 99.667 2.0315 100.00 4.0991 99.000 3.1712 100.00 2.4234 100.00 2.6036 98.732 4.3964 98.667 4.2207 99.136 3.4099 100.00 2.9279 99.667 2.8378
99.333 2.3874 99.000 3.1441 99.667 2.7432 100.00 1.6139 100.00 1.84756 98.667 1.9414 100.00 3.3604 99.841 3.9324 100.00 2.4459 99.531 2.8243
98.667 3.3739 98.667 3.1982 100.00 1.81158 100.00 1.3468 98.333 2.4144 98.333 3.0901 100.00 3.3874 100.00 3.536 99.759 2.4775
99.333 3.0901 100.00 2.6937 99.571 1.3468 98.333 1.5135 100.00 4.5135 99.333 4.0631 99.667 2.6847 99.333 2.8378
99.667 2.8763 99.333 2.4459 98.667 3.8423 98.667 3.7207 100.00 3.3919 100.00 3.1937 99.333 3.4955
98.667 3.0901 99.000 3.9054 98.00 4.018 99.333 3.8243 100.00 3.3874 99.672 3.1036
99.333 3.8063 99.00 4.0541 99.589 3.6982 99.333 3.1577 98.667 3.0039
99.333 3.4685 99.333 3.4054 99.333 3.036 98.333 2.8108
From Table 1 it can be observed that the system performs best (i.e., at the maximum CRR of 100% and the minimum EER of 1.34%) for an outer radius Ro = 20 along with an inner radius Ri = 15. This best-performing outer and inner radius pair is taken as the fundamental dimensions of the circular strip for further exploration of the remaining parameters.

Inter-Strip Distance: The feature vectors are generated from the difference of DCT coefficients of vertically adjacent circular strips. If the inter-strip distance is small, more feature vectors are obtained. Further, if the overlap is large, the similarity between the circular strips is also very high and the generated feature vectors would be similar; thus the feature vectors would not be well distributed and the performance would be affected. If the inter-strip distance between circular strips increases, fewer feature vectors are obtained, which would be insufficient to discriminate one user from the others. So experiments are conducted by varying the vertical and horizontal inter-strip distances, keeping the fundamental dimensions (outer radius and inner radius) of the circular strip obtained above. The CRR and EER for different vertical and horizontal inter-strip distances are shown in Table 2. It can be observed that the proposed system performs best for a vertical inter-strip distance of 30 in combination with a horizontal inter-strip distance of 15.

6.2 Results
Table 2. CRR and EER for Tuned Vertical and Horizontal Distance between the Strips
(Rows of the original table: Vertical Distance = 25–34; columns: Horizontal Distance = 11–20; each cell gives CRR (%) followed by EER (%). Each data line below corresponds to one Horizontal Distance, from 11 to 20 in order, and lists its CRR/EER pairs for Vertical Distance increasing from 25 to 34.)
99.667 1.0183 100.00 1.0045 99.333 1.0092 99.339 1.0135 99.667 1.0580 100.00 1.0225 99.667 1.0082 99.667 1.027 99.667 1.8465 100.00 1.0096
99.667 2.7600 99.667 1.6937 99.333 0.0225 99.617 0.6847 100.00 1.7252 100.00 1.0365 99.667 0.7027 99.667 0.5541 99.667 1.3514 99.667 1.0180
99.333 2.0181 99.333 1.2342 99.667 0.6712 99.627 1.0225 99.667 0.0185 99.667 0.0225 99.667 0.0362 99.667 1.9129 99.667 0.3423 99.333 0.0271
99.667 1.0454 99.333 1.0360 100.00 0.0360 100.00 0.0958 99.667 0.0097 100.00 0.0075 99.667 0.0368 100.00 0.06757 99.333 1.3919 99.333 1.07973
99.667 1.0279 99.333 0.9865 99.333 0.2438 99.668 0.0684 100.00 0.0151 100.00 0.0073 100.00 0.02847 100.00 0.0197 99.667 0.0586 99.667 0.0676
99.333 1.3423 99.667 0.3333 100.00 0.0992 99.662 0.0400 100.00 0.0666 99.333 0.00925 100.00 0.01873 99.333 0.0765 99.667 0.0702 99.667 1.6937
99.667 2.136 99.333 1.6712 99.333 0.6802 99.667 0.7973 100.00 0.4009 100.00 0.0405 99.667 1.0368 99.667 0.6802 99.667 0.0366 99.667 0.9144
99.667 1.3468 100.00 0.7973 99.667 0.3333 99.284 0.6667 99.667 0.3919 99.667 0.0721 99.333 0.0405 99.667 1.1126 99.667 1.0185 99.619 0.4009
99.667 1.3829 99.333 1.3739 99.333 1.0405 99.721 1.0315 99.333 1.0991 100.00 1.0315 99.333 1.0676 99.667 1.7793 99.667 1.9189 99.333 1.6757
99.333 1.3559 100.00 2.0225 98.667 1.5094 98.667 1.1396 99.000 1.9054 99.333 2.0272 99.333 0.9459 99.000 2.5090 99.667 1.4189 99.333 1.7207
Fig. 7. ROC Curves of Proposed, Palm Code [1] and Ordinal-Code [2] (FRR (%) versus FAR (%))
The optimal values for the inner radius, outer radius, and inter-strip distances (i.e., the vertical and horizontal distances between adjacent circular strips) for the best performance in both identification and verification modes of the proposed system are found to be 20, 15, 30 and 15 respectively. The system proposed in [16] uses the DCT of a local region and then considers the variance of the DCT coefficients of the local region as the features. In [15] the DCT has been used along with PCA and a Radial Basis Function Neural Network for feature extraction, and it has been evaluated on a non-public dataset. As mentioned in [2], the performance of ordinal code [2] is better than that of [15], [16], competitive code [23] and fusion code [22]. Hence the ordinal code [2] based system is used for comparisons. Table 3 summarises the CRR and EER of the proposed system with optimal parameters and of the best known systems proposed in palm code [1] and ordinal code [2]. It can be observed from Table 3 that the proposed system has a lower EER and higher CRR compared to the best known systems [1] and [2]. The Receiver Operating Characteristic (ROC) curve, which is used as a graphical depiction of the performance of the system in verification mode, is generated by plotting FAR against FRR for different values of the threshold. Fig. 7 shows the ROC curves of the proposed system and of the best known systems [1] and [2]. It can be observed that the proposed system has a lower FRR at low FAR compared to [1] and [2]. Hence the proposed system can be considered better than the best known systems [1], [2].

Table 3. Performance of Proposed System, [1] and [2]

System             CRR (%)    EER (%)
Proposed           100.000    0.0073
Palm-Code [1]      99.9168    0.5730
Ordinal-Code [2]   99.9743    0.0838
Table 4. Speed Comparison

Time              Proposed    Palm-Code [1]   Ordinal-Code [2]
Extraction (ms)   156.4912    54.4166         155.1370
Matching (ms)     0.3988      7.2693          20.5623
Total (ms)        156.8900    61.68593        175.6994

6.3 Speed
The proposed system and the best known systems [1] [2] are implemented in MATLAB. The systems are evaluated on an octa-core (8 × 2.5 GHz) workstation with 8 GB RAM. The times taken to extract and to match features for these systems are shown in Table 4. It is observed that the time required for feature extraction for the proposed system and [2] remains almost the same. However, the time required for matching in the proposed system is very low compared to that of the systems in [1] [2].
7 Conclusions
In this paper, a technique to extract features of the palm-print has been proposed and evaluated for identification and verification. The feature extraction technique is based on the variation of DCT coefficients of adjacent strips, and the Hamming distance is used as the classifier. The system is tested using the largest known public PolyU database of 7,752 hand images. A procedure to extract the palm-print with respect to the stable valley points of the fingers has been proposed, which can extract a larger area compared to the preprocessing technique in [1]. The variation of brightness is corrected and the texture of the palm-print is improved using a coarse estimate of the reflection and histogram equalization. The enhanced palm-print is found to have well-distributed texture and enhanced contrast. The optimum parameters of the proposed system are determined experimentally, and a CRR of 100% is achieved along with an EER of 0.0073%. The system is found to outperform the state-of-the-art systems in the literature. Experimentally, it is observed that the proposed system has a very low execution time for matching. Thus the CRR, EER and speed make the system suitable for on-line and high-end applications.
References
1. Zhang, D., Kong, A.W., You, J., Wong, M.: Online palmprint identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1041–1050 (2003)
2. Sun, Z., Tan, T., Wang, Y., Li, S.: Ordinal palmprint representation for personal identification. In: International Conference on Computer Vision and Pattern Recognition, vol. I, pp. 279–284 (2005)
3. Jain, A., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Trans. on Circuits and Systems for Video Technology 14, 4–20 (2004)
4. Shu, W., Zhang, D.: Automated personal identification by palmprint. Optical Engineering 37, 2359–2362 (1998)
5. Zhang, D.D.: Palmprint Authentication. Kluwer Academic Publishers, Dordrecht (2004)
6. Kong, A., Zhang, D., Lu, G.: A study of identical twins' palmprints for personal authentication. Pattern Recognition 39, 2149–2156 (2006)
7. Chen, J., Zhang, C., Rong, G.: Palmprint recognition using crease. In: International Conference on Information Processing, pp. 234–237 (2001)
8. Han, C., Cheng, H., Lin, C., Fan, K.: Personal authentication using palm-print features. Pattern Recognition 36, 371–381 (2003)
9. Jia, W., Huang, D., Zhang, D.: Palmprint verification based on robust line orientation code. Pattern Recognition 41, 1521–1530 (2008)
10. Zhang, D., Shu, W.: Two novel characteristics in palmprint verification: datum point invariance and line feature matching. Pattern Recognition 32, 691–702 (1999)
11. Wang, X., Gong, H., Zhang, H., Li, B., Zhuang, Z.: Palmprint identification using boosting local binary pattern. In: Intl. Conference on Pattern Recognition, pp. 503–506 (2006)
12. Wenxin, L., Zhang, D., Zhuoqun, X.: Palmprint identification by Fourier transform. International Journal of Pattern Recognition and Artificial Intelligence 16, 417–432 (2002)
13. Connie, T., Teoh, A., Ong, M., Ngo, D.: An automated palmprint recognition system. Image and Vision Computing 23, 501–515 (2005)
14. Ribaric, S., Fratric, I.: A biometric identification system based on eigenpalm and eigenfinger features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1698–1709 (2005)
15. Yu, P.F., Xu, D.: Palmprint recognition based on modified DCT features and RBF neural network, vol. 5, pp. 2982–2986 (2008)
16. Dale, M., Joshi, M., Gilda, N.: Texture based palmprint identification using DCT features, pp. 221–224 (2009)
17. Zhang, L., Zhang, D.: Characterization of palmprints by wavelet signatures via directional context modeling. Systems, Man, and Cybernetics (34), 1335–1347
18. Badrinath, G.S., Gupta, P.: Stockwell transform based palm-print recognition. Applied Soft Computing (in press)
19. Wu, X., Zhang, D., Wang, K.: Fisherpalms based palmprint recognition. Pattern Recognition Letters 24, 2829–2938 (2003)
20. Badrinath, G.S., Gupta, P.: Robust biometric system using palmprint for personal verification. In: International Conference on Biometrics, pp. 554–565 (2009)
21. Badrinath, G.S., Kachhi, N., Gupta, P.: Verification system robust to occlusion using low-order Zernike moments of palmprint sub-images. In: Telecommunication Systems, pp. 1–16 (2010)
22. Kong, A., Zhang, D.: Feature-level fusion for effective palmprint authentication. In: International Conference on Biometrics, pp. 761–767 (2004)
23. Kong, A., Zhang, D.: Competitive coding scheme for palmprint verification. In: International Conference on Pattern Recognition, vol. I, pp. 520–523 (2004)
24. Kumar, A., Zhang, D.: Personal recognition using hand shape and texture. IEEE Transactions on Image Processing 15, 2454–2461 (2006)
25. Jing, X., Zhang, D.: A face and palmprint recognition approach based on discriminant DCT feature extraction. Systems, Man, and Cybernetics-B 34 (2004)
26. The PolyU palmprint database, http://www.comp.polyu.edu.hk/~biometrics
27. Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Trans. Computers 23, 90–93 (1974)
28. Feig, E., Winograd, S.: Fast algorithms for the discrete cosine transform. IEEE Transactions on Signal Processing 40, 2174–2193 (1992)
29. Hafed, Z.M., Levine, M.D.: Face recognition using the discrete cosine transform. International Journal of Computer Vision 43, 167–188 (2001)
30. Monro, D., Rakshit, S., Zhang, D.: DCT-based iris recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 586–595 (2007)
31. Pavlidis, T.: Algorithms for Graphics and Image Processing. Computer Science Press, Rockville (1982)
Real-Time Robust Image Feature Description and Matching Stephen J. Thomas, Bruce A. MacDonald, and Karl A. Stol Faculty of Engineering, The University of Auckland, Auckland 1142, New Zealand
Abstract. The problem of finding corresponding points between images of the same scene is at the heart of many computer vision problems. In this paper we present a real-time approach to finding correspondences under changes in scale, rotation, viewpoint and illumination using Simple Circular Accelerated Robust Features (SCARF). Prominent descriptors such as SIFT and SURF find robust correspondences, but at a computation cost that limits the number of points that can be handled on low-memory, low-power devices. Like SURF, SCARF is based on Haar wavelets. However, SCARF employs a novel non-uniform sampling distribution, structure, and matching technique that provides computation times comparable to the state-of-the-art without compromising distinctiveness and robustness. Computing 512 SCARF descriptors takes 12.6ms on a 2.4GHz processor, and each descriptor occupies just 60 bytes. Therefore the descriptor is ideal for real-time applications which are implemented on low-memory, low-power devices.
1 Introduction
The problem of finding corresponding points between images is a critical component of a wide range of computer vision applications such as object and place recognition, and camera motion and 3D structure estimation. Many of these applications are subject to strict real-time constraints, and consequently the correspondence problem must be solved in real-time. While many real-time solutions to the problem have been proposed, very few are able to cope with a large number of points or handle significant changes in viewpoint [1][2]. The task of finding corresponding points between images can be divided into three main steps: detection, description and matching. Firstly, a detector selects keypoints in the image which are both stable (consistently selected for all images) and accurate (well localised for all images) [3]. Detectors such as FAST [4] perform this step efficiently, however they have poor stability under changes in scale. Lindeberg [5] introduced the concept of automatic scale selection using a scale-invariant detector which finds maxima (blobs) in a normalised Laplacian scale-space. The widely used SIFT [6] and SURF [7] detectors are based on this concept, but use approximations of the Laplacian in order to reduce computational complexity. The recently released star detector from Willow Garage1 is 1
¹ http://pr.willowgarage.com/wiki/Star_Detector
considerably faster than both the SIFT and SURF detectors, and uses rotated squares to approximate the Laplacian with very few computations. Once keypoints are detected, a compact, distinctive and robust descriptor is generated for each keypoint based on the pixels in its local neighbourhood [7]. These descriptors must be highly distinctive in order to differentiate between similar objects, yet sufficiently robust to enable matching in the presence of noise, localisation error, and changes in viewpoint and illumination [7]. The descriptor proposed in this paper builds on the strengths of the SURF descriptor, which performs remarkably well in both standardised tests and practical applications. However, the proposed descriptor can be computed significantly faster, making it possible to find correspondences between 800x640 pixel images at 15.9fps compared to 1.8fps for SURF on the same platform. While faster SURF implementations are available for multi-core processors and graphics processing units (GPUs), the proposed descriptor is designed to achieve real-time performance on (mobile) low-spec, single-core platforms with limited memory. The final step in finding corresponding points between images involves matching keypoints by computing a similarity metric between their descriptors and applying a threshold or some other heuristics [7]. In this step the numerical representation and dimension of the descriptors affects both the computational cost of matching and the distinctiveness of the descriptor [7]. In the remainder of this paper we describe a novel feature descriptor which can be computed considerably faster than state-of-the-art descriptors, without compromising distinctiveness and robustness.
2 Related Work
2.1 Scale Invariant Feature Transform (SIFT)
The SIFT descriptor is based on models of biological vision which suggest that humans classify objects by matching gradients that are loosely localised on the retina. SIFT attempts to emulate this model by constructing a 128 element vector of gradient magnitudes at quantised orientations and locations. The ‘loose localisation’ of gradients is implemented by centring a 4x4 grid at the keypoint origin, aligning the grid with respect to the keypoint orientation, and constructing an 8 bin gradient histogram from gradient samples within each grid space as illustrated in Fig. 2. To avoid sudden changes in the descriptor due to changes in keypoint position, the magnitude of all samples is weighted by a Gaussian centred at the origin. To minimise changes in the descriptor as samples shift from one grid space to another, or one histogram bin to another, trilinear interpolation is used to distribute the contribution of each sample appropriately. Finally, to obtain illumination invariance the descriptor is normalised to unit length [6]. A large number of authors have proposed refinements to the SIFT descriptor that improve robustness or reduce computation times. For example, Mikolajczyk [1] proposed a variant named GLOH that considers more spatial regions for the histograms. The higher dimensionality of the descriptor is reduced to 64 through principal components analysis.
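To make the grid-of-histograms idea concrete, the following is a minimal Python sketch of a SIFT-style descriptor built from a 16x16 gradient patch that has already been rotated to the keypoint orientation. It is an illustration only (the function name and patch size are assumptions, and trilinear interpolation is omitted for brevity), not Lowe's reference implementation.

import numpy as np

def sift_like_descriptor(mag, ang, keypoint_angle=0.0):
    """128-D SIFT-style vector from a 16x16 patch of gradient magnitudes
    `mag` and orientations `ang` (radians), pre-rotated to the keypoint."""
    assert mag.shape == (16, 16) and ang.shape == (16, 16)
    # Gaussian weighting centred on the patch, as described in Sec. 2.1
    ys, xs = np.mgrid[0:16, 0:16] - 7.5
    weight = np.exp(-(xs**2 + ys**2) / (2.0 * 8.0**2))
    wmag = mag * weight
    # 4x4 spatial grid, 8 orientation bins per grid space
    desc = np.zeros((4, 4, 8))
    rel = (ang - keypoint_angle) % (2 * np.pi)
    bins = (rel / (2 * np.pi) * 8).astype(int) % 8
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += wmag[y, x]
    v = desc.ravel()
    return v / (np.linalg.norm(v) + 1e-12)   # unit length for illumination invariance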
2.2 Speeded-Up Robust Features (SURF)
Like SIFT, SURF uses a distribution-based descriptor. However, the descriptor has only 64 elements and is based on first order Haar wavelets. The first step in the descriptor generation involves assigning a reproducible orientation to each keypoint based on Haar-wavelet responses within a circle of radius 6s around the keypoint, where s is the keypoint scale. The responses are computed at a sampling step of s, and each response is weighted by a Gaussian centred at the origin. The weighted responses are then interpreted as vectors and summed over a sliding window of width π/3 as illustrated in Fig. 1. The orientation of the largest resulting vector is assigned as the keypoint orientation [7].
Fig. 1. Orientation assignment: The Gaussian weighted Haar wavelet responses are summed within a sliding window of width π/3 to produce a local orientation vector. The orientation of the vector with the largest magnitude is assigned to the keypoint [7].
Like SIFT, the ‘loose localisation’ of samples is implemented by placing a 4 × 4 grid at the keypoint origin, aligning the grid with respect to the keypoint orientation, and generating one segment of the descriptor for each grid space as illustrated in Fig. 2 (left). However, unlike SIFT the segment of the descriptor computed for each grid space is the vector v = (Σdx, Σdy, Σ|dx|, Σ|dy|), where dx and dy are the x and y Haar responses rotated relative to the keypoint orientation as shown in Fig. 2 (right). Finally, the individual vectors are concatenated and illumination invariance is attained by normalising the resulting vector to unit length [7]. SURF’s speed up relative to SIFT is largely due to the use of integral images [8] and Haar wavelet responses rather than conventional gradients. Using integral images, the sum of the pixels in any rectangular region of an image can be computed with just four array references. Haar responses are essentially the difference between the sum of the pixels in two adjacent rectangles, and thus can be computed at any scale with just six array references. Integral images can be computed efficiently in a single pass over the original image, and are available as an output of the star detector at no additional computational cost.
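The integral-image arithmetic described above is standard and easy to sketch. The snippet below (Python, with illustrative function names) shows the single-pass construction, the four-reference box sum, and a horizontal Haar response assembled from two adjacent boxes.

import numpy as np

def integral_image(img):
    # single pass: cumulative sums along rows and columns
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in the inclusive rectangle [r0..r1] x [c0..c1]
    using four array references into the integral image `ii`."""
    A = ii[r0 - 1, c0 - 1] if (r0 > 0 and c0 > 0) else 0.0
    B = ii[r0 - 1, c1] if r0 > 0 else 0.0
    C = ii[r1, c0 - 1] if c0 > 0 else 0.0
    return ii[r1, c1] - B - C + A

def haar_x(ii, r, c, size):
    """Horizontal Haar response at (r, c): right box minus left box
    (even `size` assumed; corners are shared, six references in total)."""
    half = size // 2
    left = box_sum(ii, r, c, r + size - 1, c + half - 1)
    right = box_sum(ii, r, c + half, r + size - 1, c + size - 1)
    return right - left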
Fig. 2. SURF/SIFT structure: A 4 × 4 grid at the keypoint origin is aligned with the keypoint orientation. Samples within the grid are weighted by a Gaussian centred at the origin. Within each grid-space SURF computes 25 Haar wavelet responses in the x and y directions at regularly spaced points. The x and y responses are then interpolated to obtain dx and dy relative to the orientation. Finally, the sums Σdx, Σdy, Σ|dx| and Σ|dy| are computed for each grid space [7]. Within each grid-space SIFT computes 16 gradient magnitudes and orientations at regularly spaced points. An 8-bin gradient orientation histogram is then constructed for each of the grid spaces [6].
2.3 Compact Signatures
Calonder et al. [2] proposed a novel solution to the correspondence problem. The solution is based on Fern classifiers that are trained to recognise a large database of keypoints using binary tests. Previously unseen keypoints are then described in terms of each classifier’s response or ‘signature vector’. These signature vectors are long but sparse, and are compacted via multiplications with random projection matrices to obtain 176-byte compact signatures. Experimental results show that 512 compact signatures can be generated in 7.9ms and matched to another 512 keypoints using an exhaustive nearest-neighbour search in just 6.3ms on a 2.4 GHz processor [2]. These times represent the state-of-the-art for descriptor generation and matching on low-memory, low-power devices, thus Compact Signatures are used as a performance benchmark.
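The compaction step can be illustrated with a short sketch: a long, sparse classifier response is multiplied by a fixed random projection matrix and quantised to one byte per dimension. The dimensions and the quantisation rule below are assumptions for illustration and do not reproduce Calonder et al.'s actual implementation.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: a 5000-D sparse classifier response compressed to 176 bytes
D_SPARSE, D_COMPACT = 5000, 176
projection = rng.standard_normal((D_COMPACT, D_SPARSE)).astype(np.float32)

def compact_signature(sparse_response):
    """Project a long, sparse response onto a fixed random basis and
    quantise to one byte per dimension (illustrative quantisation)."""
    z = projection @ sparse_response              # dense 176-D vector
    z = (z - z.min()) / (z.max() - z.min() + 1e-12)
    return (z * 255).astype(np.uint8)             # 176-byte signature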
3 Simple Circular Accelerated Robust Features (SCARF)
The SCARF descriptor builds on the strengths of SURF in that it is based on Haar wavelet responses which are computed efficiently using integral images. SCARF utilises a novel non-uniform sample distribution, weighting scheme, structure, and numerical representation which together significantly reduce computational cost and memory requirements without impacting on distinctiveness and robustness. Consequently, the descriptor can be generated and matched exceptionally fast on low-spec, single-core platforms with limited memory.
3.1 Keypoint Orientation Estimation
The keypoint orientation step is fundamentally similar to SURF, but takes advantage of the fact that a slightly lower level of precision has negligible effect on robustness [7]. The keypoint orientation is estimated based on 80 Haar responses non-uniformly distributed within a circular neighbourhood of radius 6s around the keypoint as illustrated in Fig. 3. The 80 samples are weighted by a Gaussian (σ = 2s) centred at the keypoint origin using a compact lookup table.
Fig. 3. Top right quarter of the sampling map: Entries 0-71 identify which of the 72 sampling angles the sample position contributes to and - represents an unsampled position. Samples between the origin and red line (radius = 6s) are used to compute the keypoint orientation. Shading differentiates between samples in the central bin, inner ring and outer ring of the descriptor.
Histograms of the response magnitudes for different response angles are constructed as illustrated in Fig. 4. A bin width of 5° was chosen after studying the orientation variance for correct matches between image pairs which exhibit changes in scale, rotation and viewpoint. The x and y response magnitudes within a 12-bin sliding window are then summed, and the resulting sums are considered to represent a local orientation vector. The angle of the local orientation vector with the greatest magnitude is assigned as the keypoint orientation. Keypoints that do not have a single dominant orientation are identified by finding local orientation vectors with a magnitude within 90% of the maximum and a different orientation. If attempting to maximise the number of matches, multiple keypoints can be created at the same location with different orientations. However, these keypoints are almost always incorrectly matched, thus in some circumstances it may be better to discard these keypoints.
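The following Python sketch illustrates this orientation estimate: per-angle histograms of the weighted x and y responses, circular 12-bin sliding-window sums, and the 90% test for secondary orientations. Function and argument names are illustrative, not the authors' code.

import numpy as np

def dominant_orientations(dx, dy, angle_bins, n_bins=72, window=12, second_peak=0.9):
    """dx, dy: Gaussian-weighted Haar responses of the samples;
    angle_bins: 5-degree bin index (0..71) of each sample, as in Fig. 3."""
    hist_x = np.bincount(angle_bins, weights=dx, minlength=n_bins)
    hist_y = np.bincount(angle_bins, weights=dy, minlength=n_bins)
    # circular sliding-window sums over `window` consecutive bins
    idx = (np.arange(n_bins)[:, None] + np.arange(window)) % n_bins
    sx, sy = hist_x[idx].sum(axis=1), hist_y[idx].sum(axis=1)
    mag = np.hypot(sx, sy)
    best = int(np.argmax(mag))
    peaks = [np.arctan2(sy[best], sx[best])]
    # keypoints with several vectors within 90% of the maximum magnitude
    for i in np.flatnonzero(mag >= second_peak * mag[best]):
        if i != best:
            peaks.append(np.arctan2(sy[i], sx[i]))
    return peaks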
Fig. 4. Keypoint orientation estimation: Histograms of x and y response magnitude for different response angles. Σx and Σy are computed over a 12-bin sliding window. The magnitude of the local orientation vector at each window position is given by √((Σx)² + (Σy)²) and the orientation by atan2(Σy, Σx). The horizontal line at 90% of the maximum represents the threshold for considering whether multiple dominant orientations exist. The sliding window and corresponding values are shown for the local orientation vector with the greatest magnitude.
3.2 Descriptor Structure
The primary aim of this research was to develop a descriptor with comparable robustness to SURF which can be computed in real-time on low-spec, single-core platforms with limited memory. A descriptor was defined to have comparable robustness if its matching performance (recall) was no more than 5% lower than SURF. The majority of the descriptor computation time is attributed to gradient computation, thus the goal was also to minimise the number of gradient samples. The optimal solution according to these criteria was identified using a Matlab script which automatically generated C++ code for circular descriptors with a wide variety of internal divisions and non-uniform sampling distributions. The script considered circular descriptors with radii between 7s and 20s pixels (where s is the keypoint scale), between 2 and 5 divisions in the radial direction, and between 2 and 16 sectors per ring in the angular direction. In each case samples were weighted by a Gaussian centred at the origin and a normalisation factor to account for the varying number of samples per sector. Each descriptor was evaluated based on its recall-precision curves for a number of image pairs which exhibit changes in scale, rotation and viewpoint. A number of different image datasets were used and in each case the results were similar, suggesting the approach generalises well. Analysing the results it was evident that the descriptor illustrated in Fig. 5, which has three divisions in the radial direction with radii 3s, 8s and 15s and 1, 6 and 8 sectors respectively provided the best performance.
Fig. 5. SCARF structure: The descriptor is circular with three divisions in the radial direction at 3s, 8s and 15s. The central bin is not divided, the inner ring has 6 bins and the outer ring 8 bins. Samples within the bins are weighted by a Gaussian in the angular direction as illustrated by the dark to light gradient. The sample locations are non-uniformly distributed but roughly lie within a polar grid with 72 angles (sampling angles), 2 divisions in the inner ring and 3 divisions in the outer ring. The x and y point responses are interpolated to obtain dx and dy, such that positive values of dy point away from the center. The descriptor for each of the 15 bins is given by the sums Σdx, Σdy, Σ|dx| and Σ|dy| over the samples within the bin and a small overlap.
In the case of SIFT and SURF gradient samples are uniformly distributed with respect to the keypoint orientation, and consequently the sample positions must be re-computed for each keypoint. Therefore, to minimise computational cost it is desirable not only to minimise the number of gradients computed, but also to compute gradients at locations which are independent of the keypoints orientation. However, in order to employ a ‘fixed’ sampling distribution there must be an efficient method by which samples can be accessed depending on the keypoint orientation and descriptor divisions. Therefore, the non-uniform sampling distributions considered were generated using polar coordinates, such that samples could be automatically separated according to the radial divisions of the descriptor and their angular position relative to the keypoint orientation. The central bin of the descriptor was handled separately as it is not divided in the angular directions. The best sampling distribution used a sampling step of 2s for the central bin, 72 sampling angles, 2 samples per angle in the inner ring, and 3 samples per angle in the outer ring. A section of this sampling distribution is illustrated in Fig. 3, where the indices 0-71 indicate which of the 72 sampling angles the sample is associated with, and the shading represents the radial divisions. This sampling distribution consists of just 376 samples which are used for both descriptor generation and keypoint orientation estimation. When designing a feature descriptor it is important to avoid all boundary effects which occur as samples shift from one bin to another due to small changes in the keypoint orientation or position [6]. After investigating weighting schemes it was apparent that any weighting scheme applied in the radial direction had little impact on robustness, suggesting degradation in performance due to boundary
effects is primarily due to changes in keypoint orientation. Boundary effects in the angular direction were taken into account by a Gaussian weighting based on the sample’s angular distance from the sector center, and also an overlap with adjacent sectors as illustrated in Fig. 5. To obtain rotation invariance the gradients in the x and y image directions are rotated to obtain dx and dy such that each dy points away from the keypoint origin as illustrated in Fig. 5. Once the keypoint orientation is known, the angular weighting is applied to each dx and dy, and the descriptor vector v = (Σdx, Σdy, Σ|dx|, Σ|dy|) for each sector is obtained by summing over the appropriate angles. The complete descriptor is normalised to provide invariance to changes in contrast. At the normalisation stage each element of the vector is also quantised such that the individual values can be stored as a single byte in memory. This enables fast matching using single instruction multiple data (SIMD) extensions available on most processors which operate on up to 16 bytes per instruction. SIFT and SURF are loosely based upon a model of biological vision in which the brain recognises objects using complex neurons that respond to gradients of particular orientations and spatial frequencies detected by the photo-receptive cells on the retina. In this model the gradients do not have to be precisely localised on the retina as the complex neurons allow for their positions to shift over a small receptive field. SIFT and SURF attempt to emulate this model by grouping uniformly sampled gradients around a keypoint using a 4x4 square grid [6, 7]. However, the use of a uniform gradient distribution and grid based gradient grouping does not closely imitate the human vision system, in which photo-receptive cells are distributed non-uniformly, and the number of photo-receptive cells grouped together (connected to a single complex neuron) increases with distance from the focal point (fovea). In this respect the SCARF descriptor more closely imitates the human vision system.
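The quantisation and sum-of-absolute-differences matching described above can be sketched as follows; NumPy stands in for the SIMD instructions, and the scaling constant and distance threshold are illustrative choices.

import numpy as np

def quantise(desc, scale=127.0):
    """Map a unit-normalised float descriptor to one byte per element."""
    return np.clip(desc * scale + 128.0, 0, 255).astype(np.uint8)

def match_sad(descs_a, descs_b, max_dist=2000):
    """Exhaustive nearest-neighbour matching of byte descriptors using the
    sum of absolute differences (the operation SIMD units accelerate)."""
    a = descs_a.astype(np.int16)[:, None, :]   # (Na, 1, D)
    b = descs_b.astype(np.int16)[None, :, :]   # (1, Nb, D)
    sad = np.abs(a - b).sum(axis=2)            # (Na, Nb) distance table
    nn = sad.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nn) if sad[i, j] < max_dist]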
4 Experimental Evaluation
In this section we compare the matching performance and computation times of SCARF with SURF², and the state-of-the-art Compact Signatures descriptor³. Results are presented for four standard image sequences⁴: Boat, in which image scale and orientation are changed simultaneously by rotating the camera around its optical axis by 30-45 degrees and varying the camera zoom and focus by approximately 2-2.5 times. Wall, in which image viewpoint is changed by moving the camera from a fronto-parallel view to one with significant foreshortening at approximately 50-60 degrees. Light, in which images are darkened by reducing the camera aperture, and Trees in which images are blurred by defocussing [1].
² http://www.vision.ee.ethz.ch/~surf/
³ https://code.ros.org/svn/wg-ros-pkg/trunk/stacks/visual_feature_detectors/calonder_descriptor
⁴ http://www.robots.ox.ac.uk/~vgg/research/affine/
4.1 Matching Performance
Matching performance is evaluated using the standard test and recall-precision criterion used by Mikolajczyk [1] to compare the performance of local descriptors. Recall is the ratio between the number of correctly matched regions and the total number of corresponding regions between two images. Given a descriptor DA for region A in a reference image, a descriptor DB for a region B in the compared image, and the homography between the images H, the two regions are corresponding if the overlap error S = 1 − (A ∩ HᵀBH)/(A ∪ HᵀBH) is less than 0.5. The two regions are correctly matched if they are corresponding AND the distance between their descriptors is below a threshold t. The standard plots show recall vs. 1-precision for varying t, where 1-precision is the ratio between the number of incorrectly matched regions and the number of matched regions. Therefore, a good descriptor should have a high recall rate for low 1-precision. Fig. 6 shows robustness to changes in viewpoint evaluated using images 1 and 3 from the Wall image sequence. From this figure it is evident that SCARF fulfils the design requirement of having comparable matching performance to SURF. The Compact Signatures descriptor performed much worse than SURF in these standardised tests, despite the results shown in [2]. This may be due to differences between the implementation used in [2] and the authors’ publicly available implementation³. However, in the evaluation method of [2] the keypoint positions in each test image are obtained using the known geometric relationship between the images rather than using a keypoint detector. This has two significant effects on the results: firstly, scale-invariant descriptors are disadvantaged as each keypoint’s scale is not re-evaluated for each image, and secondly, the correspondences are ‘perfect’ so the effect of small variations in keypoint position is not reflected in the results. In addition, the threshold-based matching used in this paper and [1] better reflects the distribution of the descriptors in space and therefore provides a more accurate indication of the descriptors’ distinctiveness. Fig. 7 shows robustness to simultaneous changes in scale (zoom) and rotation evaluated using images 1 and 2 of the Boat dataset. In this case the SCARF and
Fig. 6. Recall-precision for changes in viewpoint
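For reference, the recall vs. 1-precision computation behind these plots can be sketched in a few lines of Python, assuming a precomputed descriptor distance table and a boolean table of ground-truth correspondences (overlap error below 0.5). Names are illustrative.

import numpy as np

def recall_1precision(dist, is_correspondence, thresholds):
    """Return (1-precision, recall) pairs for a distance table dist[i, j]
    and a boolean table marking ground-truth corresponding region pairs."""
    total_corresponding = is_correspondence.sum()
    curve = []
    for t in thresholds:
        matched = dist < t
        correct = np.logical_and(matched, is_correspondence).sum()
        n_matched = matched.sum()
        recall = correct / max(total_corresponding, 1)
        one_minus_precision = (n_matched - correct) / max(n_matched, 1)
        curve.append((one_minus_precision, recall))
    return curve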
Fig. 7. Recall-precision for changes in scale and rotation
Fig. 8. Recall-precision for increasing levels of image blur
SURF descriptors also perform comparably. However, the Compact Signatures descriptor produced so few correct matches that recall barely increases above zero. The performance of SCARF and SURF was comparable across all the images in the test sequences; however, the performance of Compact Signatures was observed to degrade rapidly with changes in scale. Fig. 8 shows robustness to image blur evaluated using images 1 and 4 of the Trees dataset. Once again the results show the comparable performance of the SCARF and SURF descriptors. Finally, Fig. 9 shows robustness to illumination changes evaluated using images 1 and 4 of the Light dataset. In this case SCARF clearly outperforms both the SURF and Compact Signatures descriptors. The SCARF descriptor was optimised to have comparable performance to SURF for changes in viewpoint, scale and rotation. Therefore, these results suggest that the optimisation was successful and generalises well to images which are not in the optimisation dataset. Robustness to changes in illumination was not considered in the optimisation as SCARF achieves invariance to changes in illumination and contrast using the same approach as SURF. Analysis suggests SCARF’s superior performance in this case is due to the novel structure, weighting scheme, sampling distribution and Haar wavelet sizes providing greater
Fig. 9. Recall-precision for changes in illumination

Table 1. Comparison of computation times

Descriptor                          Generation Time (ms)   Matching Time (ms)
SURF                                315.0                  210.0
SCARF                               12.6                   3.3
Compact Signatures                  7.9                    6.3
GPU SURF                            5.2                    6.8
P-SURF (Optimised & Multi-core)     15.2                   -
robustness to non-linear illumination changes. Non-linear illumination changes are present in the images due to changes in camera parameters, viewpoint-dependent illumination, and the camera's inherent non-linear mapping between the amount of light arriving at the imaging sensor and the pixel's intensity.
4.2 CPU Time and Memory Requirements
Table 1 summarises the descriptor generation and matching times for the first image of the Graffiti dataset (800 × 640 pixels). All times are given for a 2.4 GHz single-core processor except GPU SURF [9] (GeForce Go 7950 GTX GPU) and P-SURF [10] (2.4 GHz dual-core processor). Descriptor generation times are for 512 keypoints, and matching times for an exhaustive 512×512 nearest-neighbour search. Comparing computation times, the SCARF descriptor is approximately 25 times faster than SURF, and 17% faster than the highly optimised P-SURF running on a dual-core processor. The SCARF matching time is also significantly lower due to the use of a byte-based descriptor, and descriptor comparison based on the sum of absolute differences using SIMD instructions. The impact of ‘quantising’ the SURF descriptor and applying the same matching optimisations was also investigated and results indicate that degradation in recall is less than 5%. While GPU-SURF descriptor generation is faster, the implementation requires graphics cards typically available only on high-spec desktop platforms.
5 Conclusions
We have proposed a novel keypoint descriptor which can be generated and matched faster and with less memory than many state-of-the-art methods. The descriptor builds on the strengths of SURF in that it is based on Haar wavelets and exploits integral images for speed. The descriptor employs a novel non-uniform sampling distribution, weighting scheme, structure, numerical representation, and matching technique that provides computation times comparable to the state-of-the-art without compromising distinctiveness and robustness. The descriptor is ideal for real-time applications such as structure and motion estimation which are implemented on low-memory, low-power mobile devices. The script developed for generating and testing descriptors can also be used to ‘optimise’ the descriptor according to other criteria such as maximising recall.
References
1. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. PAMI 27, 1615–1630 (2005)
2. Calonder, M., Lepetit, V., Konolige, K., Bowman, J., Mihelich, P., Fua, P.: Compact signatures for high-speed interest point description and matching. In: ICCV (2009)
3. Agrawal, M., Konolige, K., Blas, M.R.: CenSurE: Center surround extremas for realtime feature detection and matching. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 102–115. Springer, Heidelberg (2008)
4. Rosten, E., Porter, R., Drummond, T.: Faster and better: A machine learning approach to corner detection. PAMI (2009)
5. Lindeberg, T.: Feature detection with automatic scale selection. IJCV 30, 79–116 (1998)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004)
7. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
8. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR, vol. 1, pp. 511–518 (2001)
9. Cornelis, N., Van Gool, L.: Fast scale invariant feature detection and matching on programmable graphics hardware. In: CVPR Workshop, pp. 1–8 (2008)
10. Zhang, N.: Computing parallel speeded-up robust features (P-SURF) via POSIX threads. In: ICIC, pp. 287–296 (2009)
A Biologically-Inspired Theory for Non-axiomatic Parametric Curve Completion
Guy Ben-Yosef and Ohad Ben-Shahar
Computer Science Department, Ben-Gurion University, Beer-Sheva, Israel
Abstract. Visual curve completion is typically handled in an axiomatic fashion where the shape of the sought-after completed curve follows formal descriptions of desired, image-based perceptual properties (e.g., minimum curvature, roundedness, etc.). Unfortunately, however, these desired properties are still a matter of debate in the perceptual literature. Instead of the image plane, here we study the problem in the mathematical space R² × S¹ that abstracts the cortical areas where curve completion occurs. In this space one can apply basic principles from which perceptual properties in the image plane are derived rather than imposed. In particular, we show how a “least action” principle in R² × S¹ entails many perceptual properties which have support in the perceptual curve completion literature. We formalize this principle in a variational framework for general parametric curves, we derive its differential properties, we present numerical solutions, and we show results on a variety of images.
1 Introduction
Although the visual field is often fragmented due to occlusion, our perceptual experience is one of whole and complete objects. This phenomenon of amodal completion, as well as its counterpart process of modal completion (where the object is illusory), are a fundamental part of perceptual organization and visual processing [17], and they have been studied by all of the perceptual, neurophysiological, and computational vision communities. Inspired by this interdisciplinary nature of visual completion research, in this work we present a new and rigorous mathematical curve completion theory, which is motivated by neurophysiological findings and at the same time uses only basic principles to provide explanations and predictions for existing perceptual evidence. The problem of curve completion usually assumes that the completed curve is induced by two oriented line segments, also known as the inducers, and hence it is typically formulated as follows:
Problem 1. Given the position and orientation of two inducers p0 = [x0, y0; θ0] and p1 = [x1, y1; θ1] in the image plane, find the shape of the “correct” curve that passes between these inducers.
While “correctness” of the completion and hence the problem as a whole are ill defined, it is typical to aspire for completions that agree the most with perceptual and neurophysiological evidence, usually by the application of additional
constraints. Unfortunately, however, while different constraints have been proposed, they were inspired more by intuition and mathematical elegance, and less by perceptual findings or neurophysiological principles. In the next section we first review some of the relevant computational models, the axiomatic approach which often motivates them, and the extent to which they agree with existing perceptual and neurophysiological evidence.
2 Previous Work
Many computational curve completion studies, including Ullman’s [22] seminal work, employ an axiomatic perspective to the shape of the completed curve, i.e., they state the desired perceptual characteristics that the completed curve should satisfy. Among the most popular axioms suggested are
– Isotropy - invariance of the completed curve to rigid transformations.
– Smoothness - the completed curve is analytic (or in some cases, differentiable once).
– Total minimum curvature - the integral of curvature along the curve should be as small as possible.
– Extensibility - any two arbitrary tangent inducers along a completed curve C should generate the same shape as the shape of the portion of C connecting them.
– Scale invariance - the completed shape should be independent of the viewing distance.
– Roundedness - the shape of a completed curve induced by two cocircular inducers should be a circle.
– Total minimum change of curvature - the integral of the derivative of curvature along the curve should be as small as possible.
Evidently, some of the suggested perceptual axioms conflict with each other, and hence much debate exists in the computational literature as to which axioms are the “right” ones, a choice which affects the entailed completion model. Among the first suggested models is the biarc model, proposed by Ullman [22], who sought to satisfy the first four axioms from the list above. According to this model, the completed curve between two inducers consists of two circular arcs, each tangent both to an inducer and to the other arc. Since the number of such biarc pairs is infinite, the axiom of minimal total curvature helps to narrow down the possible completions to one unique curve. While Ullman took the axiom of total minimum curvature in the narrow scope of biarc curves only, the same axiom was studied in its strict sense by Horn [9], and later Mumford [15]. The resulting class of curves, known as elastica, minimizes the functional that expresses total squared curvature, ∫ κ(s)² ds, and its corresponding Euler-Lagrange equation leads to a differential equation that must be solved in order to derive the elastica curve of two given inducers. An arclength parametrization form of this equation was shown to be
θ̇² = (1/c) sin(θ + φ) ,
(1)
where θ is the tangential angle of the curve at each position [9]. To solve this equation one needs to resolve the two parameters c and φ, and while some forms of explicit solutions have been suggested based on elliptic integrals [9] or theta functions [15], no closed-form analytic solution is yet known. Since the minimum total curvature axiom compromises scale invariance, efforts to unite them resulted in a scale invariant elastica model (e.g., [19]). However, since the elastica model also violates roundedness and the scale invariant elastica model still violates extensibility, Kimia et al. [13] replaced the minimum total curvature axiom with the minimization of total change in curvature. This axiom immediately entails a class of curves known in the mathematical literature as Euler Spirals, whose properties satisfy most other axioms mentioned above, including roundedness (but excluding total minimum curvature, of course). A somewhat different, non-axiomatic approach to curve completion was taken by Williams and Jacobs [24] in their stochastic model, which employs assumptions on the generation process (in their case, a certain random walk with user-defined parameters) rather than on the desired final shape. Indeed, although it was not verified against perceptual findings, in their work they have argued that the completed curve is the most likely random walk in a 3D discrete lattice of positions and orientations. Since most computational models employ perceptual axioms that are formalized rigorously and used to define the completed shape, it is important to view these attempts in the context of recent perceptual findings from the visual psychophysics literature. Indeed, visual completion was first described and studied by perceptual researchers (most notably Kanizsa [11]) mostly in order to find when two different inducers are indeed grouped to induce a curve (a.k.a. the grouping problem, e.g. [12]). In recent years, however, perceptual studies have also been focusing on measuring and characterizing the shape of visually completed curves (the shape problem) in different experimental paradigms. For example, the dot localization paradigm [8] is a method where observers are asked to localize a point probe either inside or outside an amodally or modally completed boundary, and the distribution of responses is used to determine the likely completed shape. Among the insights that emerged from this and other approaches (e.g., by oriented probe localization [6, 21]) it was concluded that completed curves are perceived (i.e., constructed) quickly, that their shape deviates from constant curvature and thus defies the roundedness axiom (e.g., [8, 21]), and that the completion depends on the distance between the inducers and thus violates scale invariance (e.g., [6]). While this paper focuses on computational aspects and keeps the full fledged perceptual validation for future work, we do show how our new theory supports these very same perceptual findings as a result of basic principles rather than by imposing them as axioms. Finally, since curve completion is not only a computational problem, but first and foremost a perceptual task performed by the visual system, it is worth reviewing basic neurophysiological aspects of the latter. The seminal work by Hubel and Wiesel [10] on the primary visual cortex (V1) has shown that orientation selective cells exist at all orientations (and at various scales) for all retinal
positions, i.e., for each “pixel” in the visual field. This was captured by the so-called ice cube model suggesting that V1 is continuously divided into full-range orientation hypercolumns, each associated with a different image (or retinal) position [10]. Hence, an image contour is represented in V1 as an activation pattern of all those cells that correspond to the oriented tangents along the curve’s arclength (Fig 1A). Interestingly, the participation of early visual neurons in the representation of curves is not limited to viewable curves only, and was shown to extend to completed or illusory curves as well. Several studies have found that orientation selective cells in V1 of Macaque monkeys respond to illusory contours as well (e.g., [7, 23]), further supporting the conclusion that curve completion is an early visual process that takes place as low as the primary visual cortex.
3 Curve Completion in the Tangent Bundle T (I) = R² × S¹
The basic neurophysiological findings mentioned above suggest that we can abstract each orientation hypercolumn as an infinitesimally thick “fiber”, and place each of them at the position in the image plane that is associated with the hypercolumn. Doing so, one obtains an abstraction of V1 by the space R² × S¹ [1], as is illustrated in Fig. 1. This space is an instance of a fundamental construct in modern differential geometry, the unit tangent bundle [16] associated with R². A tangent bundle of a manifold S is the union of tangent spaces at all points of S. Similarly, a unit tangent bundle is the union of unit tangent spaces at all points of S. In our case, it therefore holds [16] that
Definition 1. Let I = R² be the image plane. T (I) = R² × S¹ is the (unit) tangent bundle of I.
Recall our observation that an image contour is represented in V1 as an activation pattern of all those cells that correspond to the oriented tangents along the curve’s arclength. Given the R² × S¹ abstraction, and remembering that the tangent orientation of a regular curve is a continuous function, we immediately observe that the representation of a regular image curve α(t) is a regular curve β(t) in T (I) (Fig. 1B). The last observation leads us to our main idea: If curve completion (like many other visual processes) is an early visual process in V1, and if V1 can be abstracted as the space T (I), then perhaps this completion process should be investigated in this space, rather than in the image plane I itself. In this paper we offer such a mathematical investigation whereby curve completion is carried out in T (I), followed by projection to I. Part of our motivation for this idea is that unlike the debatable perceptual axioms in the image plane, perhaps the T (I) space, as an abstraction of the cortical machinery, offers more basic (and not necessarily perceptual) completion principles, from which perceptual properties emerge as a consequence. It should be mentioned that in addition to Williams and Jacobs [24] mentioned above, the space of positions and orientations in vision was already used
Fig. 1. The unit tangent bundle as an abstraction for the organization and mechanisms of the primary visual cortex. (A) The primary visual cortex is organized in orientation hypercolumns [10], which implies that every retinal position is covered by neurons of all orientations. Thus, for each “pixel” p in the visual field we can think of a vertical array of orientation-selective cells extending over p and responding selectively (shown in red) according to the stimulus that falls on that “pixel”. (B) The organization of V1 implies it can be abstracted as the unit tangent bundle R² × S¹, where vertical fibers are orientation hypercolumns, and the activation pattern of image curve α (in blue) becomes the “lifted” curve β (in red), such that α(t) and β(t) are linked by the admissibility constraint in Eq. 2. (C) Not every curve in R² × S¹ is admissible, and in particular, linear curves (e.g., in red) are usually inadmissible since they imply a linear progression of the orientation, which contradicts the “straightness” of the projected curve in the image plane (in blue).
in several cases for tasks such as curve integration, texture processing, and curve completion (e.g., [2–4, 18]). As will be discussed later, our contribution is quite different from the previous attempts to address curve completion in the roto-translational or the unit tangent bundle spaces - it is a variational approach rather than one based on diffusion [4], it is much simpler and facilitates the analysis of perceptual properties (unlike in [4, 18]), and it is unique in incorporating a relative scaling between the spatial and angular dimensions, a critical act for making these dimensions commensurable (unlike in [3, 4, 18]).
3.1 Admissibility in T (I)
At first sight, curve completion in T (I) may not be that different from curve completion in the image plane I, except, possibly, for the higher dimension involved. This intuition, unfortunately, is incorrect. We first observe that when switching to T (I), the curve completion problem (see Sec. 1) becomes a problem of curve construction between boundary points (e.g., the left-most and right-most red points in Fig. 1B), rather than between oriented inducers. More importantly, as we now discuss, the completed curves in T (I) cannot be arbitrary. In fact, the class of curves that we can consider is quite constrained. Let α(t) = [x(t), y(t)] be a regular curve in I. Its associated curve in T (I) is created by “lifting” α to R² × S¹, yielding a curve β(t) = [x(t), y(t), θ(t)], which satisfies (using Newton’s notation for differentiation)
tan θ(t) = ẏ(t)/ẋ(t) ,   where ẋ(t) ≡ dx/dt, ẏ(t) ≡ dy/dt .   (2)
We emphasize that α(t) and β(t) are intimately linked by Eq. 2, and that α is the projection of β back to I. An example of such corresponding curves is shown in Fig. 1B. One can immediately notice that while every image curve can be lifted to T (I), not all curves in T (I) are a lifted version of some image curve [3]. We therefore define
Definition 2. A curve β(t) = [x(t), y(t), θ(t)] ∈ T (I) is called admissible if and only if ∃α(t) = [x(t), y(t)] such that Eq. 2 is satisfied.
There are more inadmissible curves in T (I) than admissible ones, and examples are shown in Fig. 1 for both admissible (in panel B) and inadmissible (in panel C) curves. Therefore, any completion mechanism in T (I) is restricted to admissible curves only, and we shall refer to Eq. 2 accordingly as the admissibility constraint.
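A small Python sketch makes the lifting operation of Eq. 2 concrete: given a sampled image curve, the lifted curve simply attaches θ(t) = atan2(ẏ, ẋ) to every point, and is therefore admissible by construction. The helper name is illustrative.

import numpy as np

def lift_to_tangent_bundle(x, y):
    """Lift a sampled image curve alpha(t) = [x(t), y(t)] to T(I) = R^2 x S^1
    by attaching the tangent orientation theta(t) = atan2(dy/dt, dx/dt) (Eq. 2)."""
    dx, dy = np.gradient(x), np.gradient(y)
    theta = np.arctan2(dy, dx)
    return np.stack([x, y, theta], axis=1)   # admissible curve beta(t) by construction

# example: lifting a quarter circle
t = np.linspace(0.0, np.pi / 2, 100)
beta = lift_to_tangent_bundle(np.cos(t), np.sin(t))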
3.2 “Least Action” Completion in T (I)
Since curve completion in T (I) is still the problem of connecting boundary data in some vector space, any principle that one could use in I is a candidate principle for completion in T (I) also, subject to the admissibility constraint. However, when we recall that T (I) is an abstraction of V1, the first candidate completion principle should perhaps attempt to capture likely behavior of neuron populations rather than perceptual criteria (such as the axioms discussed above). Perhaps the simplest of such principles is a “minimum energy consumption” or “minimum action” principle, according to which the cortical tissue would attempt to link two boundary points (i.e., active cells) with the minimum number of additional active cells that give rise to the completed curve. In the abstract this becomes the case of the shortest admissible path in T (I) connecting two endpoints (x0, y0, θ0) and (x1, y1, θ1). Note that while such a curve in I is necessarily a straight line, most linear curves in T (I) are “inadmissible” in the sense of Definition 2 (see Fig. 1C). Since the shortest admissible curve in T (I) has a non trivial projection in the image plane, we hypothesize that the “minimum action” principle in T (I) corresponds to the visually completed curve, whose geometrical and perceptual properties are induced from first principles rather than imposed as axioms.
4 Minimum Length Curve in the Tangent Bundle
Given the motivation, arguments, and insights above, we are now able to define our curve completion problem formally. Let p0 and p1 be two given endpoints in T (I) which represent two oriented inducers in the image plane I. We seek the shortest admissible path in T (I) between these two given endpoints, i.e., the curve that minimizes
L = ∫_{p0}^{p1} ‖β̇(t)‖ dt = ∫_{p0}^{p1} √(ẋ(t)² + ẏ(t)² + θ̇(t)²) dt   (3)
subject to the admissibility constraint.
A natural question that comes to mind relates to the units and relative scale of coordinates in T (I). Indeed, while x and y are measured in meters (or other length units), θ is measured in radians (though note that tan θ = ẏ/ẋ, cos θ, and sin θ are dimensionless). To balance dimensions in the arclength integral, and to facilitate a relative scale between the spatial and angular coordinates, a proportionality constant ε in units of meters per radian should be incorporated in Eq. 3. We do this in a manner reminiscent of many physical constants which were first discovered as proportionality constants between dimensions (e.g., the reduced Planck constant, which proportions the energy of a photon and the angular frequency of its associated electromagnetic wave). We thus re-define the unit arclength of a curve in T (I) and formulate our problem as follows
Problem 2. Given two endpoints p0 = [x0, y0, θ0] and p1 = [x1, y1, θ1] in T (I), find the curve β(t) = [x(t), y(t), θ(t)] that minimizes the functional
L(β) = ∫_{t0}^{t1} √(ẋ² + ẏ² + ε²θ̇²) dt   (4)
while satisfying the boundary conditions β(t0) = p0 and β(t1) = p1 and the admissibility constraint from Eq. 2. The rest of our paper is devoted to solving this problem formally, and to analyzing the properties of the solution. Unlike the few previous attempts to address similar problems, where the focus was on the assumption that ε = 1 only (e.g., [3, 4, 18]), here we not only address the most general problem but we also offer a simpler and the first complete solution to the exact problem, which facilitates the analysis of perceptual properties and a variety of experimental results.
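For a sampled admissible curve, the functional of Eq. 4 can be approximated numerically as below (Python; the helper name is illustrative, and eps plays the role of the proportionality constant ε).

import numpy as np

def tangent_bundle_length(beta, eps=1.0):
    """Discrete version of the functional in Eq. 4 for a sampled admissible
    curve beta = [(x, y, theta), ...]; eps weights the angular coordinate."""
    d = np.diff(beta, axis=0)
    dtheta = (d[:, 2] + np.pi) % (2 * np.pi) - np.pi   # wrap angle steps to (-pi, pi]
    return np.sqrt(d[:, 0]**2 + d[:, 1]**2 + (eps * dtheta)**2).sum()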
4.1 Theoretical Analysis
In trying to solve Problem 2 we follow the general agreement in the perceptual literature that visually completed curves can not incorporate inflection points (e.g., [12, 20]). Hence, in what follows we intentionally limit our discussion to non-inflectional retinal/image curves. Let α(s) = [x(s), y(s)] be an image curve in arclength parametrization, whose corresponding ‘lifted’ curve in T (I) is β(s) = [x(s), y(s), θ(s)] .
(5)
Representing all admissible curves in T (I) in this form, the functional L from Eq. 4 becomes
L(β) = ∫_0^l √(ẋ(s)² + ẏ(s)² + ε²θ̇(s)²) ds ,   (6)
while the admissibility constraint (Eq. 2) turns to
cos θ(s) = ẋ(s) ,   sin θ(s) = ẏ(s) ,
(7)
and l is the total length of α(s). Given these observations, the following is our main theoretical result in this paper
Theorem 1. Of all admissible curves in T(I), those that minimize the functional in Eq. 6 belong to a two-parameter (c, φ) family defined by the following differential equation
ε² (dθ/ds)² = c²/sin²(θ + φ) − 1 .   (8)
(9)
(9)
Substituting Eq. 9 into Eq. 6 we get
L(β) = ∫_0^l √((dx/ds)² + (dy/ds)² + ε²(dθ/ds)²) ds = ∫_0^l √(R²cos²(θ) dθ²/ds² + R²sin²(θ) dθ²/ds² + ε² dθ²/ds²) ds ,
from which we immediately obtain the following new form for our minimization problem, this time in terms of θ (the tangent orientation of the curve)
L(β) = ∫_{θ0}^{θ1} √(R(θ)² + ε²) dθ .   (10)
By using this form to describe the curve, we are at the risk of ignoring the boundary conditions x0, x1 and y0, y1, that must be introduced back into the problem¹. This can be done by adding constraints on R(θ) so that the projection of the induced curve is forced to pass through [x0, y0] and [x1, y1]. Expressed in terms of R and θ these constraints become:
x1 − x0 = ∫_0^l ẋ ds = ∫_{θ0}^{θ1} R cos θ dθ
y1 − y0 = ∫_0^l ẏ ds = ∫_{θ0}^{θ1} R sin θ dθ
or
∫_{θ0}^{θ1} [R cos θ − Δx/Δθ] dθ = 0
∫_{θ0}^{θ1} [R sin θ − Δy/Δθ] dθ = 0
¹ Note that the initial conditions on the inducers’ orientation become very explicit in this form.
where Δx = x1 − x0, Δy = y1 − y0, and Δθ = θ1 − θ0. These additional constraints can now be incorporated into our new functional in Eq. 10 using two arbitrary Lagrange multipliers λx and λy, which results in the following complete form for our minimization problem in terms of θ:
L(β) = ∫_{θ0}^{θ1} [√(R² + ε²) + λx(R cos θ − Δx/Δθ) + λy(R sin θ − Δy/Δθ)] dθ .   (11)
With this, the corresponding Euler-Lagrange equation becomes rather simple:
d/dR [√(R² + ε²) + λx(R cos θ − Δx/Δθ) + λy(R sin θ − Δy/Δθ)] = 0
thus
R/√(R² + ε²) + λx cos θ + λy sin θ = 0 .
(12)
˙ we apply Renaming λx = 1c sin φ , λy = 1c cos φ, and remembering that R = 1/θ, the square function over both sides of Eq. 12 to finally obtain Eq. 8. 4.2
4.2 Numerical Solution
The similarity of Eq. 8 to the Elastica Eq. 1 (and also to the Elastica-Pendulum equation [14]) suggests that a closed form analytic solution is unlikely. Instead, we turn to formulate a numerical solution of a first order ordinary differential equation (ODE) with Dirichlet boundary conditions. A standard numerical technique for such an ODE is based on nonlinear optimization that seeks the true values of the equation parameters for given boundary points, in our case p0 = (x0, y0, θ0) and p1 = (x1, y1, θ1). A first step towards the formulation of such an optimization process for Eq. 8 is to examine the curve's degrees of freedom vs. the problem's constraints. Eq. 8 represents a family of planar curves in Whewell form (i.e., an equation that relates the tangential angle of the curve with its arclength), in which their Cartesian coordinates are computed as
x(s) = x0 + ∫_0^s cos θ(s̃) ds̃
y(s) = y0 + ∫_0^s sin θ(s̃) ds̃ .   (13)
Thus, a single and unique curve from our family of solutions is determined by 6 degrees of freedom: θ0, φ, and c are needed to set a single and unique θ(s) function via Eq. 8, and x0, y0 and l are needed to determine the curve's coordinate functions x(s) and y(s) from the first to the second inducer via Eq. 13. At the same time, the curve completion problem provides six constraints expressed by the two given inducers:
x(0) = x0   x(l) = x1
y(0) = y0 y(l) = y1
θ(0) = θ0 θ(l) = θ1 .
The matched number of degrees of freedom and constraints implies that the optimization process is solvable by a unique solution. It also suggests that the optimization can be framed as an iterative search process that involves three steps: (1)
a selection of values for the parameters c, φ, and l, (2) the construction of a curve starting from p0 = [x0, y0, θ0] in a way that obeys the ODE in Eq. 8, and (3) the evaluation of the correctness of the parameters by assessing the error when s reaches the arclength l. The error E(c, φ, l) between the desired and obtained endpoints at s = l is then used to update the parameters for the next iteration. More specifically, for each iteration i with a given starting point p0 and parameter values c and φ we first solve the differential equation 8 via Euler's method
s_{n+1} = s_n + h
θ_{n+1} = θ(s_{n+1}) = θ(s_n + h) ≈ θ(s_n) + h·κ(s_n) = θ_n + h·√(c²/(ε² sin²(θ_n + φ)) − 1/ε²)
y_{n+1} = y(s_n + h) ≈ y(s_n) + h·ẏ(s_n) = y_n + h·sin θ_n
x_{n+1} = x(s_n + h) ≈ x(s_n) + h·ẋ(s_n) = x_n + h·cos θ_n
where h is a preselected step size and the error is of order O(h²). The curve βi(s) = (x(si), y(si), θ(si)) computed by this step is then evaluated at sn = l (i.e., at step n = l/h) to obtain the point (xend, yend, θend), and the error E(c, φ, l) associated with the current values of the parameters is computed by E(c, φ, l) = ‖(x1, y1, θ1) − (xend, yend, θend)‖. The new values for c, φ and l are then computed by a gradient descent on E(c, φ, l).
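The shooting procedure described in this section can be sketched in a few lines of Python. The sketch below integrates Eq. 8 with Euler steps and searches for (c, φ, l) by minimising the endpoint error; it substitutes SciPy's derivative-free Nelder-Mead optimizer for the gradient descent used in the text, and all names, initial values, and tolerances are illustrative.

import numpy as np
from scipy.optimize import minimize

def integrate(c, phi, l, p0, eps=1.0, h=1e-3):
    """Euler integration of Eq. 8 from the first inducer p0 = (x0, y0, theta0)."""
    x, y, theta = p0
    for _ in range(max(int(l / h), 1)):
        # curvature from Eq. 8 (clamped so the square root stays real)
        kappa = np.sqrt(max(c**2 / (eps**2 * np.sin(theta + phi)**2 + 1e-12) - 1.0 / eps**2, 0.0))
        x += h * np.cos(theta)
        y += h * np.sin(theta)
        theta += h * kappa
    return np.array([x, y, theta])

def complete_curve(p0, p1, eps=1.0, init=(1.0, 0.1, 1.0)):
    """Search for (c, phi, l) so that the integrated curve ends at p1.
    Note: the angular component of the error is not wrapped here."""
    def endpoint_error(params):
        c, phi, l = params
        return np.linalg.norm(integrate(c, phi, abs(l), p0, eps) - np.array(p1))
    res = minimize(endpoint_error, init, method="Nelder-Mead")
    return res.x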
5 Dependency on Scale and Other Visual Properties
Our discussion so far illustrates how the problem of curve completion can be formulated and solved in the space that abstracts the early visual cortical regions where this perceptual process is likely to occur. Since the theory and the principle of “minimum action” that guides this solution are essentially “non perceptual”, it is important to ask what perceptual properties they entail, and how these predictions correspond to existing perceptual findings and the geometrical axioms reviewed earlier. As suggested in Sec. 4, the value of the constant ε could have a significant influence over the shape of curves of minimum length in the tangent bundle, as it controls the relative contribution of length and curvature in the minimization process, or put differently, the relative scale ratio (i.e., the proportion between units of measurement) of arclength and curvature. In this context, the behavior at the limits provides qualitative insights. On one hand, if ε is very small, the minimization process becomes similar to minimization of length in I (subject to boundary conditions) and we therefore expect the resultant curve to straighten (or “flatten”). On the other hand, when ε is very large, the minimization process is dominated by the minimization of the orientation derivative (again, subject to boundary conditions), a condition which resembles (qualitatively) the classical elastica and converges to a particular shape. To show these properties formally, we rewrite Eq. 8 as
(dθ/ds)² = c²/(ε² sin²(θ + φ)) − 1/ε² = c̃²/sin²(θ + φ) − 1/ε²
(14)
where c̃ = c/ε is a renaming of the constant that fits the boundary conditions. Using this representation we observe that when ε → ∞, Eq. 14 converges to
(dθ/ds)² = c̃²/sin²(θ + φ) .   (15)
(Note that in such a case c can also be very large and thus balance ε in c̃ = c/ε.) The shape of the curve described by this expression can be shown experimentally to be rather rounded although it does not describe a circular arc (see Fig. 2 for particular initial conditions). To examine the curve behavior when ε → 0, we return to Eq. 8 which now becomes
0 = c²/sin²(θ + φ) − 1 ,   or   sin(θ + φ) = c .
(16)
Hence, since in the limit sin(θ(s) + φ) must be constant for all s, θ(s) becomes constant also, which implies a linear curve. A demonstration of several curves for the same inducer pair but different values of ε is shown in Fig. 2 and Fig. 3. Changing ε amounts to changing the viewing scale of a particular completion task, i.e., applying a global scale transform on the initial conditions. Hence, one could expect that the effect on the resultant minimum length curve would be similar. To show this formally, we observe that any (i.e., not necessarily minimal) image curve αL(t) traveling from p0 = [0, 0, θ0] to p1 = [L, 0, θ1] can be written as a scaled version of some other curve α1(t) traveling from [0, 0, θ0] to [1, 0, θ1]:
αL(t) = L · α1(t) = [L · x(t), L · y(t)] ,   t ∈ [t0, t1]
where α1(t) = [x(t), y(t)], t ∈ [t0, t1], and x(t0) = 0, x(t1) = 1. Hence, the lifted T (I) curve βL(t) that corresponds to αL(t) becomes
βL(t) = [L · x, L · y, tan⁻¹(L · ẏ / L · ẋ)] = [L · x(t), L · y(t), θ(t)] ,   t ∈ [t0, t1]
where [x(t), y(t), θ(t)] is the tangent bundle curve that corresponds to α1(t). Note now that the total arclength of βL is
L(βL) = ∫_{t0}^{t1} √(L²(ẋ² + ẏ²) + ε²θ̇²) dt = L · ∫_{t0}^{t1} √(ẋ² + ẏ² + (ε/L)²θ̇²) dt .   (17)
Clearly, a minimization of this functional (that was obtained by scaling up the input) is identical to optimizing the original input while scaling down ε by L, and hence we conclude that changing the global scale has an effect inverse to changing ε. We therefore expect the case of L → 0 to reflect the case of ε → ∞, and
Fig. 2. Minimum length tangent bundle curves at different scales (or “viewing distances”) and different ε values. Here shown are two cocircular inducers ([0, 0, 45°] and [0, 8, −45°]) and completion results for different values of ε. The corresponding elastica (red) and circular curves (green) are shown for comparison. Identical results (up to scale) would be obtained for fixed ε = 1 and varying inducer distances of {0.5, 1, 2, 4, 8} units. Note the “flattening” of the completed curve with increased scale or decreased ε.
vice versa. Fig. 2 demonstrates the scale dependency using symmetric inducers as initial conditions. In addition to scaling issues, several properties of our model can be pointed out regarding the axioms of curve completion reviewed in Sec. 2. First, since our solution is not linked to any specific frame, it is trivially isotropic. Note that since the rotated minimum length curve θ̃ = θ + ρ also satisfies Eq. 8 (for the same constant c and φ̃ = φ + ρ), the solution is invariant under rotations. Second, since the solution minimizes total arclength in T (I), it must be extensible in that space and hence in the image plane also. Third, since the completed curves can be described by the differential Eq. 8, they clearly satisfy the axiom of smoothness. Obviously, the analysis of scale that was discussed above indicates that our theory generates scale-variant solutions, or put differently, it does not satisfy the axiom of scale invariance. Another axiom where our model departs from prior solutions is the axiom of roundedness, since it is easy to confirm that the case of constant curvature (dθ/ds = const) does not satisfy Eq. 8. At first sight these two properties could undermine the utility of our model, but given the refutation of both scale invariance and roundedness at the perceptual and psychophysical level (as discussed in Sec. 2), we consider these properties an important advantage of our theory rather than a limitation of our model. That these properties were derived as emergent properties rather than imposed axioms is yet another benefit of our approach as a whole.
6 Experimental Results
While this paper is primarily theoretical, and while the full utility of our proposed new theory requires psychophysical verification (which is part of our short-term future research), here we have also experimented with our results by applying them to selected instances of curve completion problems. Some results are shown in Fig. 3 and demonstrate that the completed curves correspond well with our perception. Clearly, the determination of the “correct” ε is a matter of perceptual calibration, which is outside the scope of this computational paper. For the
Fig. 3. Examples of curve completion in the unit tangent bundle applied in different settings. Two top rows: (Swan): Magenta completion reflects ε = 1 and pixel size (i.e., viewing scale) of 0.04; blue completion reflects ε = 2 and pixel size of 0.0025. (Mushroom): Cyan completion reflects ε = 1.5 and pixel size of 0.01; the yellow curve reflects completion according to the elastica model and the magenta curve reflects completion according to the biarc model. (Camel): Magenta completion reflects ε = 1 and pixel size of 0.01. Bottom row: Results of our model on a real occluder image and on a modal completion stimulus; cyan and magenta completions reflect ε = 1 and pixel size of 0.01.
demonstrated examples we have calibrated ε manually after setting the scale (i.e., pixel size in length units) arbitrarily (see the different settings in the swan demo of Fig. 3). Inducers' orientations were measured manually and all initial data were fed to the numerical algorithm from Sec. 4.2. In all results the parameters c, φ and l were optimized up to an error E(c, φ, l) ≤ 10^{-5}. The resultant curves of minimum arclength in T(I) were then projected to the image plane I and plotted on the missing parts. In several cases we also show a comparison to other common models, such as the biarc and the elastica models. However, we do note that until curve completion is investigated in a comprehensive fashion against the statistics of natural image contours (e.g., in the spirit of [5]), such comparisons do not necessarily indicate that a certain model is better than others. Studying these links is a primary goal of our short-term future research. Acknowledgement. This work was funded in part by the Israel Science Foundation (ISF) grant No. 1245/08. We also thank the generous support of the Frankel fund, the Paul Ivanier Center for Robotics Research and the Zlotowski Center for Neuroscience at Ben-Gurion University.
References
1. Ben-Shahar, O., Zucker, S.: The perceptual organization of texture flows: A contextual inference approach. IEEE Trans. Pattern Anal. Mach. Intell. 25, 401–417 (2003)
2. Ben-Shahar, O.: The Perceptual Organization of Visual Flows. PhD thesis, Yale University (2003)
3. Ben-Yosef, G., Ben-Shahar, O.: Minimum length in the tangent bundle as a model for curve completion. In: Proc. CVPR, pp. 2384–2391 (2010)
4. Citti, G., Sarti, A.: A cortical based model of perceptual completion in the roto-translation space. J. of Math. Imaging and Vision 24, 307–326 (2006)
5. Geisler, W., Perry, J.: Contour statistics in natural images: Grouping across occlusions. Visual Neurosci. 26, 109–121 (2009)
6. Gerbino, W., Fantoni, C.: Visual interpolation is not scale invariant. Vision Res. 46, 3142–3159 (2006)
7. Grosof, D., Shapley, R., Hawken, M.: Macaque V1 neurons can signal ‘illusory’ contours. Nature 365, 550–552 (1993)
8. Guttman, S., Kellman, P.: Contour interpolation revealed by a dot localization paradigm. Vision Res. 44, 1799–1815 (2004)
9. Horn, B.: The curve of least energy. ACM Trans. Math. Software 9, 441–460 (1983)
10. Hubel, D., Wiesel, T.: Functional architecture of macaque monkey visual cortex. Proc. R. Soc. London Ser. B 198, 1–59 (1977)
11. Kanizsa, G.: Organization in Vision: Essays on Gestalt Perception. Praeger Publishers, Westport CT (1979)
12. Kellman, P., Shipley, T.: A theory of visual interpolation in object perception. Cognitive Psychology 23, 141–221 (1991)
13. Kimia, B., Frankel, I., Popescu, A.: Euler spiral for shape completion. Int. J. Comput. Vision 54, 159–182 (2003)
14. Levien, R.: The elastica: a mathematical history. Technical Report UCB/EECS-2008-103, EECS Department, University of California, Berkeley (2008)
15. Mumford, D.: Elastica and computer vision. In: Bajaj, C. (ed.) Algebraic Geometry and its Applications. Springer, Heidelberg (1994)
16. O'Neill, B.: Semi-Riemannian Geometry with Applications to Relativity. Academic Press, London (1983)
17. Palmer, S.: Vision Science: Photons to Phenomenology. The MIT Press, Cambridge (1999)
18. Petitot, J.: Neurogeometry of V1 and Kanizsa contours. Axiomathes 13 (2003)
19. Sharon, E., Brandt, A., Basri, R.: Completion energies and scale. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1117–1131 (2000)
20. Singh, M., Hoffman, D.: Completing visual contours: The relationship between relatability and minimizing inflections. Percept. Psychophys. 61, 943–951 (1999)
21. Singh, M., Fulvio, J.: Visual extrapolation of contour geometry. Proc. Natl. Acad. Sci. USA 102, 939–944 (2005)
22. Ullman, S.: Filling in the gaps: The shape of subjective contours and a model for their creation. Biol. Cybern. 25, 1–6 (1976)
23. Von der Heydt, R., Peterhans, E., Baumgartner, G.: Illusory contours and cortical neuron responses. Science 224, 1260–1262 (1984)
24. Williams, L., Jacobs, D.: Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Comp. 9, 837–858 (1997)
Geotagged Image Recognition by Combining Three Different Kinds of Geolocation Features Keita Yaegashi and Keiji Yanai Department of Computer Science, The University of Electro-Communications 51–5–1 Chofugaoka, Chofu-shi, Tokyo, 182–8585 Japan
[email protected],
[email protected]
Abstract. Scenes and objects represented in photos have a causal relationship to the places where they are taken. In this paper, we propose using geo-information such as aerial photos and location-related texts as features for geotagged image recognition, and fusing them with Multiple Kernel Learning (MKL). Through experiments, we have verified the possibility of reflecting location contexts in image recognition by evaluating not only recognition rates, but also the feature fusion weights estimated by MKL. As a result, the mean average precision (MAP) for 28 categories increased up to 80.87% with the proposed method, compared with 77.71% for the baseline. Especially for the categories related to location-dependent concepts, MAP was improved by 6.57 points.
1 Introduction
In recent years, due to the rapid spread of camera-equipped cellular phones and digital cameras with GPS, the number of geotagged photos is increasing explosively. Geotags enable people to see photos on maps and to get to know the places where the photos were taken. In particular, photo sharing Web sites such as Flickr and Picasa gather a large number of geotagged photos. Here, “geotag” means a two-dimensional vector consisting of values of latitude and longitude which represent where a photo was taken. In this paper, we exploit geotags as additional information for visual recognition of consumer photos to improve its performance. Geotags have the potential to improve the performance of visual image recognition, since the distributions of some objects are not even but concentrated in some specific places. For example, “beach” photos can be taken only around borders between the sea and land, and “lion” photos can be taken only in zoos outside Africa. In this way, geotags can restrict the concepts to be recognized in images, so we expect that geotags can help visual image recognition. The most naive way to use geo-information for image recognition is adding 2-dim location vectors consisting of values of latitude and longitude to image feature vectors extracted from a given photo. However, this method is not effective except for some concepts associated with specific places such as “Disneyland” and “Tokyo tower”, because objects represented by general nouns such as “river”
and “lion” can be seen at many places. To make location vectors improve image recognition performance, we would have to prepare in advance all the locations where such objects can be seen in the database. This requirement is too unrealistic to handle various kinds of concepts to be recognized. To make effective use of geo-information for image recognition, several methods to convert 2-dim location vectors into more effective features to be integrated with photo image features have been proposed so far [1–4]. Luo et al. [1] and Yaegashi et al. [3, 4] proposed using aerial photos which are taken at the places corresponding to the geotags, and extracting visual features from them. On the other hand, Joshi et al. [2] proposed converting 2-dim geotag vectors into geo-information texts using reverse geo-coding Web services such as geonames.com, and forming bag-of-words vectors. However, so far there exists no work that makes use of both aerial visual features and geo-information text features as additional features. Therefore, in this paper, we propose integrating both features extracted from aerial photos and geo-information texts with visual features extracted from a given image. To fuse the features of aerial photos and of geo-location texts with the visual features of a given image, we use Multiple Kernel Learning (MKL). MKL is also used to fuse aerial features and image features in [4]. Compared to [4], the main contributions of this paper are as follows: (1) We fuse various kinds of features derived from geotagged photos, including color visual features, geo-location texts, 2-dim geotag vectors and time-stamp features expressing the date and time when a photo was taken, in addition to the aerial features and image features extracted by grayscale-SIFT-based bag-of-features, using MKL. (2) The experiments are more comprehensive regarding the number of concepts and combinations of features. Through the experiments, we have confirmed the effectiveness of using both aerial features and geo-text features for image recognition by evaluating not only recognition rates, but also the feature fusion weights estimated by MKL. As a result, the mean average precision (MAP) for 28 categories increased up to 80.87% with the proposed method, compared with 77.71% for the baseline. Especially for the categories related to landmarks, MAP was improved by 6.57 points. The rest of this paper is organized as follows: Section 2 describes related work, and Sections 3 and 4 explain the overview and the procedure of geotagged image recognition by fusing various kinds of features, including ones extracted from photos, aerial photos and location-related texts, with Multiple Kernel Learning (MKL). Section 5 shows the experimental results and discusses them, and we conclude this paper in Section 6.
2 Related Work
To utilize geotags in visual image recognition, four papers have been proposed so far, as mentioned in the previous section [1–4]. They cover three kinds of methods: (1) combining values of latitude and longitude with visual features of a photo image [3], (2) combining visual features extracted from aerial
photo images with visual features extracted from a photo image [1, 3, 4], and (3) combining textual features extracted from geo-information texts obtained from reverse geo-coding services [2]. Method (1) is a relatively straightforward approach, and it improved recognition performance only for concepts associated with specific places such as “Disneyland” and “Tokyo tower” in the experiments of [3]. On the other hand, in method (2), aerial photo images around the place where a photo was taken are utilized as additional information on the place. Since “sea” and “mountain” are distributed all over the world, it is difficult to associate values of latitude and longitude with such generic concepts directly. We therefore regard aerial photo images around the place where the photo was taken as information expressing the condition of the place, and utilize visual features extracted from aerial images as yet another kind of geographic contextual information associated with the geotags of photos. Especially for geographical concepts such as “sea” and “mountain”, using features extracted from aerial photos was much more effective than using raw values of latitude and longitude directly [3]. Since in method (2) of [3] the two kinds of feature vectors were combined by simply concatenating them into one long vector, the extent of the contribution of aerial photos to geotagged photo recognition was unclear. Since Luo et al. [1] focused on 12 event concepts such as “baseball”, “at beach” and “in park”, which can be directly recognized from aerial photos, in their experiments the results using only aerial photos were much better than the results using only the visual features of photos; in addition, they revealed that the results with both features combined by late SVM fusion [5] outperformed both individual results. However, these results cannot be generalized to more generic concepts such as “flower” and “cat”, since most generic concepts are unrecognizable directly from the sky. In [4], they proposed introducing Multiple Kernel Learning (MKL) to evaluate the contribution of both features for recognition by estimating the weights of the image features of photos and aerial images. MKL is an extension of a Support Vector Machine (SVM), and makes it possible to estimate optimal weights to integrate different features with a weighted sum of kernels. In the experiments, they evaluated the weights of both features using MKL for eighteen concepts. The contribution weights between aerial features and visual photo features in terms of image categorization are expected to vary depending on the target concepts. For concepts which can be directly recognized from aerial photos such as “beach” and “park”, the contribution rates of aerial photos were more than 50%, while they were less than 30% for concepts unrecognizable from aerial photos such as “noodle”, “vending machine” and “cat”. As work incorporating textual geo-information, Joshi et al. [2] proposed using textual features as additional features for image recognition. The textual features are obtained by reverse geo-coding as provided by geonames.com. They showed the effectiveness of using textual geo-information for object recognition. However, the following questions were not explored: Which is more effective for geotagged image recognition, aerial images or geo-textual information? How about combining both of them with visual image features? Therefore, in this paper,
we propose combining various kinds of geo-information, including 2-dim location vectors, aerial image features and geo-textual features, by MKL, and reveal which features are effective for which concepts by analyzing the experimental results. As another kind of image recognition with geotags, place recognition has also been proposed [6, 7]. In this work, geotags are used only in the training step. In the recognition step, the system recognizes not the categories of images but the places where the photos were taken. That is, geotags themselves are the targets to be recognized. IM2GPS [6] is a pioneering work on place recognition. Kalogerakis et al. improved the performance of place recognition by using statistics of travel trajectories [7].
3 Overview
The objective of this paper is to fuse various kinds of features derived from geotagged photos, including visual photo features, aerial features, geo-text features, 2-dim location vectors and date/time feature vectors, by using Multiple Kernel Learning (MKL), and to evaluate the recognition performance and the contribution weights of the various features for image recognition over various kinds of concepts. In this paper, we assume that image recognition means judging if an image is associated with a certain given concept such as “mountain” or “beach”, which can be regarded as a photo detector for a specific given concept. By combining many detectors, we can automatically add many kinds of words as word-tags to images. As the representation of photo images, we adopt the bag-of-features (BoF) representation [8] and an HSV color histogram. BoF has been shown to have an excellent ability to represent image concepts in the context of visual image recognition in spite of its simplicity. Regarding the HSV color histogram, its effectiveness as an additional feature to BoF is also shown in [1]. As the representation of aerial photos, we also adopt the bag-of-features representation of aerial photos around the geotag location. To adapt to various kinds of concepts, we prepare four levels of aerial photos in terms of scale as shown in Figure 1, and extract a BoF vector from each of them. Regarding geo-information texts, we use the Yahoo! Japan Local Search API to convert geotags into reverse geo-coded texts, and build 2000-dim Bag-of-Words (BoW) vectors by counting the frequency of the top 2000 highly frequent words. After obtaining the feature vectors, we carry out two-class classification by fusing the feature vectors with MKL. In the training step of MKL, we obtain the optimal weights to fuse the features.
4 Methods
In this section, we describe how to extract visual features from images and various kinds of features derived from geotags, and how to use them for image recognition.
Fig. 1. Correspondences between a geotagged photo and aerial images
4.1 Data Collection
In this paper, we obtain geotagged images for the experiments from Flickr by searching for images which have word tags corresponding to the given concept. Since sets of raw images fetched from Flickr always contain noise images which are irrelevant to the given concepts, we select only relevant images by hand. In the experiments, relevant images are used as positive samples, while randomly-sampled images from all the geotagged images fetched from Flickr are used as negative samples. We select 200 positive samples and 200 negative samples for each concept. Note that in the experiments we collected only photos taken inside Japan due to the availability of high-resolution aerial photos. After obtaining the geotagged images, we collect aerial photos around the points corresponding to the geotags of the collected geotagged images at several scales from an online aerial map site. In the experiments, we collect 256 × 256 aerial photos at four different scales for each Flickr photo, as shown in Figure 1. The largest-scale one (level 4) corresponds to an area of 497 meters square, the next one (level 3) corresponds to an area of 1.91 kilometers square, the middle one (level 2) corresponds to a 7.64-kilometer-square area, and the smallest-scale one (level 1) corresponds to a 30.8-kilometer-square area. In addition, to obtain textual features, we gather geo-information texts using reverse geo-coding services via the Yahoo! Japan Local Search API, which transforms a 2-dim geo-location vector into landmark names within 500 meters of the place indicated by the location vector. For example, we can obtain the names of elementary schools, hospitals, hotels, buildings, parks and temples.
4.2 BoF Features
We extract bag-of-features (BoF) [8] vectors from both photos and aerial images. The main idea of bag-of-features is to represent images as collections of independent local patches, and to vector-quantize them into histogram vectors. The main steps to build a bag-of-features vector are as follows:
1. Sample many patches from all the images. In the experiment, we sample patches on a regular grid with a spacing of 10 pixels.
2. Generate local feature vectors for the sampled patches with the SIFT descriptor [9] at four different scales: 4, 8, 12, and 16.
3. Construct a codebook with k-means clustering over the extracted feature vectors. We construct a codebook for photo images for each given concept
independently, while we construct a single codebook for aerial images that is common to all the aerial images of all concepts. We set the size of the codebook k to 1000 in the experiments.
4. Assign all feature vectors to the nearest codeword (visual word) of the codebook, and convert the set of feature vectors for each image into one k-bin histogram vector over the assigned codewords.
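The following is a minimal sketch of this pipeline, not the authors' implementation: extract_dense_sift is a hypothetical placeholder for any dense-SIFT extractor, scikit-learn's k-means stands in for the clustering step, and the L1 normalization of the final histogram is an assumption (the text does not state a normalization).

# Minimal sketch of the BoF pipeline above. extract_dense_sift() is a
# hypothetical helper returning one SIFT descriptor per grid patch;
# the codebook size (k = 1000) follows the text.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, k=1000, seed=0):
    # Cluster all sampled SIFT descriptors into k visual words.
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=k, random_state=seed, n_init=4).fit(all_desc)

def bof_histogram(descriptors, codebook):
    # Assign each descriptor to its nearest visual word and build a k-bin histogram.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1 normalization (an assumption)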
4.3 Color Histogram
SIFT-based BoF does not include color information at all. Therefore, to exploit color information for recognition, we use an HSV color histogram in addition to BoF as a visual feature of photos. A color histogram is a very common image representation. We divide an image into 5 × 5 blocks, and extract a 64-bin HSV color histogram from each block by dividing the HSV color space into 4 × 4 × 4 bins. In total, we extract a 1600-dim color feature vector, which is L1-normalized, from each image.
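A minimal sketch of this block-wise histogram is given below; it assumes the input image has already been converted to HSV with channel values scaled to [0, 1).

# Sketch of the 5 x 5 block HSV histogram above: a 4 x 4 x 4 = 64-bin
# histogram per block, 1600 dimensions in total, L1-normalized.
import numpy as np

def hsv_block_histogram(hsv, grid=5, bins=4):
    h, w, _ = hsv.shape
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = hsv[by * h // grid:(by + 1) * h // grid,
                        bx * w // grid:(bx + 1) * w // grid]
            hist, _ = np.histogramdd(block.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 1), (0, 1), (0, 1)))
            feats.append(hist.ravel())
    feat = np.concatenate(feats)        # 5 * 5 * 64 = 1600 dimensions
    return feat / max(feat.sum(), 1.0)  # L1 normalization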
4.4 Raw Location Vectors
As one of the features to be integrated by MKL, we prepare raw location vectors consisting of the values of latitude and longitude. Since a geotag represents a pair of latitude and longitude, it can be treated as a location vector as it is, without any conversion. As the coordinate system, we use the WGS84 system, which is the most common.
4.5 Aerial Photo Features
As described before, we use the bag-of-features (BoF) representation as the features extracted from aerial photos. The way to extract BoF is the same as for the visual features of photos. We convert each of the four different scales of 256 × 256 aerial images (Figure 1), whose centers correspond to the geotagged location, into a BoF vector. Note that the visual codebook for aerial images is constructed based on a set of SIFT vectors extracted from all the collected aerial images.
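As a brief usage sketch, reusing the bof_histogram helper and the hypothetical extract_dense_sift placeholder from the sketch in Sec. 4.2:

# Sketch of the aerial features: the same BoF pipeline applied to the four
# 256 x 256 aerial crops (levels 1-4) of one geotag, with a single codebook
# built from SIFT descriptors of all collected aerial images.
def aerial_features(aerial_levels, aerial_codebook):
    # aerial_levels: list of the four aerial images for one photo
    return [bof_histogram(extract_dense_sift(img), aerial_codebook)
            for img in aerial_levels]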
4.6 Geo-information Textual Features
To obtain textual features, we gather geo-information texts using reverse geo-coding services via the Yahoo! Japan Local Search API. For all the gathered geotagged images, we get the geo-location texts, and select the top 2000 highly frequent words as the codewords for the bag-of-words (BoW) representation. We count the frequency of each of the selected 2000 words, and generate a BoW histogram for each image.
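A minimal sketch of this step, assuming the reverse-geocoded words for each image have already been collected into geo_texts (one word list per image):

# Sketch of the geo-text bag-of-words features: the 2000 most frequent words
# over the whole collection form the codebook, and each image gets a word
# frequency histogram over that codebook.
from collections import Counter
import numpy as np

def geotext_bow(geo_texts, vocab_size=2000):
    counts = Counter(w for words in geo_texts for w in words)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    feats = np.zeros((len(geo_texts), len(vocab)))
    for row, words in enumerate(geo_texts):
        for w in words:
            if w in vocab:
                feats[row, vocab[w]] += 1.0
    return feats, vocab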
4.7 Date Features
In addition to geo-location features, we prepare date/time features based on time stamp information embedded in image files. To represent date and time,
we prepare two histograms, one over months and one over hours. For the month histogram, we divide the 12 months into 12 bins, and for the hour histogram, we divide the 24 hours into 24 bins. For soft weighting, we vote 0.5 on the corresponding bin and 0.25 on the neighboring bins. Finally, we concatenate the month histogram and the hour histogram into one 36-dim vector.
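A minimal sketch follows; whether the neighbor votes wrap around at the ends of the histograms is not stated in the text, so the circular treatment below is an assumption.

# Sketch of the date/time feature: a 12-bin month histogram and a 24-bin hour
# histogram with soft voting (0.5 on the bin itself, 0.25 on each neighbor,
# wrapping around -- the wrap-around is an assumption), concatenated to 36 dims.
import numpy as np

def soft_vote(index, n_bins):
    hist = np.zeros(n_bins)
    hist[index % n_bins] = 0.5
    hist[(index - 1) % n_bins] += 0.25
    hist[(index + 1) % n_bins] += 0.25
    return hist

def date_feature(month, hour):
    # month in 1..12 and hour in 0..23, taken from the image's time stamp
    return np.concatenate([soft_vote(month - 1, 12), soft_vote(hour, 24)])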
4.8 Multiple Kernel Learning
In this paper, we carry out two-class classification by fusing the visual features of photo images and aerial images with Multiple Kernel Learning (MKL). MKL is an extension of the Support Vector Machine (SVM). MKL handles a combined kernel which is a weighted linear combination of several single kernels, while a standard SVM handles only a single kernel. MKL can estimate the optimal weights for a linear combination of kernels as well as the SVM parameters simultaneously in the training step. The training method of an SVM employing MKL is sometimes called MKL-SVM. Recently, MKL-SVM has been applied to image recognition to integrate different kinds of features such as color, texture and BoF [10, 11]. However, the recent work employing MKL-SVM focuses on the fusion of different kinds of features extracted from the same image. This differs from our work, in which MKL is used for integrating features extracted from different sources, namely photos and aerial images. With MKL, we can train an SVM with an adaptively-weighted combined kernel which fuses different kinds of image features. The combined kernel is as follows:

K_{comb}(x, y) = \sum_{j=1}^{K} \beta_j K_j(x, y) \quad \text{with} \quad \sum_{j=1}^{K} \beta_j = 1,

where β_j are the weights to combine the sub-kernels K_j(x, y). As the kernel function, we used a χ² RBF kernel, which is commonly used in object recognition tasks, for all the histogram-based feature vectors except the raw location vectors. It is represented by the following equation:

K(x, x') = \exp\left( -\gamma \sum_{i=1}^{D} |x_i - x'_i|^2 / |x_i + x'_i| \right)   (1)

where γ is a kernel parameter. Zhang et al. [12] reported that the best results were obtained when the reciprocal of the average χ² distance between all the training data was used as the parameter γ of the χ² RBF kernel. We followed this method to set γ. For the 2-dim location vectors, we use the following ordinary RBF kernel, because they are not histogram-based features:

K(x, x') = \exp\left( -\gamma \sum_{i=1}^{D} |x_i - x'_i|^2 \right)   (2)

For this kernel, we also set the reciprocal of the average L2 distance between all the training data as the parameter γ.
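A minimal sketch of these kernels and of the weighted combination is given below; setting γ from the mean pairwise distance of the given matrices and fixing the weights by hand are simplifications of the procedure described above (with MKL the weights β_j would be estimated during training).

# Sketch of the chi-square RBF kernel (Eq. 1), the RBF kernel (Eq. 2) and the
# weighted kernel combination. gamma is set to the reciprocal of the mean
# pairwise distance, following the heuristic of Zhang et al. [12].
import numpy as np

def chi2_rbf_kernel(X, Y, eps=1e-10):
    num = np.abs(X[:, None, :] - Y[None, :, :]) ** 2
    den = np.abs(X[:, None, :] + Y[None, :, :]) + eps
    D = (num / den).sum(axis=2)
    gamma = 1.0 / max(D.mean(), eps)
    return np.exp(-gamma * D)

def rbf_kernel(X, Y, eps=1e-10):
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    gamma = 1.0 / max(D.mean(), eps)
    return np.exp(-gamma * D)

def combined_kernel(kernels, betas):
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()   # enforce sum(beta_j) = 1
    return sum(b * K for b, K in zip(betas, kernels))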
Sonnenburg et al.[13] proposed an efficient algorithm of MKL to estimate optimal weights and SVM parameters simultaneously by iterating training steps of a standard SVM. This implementation is available as the SHOGUN machine learning toolbox at the Web site of the first author of [13]. In the experiment, we use the MKL library included in the SHOGUN toolbox as an implementation of MKL.
5 Experiments
5.1 28-Category Dataset
For the experiments, we prepared twenty-eight concepts, a part of which is shown in Figure 2. To select the twenty-eight concepts, we first defined eight rough types of concepts and then selected several concepts for each type, as follows:
Location-specific concepts (LS) [2 concepts] Disneyland, Tokyo tower. The locations related to these concepts are specific to them. Since all the aerial photos related to these concepts are similar to each other, using aerial photos is expected to improve recognition performance considerably. Moreover, 2-dim raw location vectors are also expected to work well.
Landmark concepts (LM) [5 concepts] bridge, shrine, building, castle, railroad. Since it is possible to recognize these concepts on aerial photos directly, an improvement of performance is expected.
Geographical concepts (GE) [3 concepts] lake, river, beach. These concepts are also expected to be recognizable from the sky directly.
Space concepts (SP) [3 concepts] park, garden, landscape. These concepts represent spaces which consist of various elements. It is difficult to predict whether aerial features will work or not.
Outdoor artifact concepts (OA) [5 concepts] statue, car, bicycle, graffiti, vending machine. These concepts are almost impossible to recognize in aerial images, since some are very small and the others are movable.
Time-dependent concepts (TD) [5 concepts] sunset, cherry blossom, red leaves, festival, costume-play festival. The first three of these five concepts depend on time or seasons rather than locations, while “festival” and “costume-play festival” depend on both time and location.
Creature concepts (CR) [3 concepts] cat, bird, flower. These concepts are very difficult to recognize in aerial images.
Food concepts (FD) [2 concepts] sushi, ramen noodle. These concepts are impossible to recognize in aerial images. We cannot expect any improvement from additional features derived from geotags.
Fig. 2. Eighteen of the twenty-eight concepts used in the experiments
Location-specific, landmark and geographical concepts are expected to be correlated with aerial images, while creature, food and time-dependent concepts are expected to have little or no correlation with aerial images. We gathered photo images associated with these concepts from Flickr, and selected 200 positive sample images for each concept. As negative sample images, we selected 200 images randomly from 100,000 images gathered from Flickr.
5.2 Combinations of Features
In total, we use nine features in this paper: the BoF and HSV color histograms of photos, the BoF of aerial images at four scales, geo-text features, raw location vectors and time features. In the experiments, we prepare the eleven kinds of combinations shown in Table 1, and provide each of them to Multiple Kernel Learning for training. Note that we regard the four levels of aerial features as one feature for evaluation, although we provide the four levels of aerial features to MKL independently.
5.3 Evaluation
In the experiments, we carried out two-class image classification and estimated the weights to integrate the BoF and color features of photos and the various features derived from geotags using MKL for the twenty-eight concepts. We evaluated the experimental results with five-fold cross validation using the average precision (AP), which is computed by the following formula:

AP = \frac{1}{N} \sum_{i=1}^{N} Prec(i),   (3)

where Prec(i) is the precision rate at the i-th positive image from the top, and N is the number of positive test images for each fold.
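A minimal sketch of this measure, computed from classifier scores and binary ground-truth labels of one test fold:

# Sketch of Eq. (3): Prec(i) is the precision at the rank of the i-th positive
# test image; AP averages it over the N positives of the fold.
import numpy as np

def average_precision(scores, labels):
    order = np.argsort(-np.asarray(scores))      # rank test images by decreasing score
    ranked = np.asarray(labels)[order]
    pos_ranks = np.flatnonzero(ranked == 1)      # 0-based ranks of the positives
    precisions = [(i + 1) / (r + 1) for i, r in enumerate(pos_ranks)]
    return float(np.mean(precisions)) if precisions else 0.0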
Table 1. Eleven kinds of combinations of features

combination         BoF  HSV  Aerial(A)  Geo-location(G)  Texts(T)  Date(D)
BoF (BoF)            O
HSV (HSV)                 O
Vis (Vis)            O    O
Vis+G (VG)           O    O               O
Vis+T (VT)           O    O                                O
BoF+A (BA)           O         O
Vis+A (VA)           O    O    O
Vis+A+T (VAT)        O    O    O                           O
Vis+G+T+D (VGTD)     O    O               O                O         O
Vis+A+T+D (VATD)     O    O    O                           O         O
All (All)            O    O    O          O                O         O

5.4 Experimental Results
In this subsection, we describe the results of geotagged image recognition by MKL-SVM in terms of the eleven kinds of feature combinations shown in Table 1. Table 2 shows the average precision (AP) of each recognition result. Each column in this table corresponds to an abbreviation representing a combination of features shown in Table 1. The results for Vis (BoF+HSV) can be regarded as the baseline results, since they are obtained using only visual features extracted from photos. As a result, the mean average precision (MAP) for 28 categories increased up to 80.87% with the proposed method, compared with 77.71% for the baseline. Especially for the categories related to location-dependent concepts, MAP was improved by 6.57 points. To clarify the gains obtained by introducing geo-related features, we show in Table 3 the differences in AP values between the baseline (Vis) results and the results obtained by combining photo features with some geo-related features. In terms of the average gain, the gain when using raw location vectors was a small value, 0.89, while we obtained the best gain, 3.15, in the case of AT (aerial features and text features). This implies the effectiveness of transforming raw location vectors into aerial and textual features. Next, we examine the obtained gain for each type of concept. For the location-dependent concepts, ATD (all the features except location features) achieved the largest value, 6.67, which is a reasonable result. Raw location vectors worked only for the location-dependent concepts. This also implies the need to convert them into aerial or geo-text features. Next, we show the weight of each feature estimated by MKL in the case of fusing all the features in Table 5, which can be regarded as expressing the relative discriminative power of each feature. Regarding the average weight over all the concepts, the weight for photo features was the largest, 0.4717, the second largest was the weight of geo-text features, 0.2915, and the third was that of aerial features, 0.1411. This indicates that geo-text features in general contain more geo-context information helpful for image recognition than aerial features. However, geo-text features have the problem that they depend heavily on the output of the reverse
Table 2. The average precision of geotagged image recognition for 28 categories with eleven kinds of feature combinations Type of concepts Location-dep.
concept
Disneyland Tokyo tower AVG Landmark bridge building castle shrine railroad AVG Geographical beach lake river AVG Space park garden landscape AVG Outdoor bicycle artifact car vending machine statue graffiti AVG Time-dep. costume-play red leaves festival cherry blossom sunset AVG Creature flower bird cat AVG Food ramen noodle sushi AVG Total AVG
BoF HSV Vis 68.37 79.34 73.86 69.78 78.58 80.13 70.32 74.43 74.65 81.02 78.27 74.56 77.95 70.42 77.57 74.23 74.08 76.05 70.80 79.88 66.45 72.59 73.15 75.67 80.77 73.31 80.22 82.90 78.57 77.98 69.75 67.96 71.89 82.59 79.85 81.22 75.49
70.98 80.59 75.79 63.87 74.35 81.21 70.60 72.59 72.52 80.06 76.25 74.22 76.84 69.05 77.61 74.28 73.65 71.85 70.74 82.81 65.94 75.60 73.39 79.42 82.53 74.89 79.70 83.41 79.99 77.70 71.32 67.24 72.09 80.85 79.51 80.18 75.33
73.20 81.54 77.37 69.78 79.92 82.09 71.34 77.71 76.17 81.70 79.41 75.81 78.97 71.24 79.80 75.75 75.60 76.72 75.72 83.26 67.78 76.68 76.03 80.29 83.27 76.90 82.13 83.68 81.25 80.71 73.00 71.72 75.14 83.03 81.76 82.40 77.71
VG
VT
BA
VA
78.41 82.28 80.34 71.95 80.64 82.91 73.11 77.35 77.19 81.99 80.53 76.74 79.75 71.95 81.06 76.73 76.58 77.22 76.18 83.11 68.43 78.27 76.64 81.53 83.51 77.30 82.29 83.71 81.67 81.02 74.02 73.43 76.16 83.09 82.09 82.59 78.60
84.16 83.95 84.05 74.76 81.77 83.72 77.28 79.25 79.36 83.76 82.87 80.61 82.41 78.29 82.39 78.27 79.65 79.76 77.54 83.22 72.81 81.27 78.92 83.95 83.83 80.19 82.91 83.83 82.94 81.53 81.38 74.63 79.18 83.28 83.18 83.23 80.87
84.14 83.67 83.90 74.32 79.80 82.99 75.01 77.03 77.83 83.50 81.42 80.11 81.68 74.99 81.06 75.16 77.07 77.90 74.84 81.59 70.78 79.53 76.93 83.68 82.08 76.20 80.56 82.83 81.07 78.30 78.89 69.62 75.60 82.77 82.09 82.43 79.10
84.14 83.73 83.93 75.43 81.41 83.77 76.72 78.62 79.19 83.54 82.42 81.22 82.39 77.63 82.35 78.12 79.36 78.80 76.54 83.08 72.94 81.27 78.53 84.04 83.82 79.59 82.67 83.83 82.79 81.31 80.89 74.60 78.93 83.18 83.15 83.17 80.67
VAT VGTD VATD 84.17 83.75 83.96 76.44 81.26 83.75 77.15 78.70 79.46 83.61 82.67 81.24 82.51 78.76 82.72 78.37 79.95 79.39 76.72 83.03 73.25 81.53 78.78 84.04 83.77 79.97 82.77 83.78 82.86 81.24 81.21 74.63 79.03 83.11 83.07 83.09 80.86
84.10 83.78 83.94 74.65 81.35 83.74 76.08 78.83 78.93 83.72 82.67 80.06 82.15 77.76 82.20 78.00 79.32 78.87 77.01 83.39 72.34 81.81 78.69 83.73 83.96 79.23 82.80 83.86 82.72 82.23 80.36 74.23 78.94 83.00 83.20 83.10 80.61
84.16 83.91 84.03 74.83 80.90 83.79 77.05 79.04 79.12 83.62 82.78 80.93 82.44 77.53 82.59 78.20 79.44 79.45 77.05 83.10 72.27 81.82 78.74 84.09 83.91 80.08 82.88 83.90 82.97 82.65 81.20 73.93 79.26 83.10 83.28 83.19 80.79
All 84.09 83.78 83.93 75.12 81.04 83.73 76.20 78.98 79.02 83.55 82.39 80.78 82.24 77.33 82.27 78.06 79.22 78.80 76.70 83.28 72.49 81.42 78.54 83.73 83.97 79.58 82.55 83.84 82.73 82.22 80.26 74.14 78.87 82.97 83.19 83.08 80.59
geo-coding service. For areas where geo-texts are poorly available, such as the sea or undeveloped areas, they might be useless. On the other hand, aerial photos are available anywhere on the earth. In this sense, aerial features can be regarded as more solid and stable features than geo-text features. Actually, as shown in Table 3, the gain of aerial features is comparable to the gain of geo-text features. The weights of the location vectors were relatively small, because raw coordinate features are not effective at describing the location context of a place. This shows that the transformation of geotag vectors into aerial features or geo-text features is a very effective way to utilize geo-context information about a place. The weights of the date features are also small, but larger than those of the raw geo-location features. Regarding individual concepts, most concepts put the largest weight on photo features, while eight concepts put the largest weight on geo-text features. For “beach”, the weight of aerial features was the largest, which is a reasonable result, because “beach” is easy to recognize in aerial photos as the border between land and the sea. For “statue”, the weight of the raw location
Table 3. The obtained gains, calculated as the difference to the baseline (Vis) in terms of MAP

combination  calculation            overview
Geo          (Vis+G) − (Vis)        gain by raw location vectors
Text         (Vis+T) − (Vis)        gain by geo-text features
Air          (Vis+A) − (Vis)        gain by aerial features
AT           (Vis+A+T) − (Vis)      gain by aerial features and geo-text features
GTD          (Vis+G+T+D) − (Vis)    gain by all except aerial features
ATD          (Vis+A+T+D) − (Vis)    gain by all except raw location features
All          (All) − (Vis)          gain by all the features

genre           Geo    Text   Air    AT     GTD    ATD    All
Location-dep.   2.97   6.69   6.56   6.60   6.57   6.67   6.57
Artifact        0.61   2.89   2.49   2.75   2.65   2.71   2.51
Landmark        1.03   3.19   3.02   3.30   2.76   2.95   2.85
Space           0.98   4.05   3.77   4.35   3.72   3.84   3.62
Time-dep.       0.41   1.69   1.53   1.61   1.46   1.72   1.48
Food            0.19   0.84   0.77   0.69   0.71   0.80   0.68
Geographical    0.78   3.44   3.42   3.53   3.18   3.47   3.27
Creature        1.02   4.04   3.79   3.88   3.80   4.11   3.73
AVG             0.89   3.16   2.96   3.15   2.89   3.08   2.88
Table 4. Part of the averaged geo-text features obtained from the Yahoo! Japan Local Search API. We show the top ten words in descending order of the value of the corresponding bins, for the concepts whose geo-text kernel weights are relatively large and those whose weights are relatively small.
vectors is the largest, since the locations of famous statues are limited and can be covered by the raw location vectors in the training data. For “cherry blossom”, the weight of the date features is the largest, since cherry blossoms bloom only in spring, from late March to late April. In addition, we show part of the geo-text features averaged over each concept in Table 4. In this table, capitalized words represent place names. Note that this table includes many place names in Japan, because we limited the geotagged images to those taken within Japan when fetching images via the Flickr API. Geo-text features were effective for the concepts whose representative locations are fixed, while they are not effective for concepts that exist anywhere, such as “flower” and “vending machine”. From the above observations, we conclude that geo-text features are the most effective and aerial features the second most, while raw location features and date features are not so helpful.
Table 5. Feature weights estimated by MKL in case of fusing all the features. The number written in the red ink is the largest weight among each concept. category Locationdependent
aerial
location texts
time
bridge shrine building castle railroad AVG
0.3843(0.2690,0.1153) 0.4504(0.0999,0.3505) 0.6147(0.3897,0.2251) 0.2441(0.0654,0.1787) 0.6918(0.3016,0.3903) 0.4771(0.2251,0.2519)
0.3408(0.0529,0.0557,0.0295,0.2028) 0.0690(0.0012,0.0020,0.0001,0.0657) 0.1207(0.0500,0.0425,0.0004,0.0278) 0.1094(0.0877,0.0041,0.0124,0.0053) 0.1004(0.0212,0.0151,0.0004,0.0637) 0.1480(0.0426,0.0239,0.0085,0.0731)
0.0089 0.0043 0.0455 0.0521 0.0001 0.0222
0.2289 0.4307 0.2123 0.5941 0.1869 0.3306
0.0371 0.0456 0.0068 0.0002 0.0208 0.0221
lake river beach AVG
0.3539(0.2184,0.1355) 0.3465(0.2492,0.0973) 0.1888(0.1816,0.0073) 0.2964(0.2164,0.0800)
0.2694(0.0749,0.0877,0.0150,0.0918) 0.3181(0.0063,0.0523,0.0614,0.1981) 0.5003(0.0108,0.1632,0.0013,0.3250) 0.3626(0.0307,0.1011,0.0259,0.2049)
0.0040 0.0108 0.0015 0.0054
0.3435 0.2447 0.2934 0.2938
0.0292 0.0800 0.0159 0.0417
park garden landscape AVG
0.3971(0.1688,0.2283) 0.4740(0.1338,0.3402) 0.5634(0.2895,0.2739) 0.4782(0.1974,0.2808)
0.0881(0.0034,0.0101,0.0205,0.0541) 0.1227(0.0080,0.0033,0.0170,0.0944) 0.0452(0.0000,0.0001,0.0078,0.0374) 0.0853(0.0038,0.0045,0.0151,0.0620)
0.0169 0.0269 0.0043 0.0161
0.4178 0.3512 0.3587 0.3759
0.0800 0.0252 0.0283 0.0445
Outdoor artifact
statue car bicycle graffiti vending machine AVG
0.2223(0.2156,0.0067) 0.6772(0.3290,0.3481) 0.5520(0.4410,0.1111) 0.2304(0.0682,0.1621) 0.8755(0.2851,0.5904) 0.5115(0.2678,0.2437)
0.1381(0.0587,0.0041,0.0464,0.0288) 0.0285(0.0065,0.0000,0.0001,0.0219) 0.0642(0.0076,0.0143,0.0003,0.0420) 0.2846(0.0141,0.1064,0.1043,0.0598) 0.0533(0.0064,0.0127,0.0094,0.0248) 0.1137(0.0187,0.0275,0.0321,0.0355)
0.4492 0.0146 0.0098 0.0493 0.0032 0.1052
0.0336 0.2678 0.3391 0.3728 0.0085 0.2044
0.1568 0.0119 0.0349 0.0630 0.0594 0.0652
Timedependent
red leaves cherry blossom sunset costume-play festival AVG
0.6158(0.2298,0.3861) 0.4437(0.3291,0.1147) 0.8096(0.4022,0.4074) 0.0485(0.0352,0.0133) 0.5667(0.2347,0.3320) 0.4969(0.2462,0.2507)
0.1490(0.0836,0.0297,0.0037,0.0318) 0.0402(0.0002,0.0019,0.0381,0.0001) 0.0635(0.0001,0.0004,0.0000,0.0631) 0.0036(0.0001,0.0001,0.0001,0.0033) 0.1248(0.0012,0.0004,0.0005,0.1226) 0.0762(0.0170,0.0065,0.0085,0.0442)
0.0066 0.0016 0.0000 0.0126 0.0015 0.0045
0.0025 0.0093 0.0002 0.9291 0.2154 0.2313
0.2261 0.5051 0.1267 0.0062 0.0916 0.1911
cat bird flower AVG
0.6929(0.2690,0.4239) 0.1869(0.0888,0.0980) 0.8196(0.3356,0.4840) 0.5665(0.2312,0.3353)
0.0718(0.0002,0.0032,0.0007,0.0677) 0.0296(0.0004,0.0039,0.0002,0.0252) 0.0063(0.0001,0.0007,0.0015,0.0041) 0.0359(0.0002,0.0026,0.0008,0.0323)
0.0100 0.0000 0.0179 0.0093
0.2217 0.7126 0.0269 0.3204
0.0036 0.0708 0.1293 0.0679
Landmark
Geographical
Space
Creature
Food
ramen noodle 0.9317(0.4588,0.4729) 0.0563(0.0000,0.0000,0.0092,0.0470) 0.0018 0.0005 0.0097 sushi 0.6789(0.2435,0.4354) 0.0021(0.0001,0.0005,0.0000,0.0016) 0.0043 0.2616 0.0532 AVG 0.8053(0.3512,0.4541) 0.0292(0.0001,0.0002,0.0046,0.0243) 0.0030 0.1310 0.0314 Total AVG
photo(BoF+HSV)
Disneyland 0.0001(0.0001,0.0001) 0.3288(0.1460,0.1827,0.0001,0.0001) 0.0004 0.6706 0.0000 Tokyo tower 0.1479(0.1474,0.0005) 0.4230(0.0000,0.0011,0.4217,0.0001) 0.0007 0.4280 0.0003 AVG 0.0740(0.0737,0.0003) 0.3759(0.0730,0.0919,0.2109,0.0001) 0.0006 0.5493 0.0002
0.4717(0.2314,0.2403) 0.1411(0.0229,0.0285,0.0286,0.0611) 0.0271 0.2915 0.0685
6 Conclusion
In this paper, we proposed introducing Multiple Kernel Learning (MKL) into geotagged image recognition to estimate the contribution weights of the visual features of photo images and of the geo-information features. In the experiments, we worked with twenty-eight concepts selected from eight different types of concepts. The experimental results showed that using geo-textual features and aerial features is very helpful for most of the concepts except the “food” concepts. As a result, the mean average precision (MAP) for 28 categories increased up to 80.87% with the proposed method, compared with 77.71% for the baseline. For future work, we plan to carry out larger-scale experiments with many more categories. We also plan to carry out multi-class classification experiments. Acknowledgement. Keita Yaegashi is currently working for Rakuten Institute of Technology.
References
1. Luo, J., Yu, J., Joshi, D., Hao, W.: Event recognition: Viewing the world with a third eye. In: Proc. of ACM International Conference on Multimedia (2008)
2. Joshi, D., Luo, J.: Inferring generic activities and events from image content and bags of geo-tags. In: Proc. of ACM International Conference on Image and Video Retrieval (2008)
3. Yaegashi, K., Yanai, K.: Can geotags help image recognition? In: Proc. of Pacific-Rim Symposium on Image and Video Technology (2009)
4. Yaegashi, K., Yanai, K.: Geotagged photo recognition using corresponding aerial photos with multiple kernel learning. In: Proc. of IAPR International Conference on Pattern Recognition (2010)
5. Qi, G.J., Hua, X.S., Rui, Y., Tang, J., Mei, T., Zhang, H.: Correlative multi-label video annotation. In: Proc. of ACM International Conference on Multimedia, pp. 17–26 (2007)
6. Hays, J., Efros, A.A.: IM2GPS: Estimating geographic information from a single image. In: Proc. of IEEE Computer Vision and Pattern Recognition (2008)
7. Kalogerakis, E., Vesselova, O., Hays, J., Efros, A., Hertzmann, A.: Image sequence geolocation with human travel priors. In: Proc. of IEEE International Conference on Computer Vision (2010)
8. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: Proc. of ECCV Workshop on Statistical Learning in Computer Vision, pp. 59–74 (2004)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
10. Varma, M., Ray, D.: Learning the discriminative power-invariance trade-off. In: Proc. of IEEE International Conference on Computer Vision, pp. 1150–1157 (2007)
11. Gehler, P., Nowozin, S.: On feature combination for multiclass object classification. In: Proc. of IEEE International Conference on Computer Vision (2009)
12. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73, 213–238 (2007)
13. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large Scale Multiple Kernel Learning. The Journal of Machine Learning Research 7, 1531–1565 (2006)
Modeling Urban Scenes in the Spatial-Temporal Space Jiong Xu1 , Qing Wang1 , and Jie Yang2 1
School of Computer Science and Engineering Northwestern Polytechnical University, Xi’an 710072, P.R. China 2 School of Computer Science Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract. This paper presents a technique to simultaneously model 3D urban scenes in the spatial-temporal space using a collection of photos that span many years. We propose to use a middle-level representation, the building, to characterize significant structure changes in the scene. We first use structure-from-motion techniques to build 3D point clouds, which are a mixture of scenes from different periods of time. We then segment the point clouds into independent buildings using a hierarchical method, consisting of coarse clustering on sparse points and fine classification on dense points, based on the spatial distance of the point clouds and the difference of their visibility vectors. In the fine classification, we segment building candidates using a probabilistic model in the spatial-temporal space simultaneously. We employ a z-buffering based method to infer the existence of each building in each image. After recovering the temporal order of the input images, we finally obtain 3D models of these buildings along the time axis. We present experiments using both toy building images captured in our lab and real urban scene images to demonstrate the feasibility of the proposed approach.
1 Introduction
Recent advances in computer vision and photogrammetry research have made it possible to reconstruct large-scale urban scenes from images and videos. Real-world applications such as Google Earth and Microsoft Virtual Earth use aerial and satellite images to obtain large-scale 3D models. The Building Rome in a Day project [1] demonstrated the reconstruction of 3D scenes of a given city from large collections of photographs gathered from Internet photo sharing sites. In this research, we are interested in modeling urban scenes in the spatial-temporal space using a collection of photos that span many years. The models can not only reconstruct the 3D structures of urban scenes but also reflect the structural changes in these urban scenes over time. The developed models can provide a tool for many applications such as virtual city tourism and can help with city planning. Large-scale 3D reconstruction from images has been explored recently in the computer vision literature [1, 2, 9, 10, 16]. The availability of billions of photographs on the Internet has made this work possible. These images might differ
in many aspects, such as viewing positions and angles, lighting conditions (different times during the day or night), and background (changes in season, weather, and environment). However, the majority of work so far has focused only on still scenes and ignored scene changes over time. Some urban structures may change significantly over time because of reconstruction or new construction. Schindler et al. attempted to temporally sort a collection of photos that span many years [11]. They proposed an approach to infer the temporal order of images from 3D structure by solving a constraint satisfaction problem (CSP). They validated the algorithm on two small sets of images with manually selected and matched features. However, the method might fail for large-scale problems when the number of images becomes huge. In fact, if the number of 3D points and the number of images become large, the visibility matrix becomes huge, which can lead to difficulties in its computation or to failure of the CSP. Another problem is that visibility reasoning on 3D points is unstructured and does not consider semantics. More recently, in [12], Schindler and Dellaert proposed a probabilistic temporal inference framework for the recovery of a time-varying structure from images. They reconstructed 3D scenes that were composed of sparse points obtained by SfM. The obtained models, however, may lose some details of the reconstructed scenes, leading to inaccuracy in visibility reasoning, especially when the shape of the buildings is irregular. In this research, we attempt to simultaneously model urban scenes in the spatial-temporal space. In order to handle a large number of 3D points for real-world applications, we use a divide-and-conquer strategy by adding a middle-level representation, the building, to characterize significant structure changes in the scene. That is, we consider 3D points to belong to different buildings if the 3D structures of the scene have changed significantly. The problem then becomes two sub-problems: the spatial-temporal segmentation of 3D points, and building-based temporal order inference. In the first sub-problem, we have to address the challenge that the 3D point clouds reconstructed by SfM are a mixture of scenes from different periods of time. It is difficult to segment the different buildings in the same place using a traditional method. Most previous work on 3D segmentation was performed on range data [4, 6] and is usually based on local geometric characterizations. In recent years, some researchers have utilized 3D information for image segmentation [13, 15]. In this research, we use both 3D and 2D information for building segmentation. In the second sub-problem, we need to address the challenge of reasoning about the existence of buildings, i.e., determining which buildings exist in each image. For a single 3D point, it is not difficult to classify its status as visible, missing, out of view or occluded. However, for a building, it is difficult to distinguish between the missing and occluded status. In this paper, we present an algorithm that automatically models urban scenes in the spatial-temporal space from historical images. Our main contribution is to use a middle-level representation to describe significant structure changes in the scene. Using this representation, we can greatly reduce the computational
complexity of the visibility matrix when inferring the temporal order of images. We first infer the positions of the cameras and the scene structure in both space and time. Traditional SfM techniques are employed to build an initial coarse 3D model consisting of unordered points, without considering temporal information. After removing noise from the data, we perform segmentation on the dense 3D points with additional semantic information from the 2D images to extract independent buildings. We segment buildings hierarchically using coarse clustering and fine classification. First, we employ region growing to cluster the 3D points into independent building candidates based on the spatial distance and the visibility of the 3D points. Then, based on the differences between the visibility vectors of points and of building candidates in the images, we refine the extraction of independent buildings from the dense points with a probabilistic model, which combines the joint probability of a point belonging to different buildings at different periods of time. For each image, we use a z-buffering based approach to reason about the existence of the buildings segmented in the previous step. Using the visibility matrix of buildings with respect to images, the problem is formulated as a constraint satisfaction problem to recover the temporal order of the images and the 3D models. The rest of the paper is organized as follows. Section 2 briefly provides an overview of our approach. Section 3 describes the process of obtaining the middle-level representation. Section 4 introduces visibility reasoning for buildings. Section 5 discusses recovering the temporal ordering of the images and generating time-varying 3D models. Experimental results are presented in Section 6. Finally, we draw conclusions in Section 7.
2 The Approach
Our objective is to geometrically register large photo collections and to infer the temporal ordering of these photos for a number of applications in visualization, localization, image browsing, and other areas. In a large-scale scene reconstruction application, if the scene is static, we can directly use an SfM method that deals with unordered input images. If the image collection contains photos with structural changes over time, the result is a mixture of scenes with different structures in the same location, which can be regarded as a loss of the temporal information. The primary technical challenge is to automatically infer the temporal ordering of images from a large collection with large variations in viewpoint, illumination, etc. This section provides an overview of our approach. Fig. 1 depicts an overview of our proposed approach. The left of Fig. 1 shows the input, a historical photo collection of the Bell Tower in the downtown of Xi’an City in China taken in the 1980s, 1990s, and 2000s, and the right shows the output, 3D models with the recovered time order (highlighted by the yellow ellipse). It can be observed that the structure of the building at the southeastern corner of the Bell Tower changed significantly over this thirty-year period. In the middle of Fig. 1, the process flow shows the detailed steps of 3D modeling in the spatial-temporal space. From top to bottom, sparse scene geometries and dense 3D point clouds are first reconstructed by SfM automatically (in [11],
Fig. 1. An overview of the time-varying 3D modeling in the spatial-temporal space
a small set of local feature points is selected and matched manually). We then segment the mixed points into independent buildings along the time axis in the spatial-temporal space. We employ a z-buffering based method to infer the existence of each building in each image. After recovering the temporal order of the input images, we finally reconstruct 3D models of these buildings. The steps highlighted in red in Fig. 1 are building segmentation and visibility reasoning, which will be discussed in detail in the next two sections.
3 Building Segmentation
Unlike some natural phenomena such as water and wind, which are stochastically stationary and can be simulated with physical models [3], or animal motion, which is highly regular and repetitive so that a function can be introduced to describe its motion cycle [17], scene changes in a city are more irregular and unpredictable. A city can retain its appearance for years, but can also change rapidly, especially in developing areas or cities. Our goal is to recover the 3D structures and the temporal ordering of a set of n unordered images simultaneously. A feasible solution is to infer the different sets of 3D structures existing in the same place in different periods of time. In order to characterize different structures over time, we introduce a middle-level representation, the building. Our problem then becomes (1) to segment the dense 3D points into independent buildings, and (2) to infer the visibility of each building in each image. By introducing this middle-level representation, we can greatly reduce the computational complexity of inferring the temporal ordering of the unordered images. Given a set of n unordered images of urban scenes I_1, I_2, ..., I_n, we first use the Bundler software¹ to generate the sparse scene geometry [14]: M_1 sparse 3D points X_1^s, X_2^s, ..., X_{M_1}^s and projection matrices P_1, P_2, ..., P_n for each of the n cameras
¹ http://phototour.cs.washington.edu/bundler/
C_1, C_2, ..., C_n corresponding to the images I_1, I_2, ..., I_n. Then we use the software PMVS² (Patch-based Multi-view Stereo) on this output to generate dense point clouds [7]. After that, we obtain M_2 dense 3D points X_1^d, X_2^d, ..., X_{M_2}^d. Both sparse and dense 3D points are denoted by X. In the initial coarse 3D model, there exist many noisy points that need to be removed. In our current work, we apply a point clustering approach to remove as much noise as possible. However, even with clean data, we may not be able to partition the 3D points into independent buildings, because these 3D points may come from different buildings at different periods of time. Fortunately, the input images hold additional valuable geometric information about the scene. Assuming two buildings B_1 and B_2 exist in the same place, it is impossible for them to appear in the same image, since they belong to different periods of time. We propose to encode the visibility of building B_i in all images in a row vector V_{B_i}, where V_{B_i k} = 1 if B_i is visible in I_k and 0 otherwise. The above observation can then be written as V_{B_i} ∧ V_{B_j} = 0 if B_i and B_j overlap in space. In the SfM process, a sparse point-image incidence matrix, indicating which points appear in which images, is produced. The visibility vector of a 3D point X_i is represented by the corresponding row vector V_i in this matrix, where V_{ik} = 1 if X_i is visible in I_k and 0 otherwise. Thus we can determine whether two points belong to the same building using both their spatial distance and the difference of their visibility vectors, and segment overlapping buildings (mixtures of 3D points) into independent ones in the spatial-temporal space. Our building segmentation is carried out hierarchically in two steps: coarse clustering on the sparse points and fine classification on the dense points. In the coarse clustering phase, we apply region growing on the sparse 3D points to obtain independent building candidates. First, a point is selected randomly as a seed. Then, in the region growing procedure, we define a distance threshold T^d_{space}, e.g., 0.8 in this work. If the distance between two points is smaller than T^d_{space}, we consider them adjacent in space, but we cannot yet confirm that they belong to the same building. We need to further check whether the two points are visible in at least one image simultaneously. If they are visible together in one image, we can infer that they are from the same building. Otherwise, they should belong to different buildings in the same place but in different periods of time. Therefore, the region membership criterion for growing is defined by the number of images in which two points are both visible. That is, we keep examining the points adjacent to the seed points; if they can be observed simultaneously in at least one image, we regard them as seed points as well. The process terminates when an iteration produces no change. Finally, an independent building candidate is extracted from the sparse points. We repeat the procedure until all points have been labeled. A post-filtering procedure then follows to remove some small regions as noise. After clustering on the sparse points, we need to carry out fine classification on the dense points for building segmentation. Assuming that the sparse points are
http://grail.cs.washington.edu/software/pmvs/
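To make the coarse clustering step concrete, the following is a minimal sketch of visibility-constrained region growing on the sparse points. The array names (`points`, `vis` for the point-image incidence matrix), the KD-tree neighborhood search and the minimum region size are assumptions for illustration, not part of the original implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def coarse_cluster(points, vis, t_space=0.8, min_region=50):
    """Region growing on sparse 3D points.

    points : (M1, 3) array of sparse 3D points
    vis    : (M1, N) boolean point-image incidence matrix
    Two points are merged if they are closer than t_space AND
    co-visible in at least one image.
    """
    tree = cKDTree(points)
    labels = -np.ones(len(points), dtype=int)   # -1 = unlabeled
    next_label = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        seeds = [start]
        labels[start] = next_label
        while seeds:
            s = seeds.pop()
            for nb in tree.query_ball_point(points[s], t_space):
                if labels[nb] != -1:
                    continue
                # co-visible in at least one image -> same building
                if np.any(vis[s] & vis[nb]):
                    labels[nb] = next_label
                    seeds.append(nb)
        next_label += 1
    # post-filtering: discard very small regions as noise
    for lab in range(next_label):
        if np.count_nonzero(labels == lab) < min_region:
            labels[labels == lab] = -1
    return labels
```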
Assuming that the sparse points are segmented into p buildings B1, B2, ..., Bp, we next classify the dense 3D points into these p buildings. This task now becomes an issue of point classification. After Bundler, the most significant part of a building, usually the center part, is reconstructed, since more images focus on this part when photographers take photos. Because other parts of the building may be out of view or occluded, they may not be reconstructed as well as the center part. Thus, we model the spatial distribution of the 3D sparse points of each building as a Gaussian and use maximum likelihood estimation (MLE) to compute the mean and covariance of building Bj from its 3D sparse points,

$$\hat{\mu}_j = \frac{1}{m_j}\sum_{X_i \in B_j} X_i, \qquad \hat{\Sigma}_j = \frac{1}{m_j}\sum_{X_i \in B_j} (X_i - \hat{\mu}_j)(X_i - \hat{\mu}_j)^T \qquad (1)$$

where m_j is the number of sparse points X_i belonging to building B_j. We model the probability of a dense 3D point X_i belonging to building B_j in space as

$$p_{space}(X_i \mid X_i \in B_j) = \frac{1}{(2\pi)^{3/2}\,|\hat{\Sigma}_j|^{1/2}} \exp\!\left[-\frac{1}{2}(X_i - \hat{\mu}_j)^T \hat{\Sigma}_j^{-1}(X_i - \hat{\mu}_j)\right] \qquad (2)$$
As mentioned before, points belonging to different buildings may be very close in space. Therefore, we impose a temporal probability to strengthen the constraint for building segmentation,

$$p_{time}(X_i \mid X_i \in B_j) = \frac{\sum_{k=1}^{n} V_i^k \wedge V_{B_j}^k}{\sum_{k=1}^{n} V_{B_j}^k} \qquad (3)$$

where V_i^k and V_{B_j}^k are the visibility vector elements of point X_i and building B_j in image I_k, respectively. V_{B_j} is obtained by performing an OR operation over the visibility vectors of all sparse points partitioned into B_j, that is,

$$\vec{V}_{B_j} = \vec{V}_{i_1} \vee \vec{V}_{i_2} \vee \cdots \vee \vec{V}_{i_{m_j}}, \qquad X_{i_1}, X_{i_2}, \ldots, X_{i_{m_j}} \in B_j \qquad (4)$$
The final probability of dense 3D point X_i belonging to building B_j is given by

$$p(X_i \mid X_i \in B_j) = p_{space}(X_i \mid X_i \in B_j)\; p_{time}(X_i \mid X_i \in B_j) \qquad (5)$$

For each dense point, we compute its probability of belonging to each building and choose the index of maximum probability as its category, that is,

$$j^{*} = \operatorname*{argmax}_{j}\; p(X_i \mid X_i \in B_j) \qquad (6)$$
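As a concrete illustration of Eqs. (1)-(6), the sketch below fits a Gaussian to each building's sparse points and assigns every dense point by the combined spatial-temporal probability. The variable names and the use of log-probabilities plus small regularizers for numerical stability are assumptions added here, not details from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_dense_points(dense_pts, dense_vis, sparse_pts, sparse_vis, labels):
    """Assign each dense 3D point to one of the p buildings (Eqs. 1-6).

    dense_pts  : (M2, 3) dense points,  dense_vis  : (M2, N) boolean incidence matrix
    sparse_pts : (M1, 3) sparse points, sparse_vis : (M1, N) boolean incidence matrix
    labels     : (M1,) building label of each sparse point (0..p-1)
    """
    p = labels.max() + 1
    scores = np.full((len(dense_pts), p), -np.inf)
    for j in range(p):
        members = sparse_pts[labels == j]
        mu = members.mean(axis=0)                              # Eq. (1), mean
        cov = np.cov(members.T, bias=True) + 1e-6 * np.eye(3)  # Eq. (1), covariance
        log_p_space = multivariate_normal.logpdf(dense_pts, mu, cov)   # Eq. (2)
        # Eq. (4): building visibility = OR over member visibility vectors
        vb = np.any(sparse_vis[labels == j], axis=0)
        # Eq. (3): fraction of the building's images in which the point is seen
        p_time = (dense_vis & vb).sum(axis=1) / max(vb.sum(), 1)
        scores[:, j] = log_p_space + np.log(p_time + 1e-12)    # Eq. (5), log domain
    return scores.argmax(axis=1)                               # Eq. (6)
```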
Fig. 2 shows two examples of building segmentation on toy building data. The left of Fig. 2(a) and (b) shows the mixture of sparse (top) and dense (bottom) 3D points from different buildings in the same place at different periods of time.
Fig. 2. Examples of building segmentation on toy building data. Left: sparse and dense 3D points of two buildings which overlap with each other in the same space. Top Right: two images from the image set. Middle and Bottom Right: building candidates and final results of independent buildings by the spatial and temporal segmentation.
The top-right of Fig. 2(a) and (b) shows two samples from the image collection, which correspond to the mixture of 3D points. Using the spatial and temporal segmentation, independent buildings are separated from each other, as shown in the middle and bottom right of Fig. 2(a) and (b). One may question why we do not segment buildings on the dense points directly. The reason is that many visibility vectors of dense points are far from the correct values after PMVS, which results in points that have no temporal intersection nevertheless appearing visible in the same image. Therefore, it is difficult to segment overlapping buildings directly on the dense points. Our hierarchical method instead provides a stronger constraint to guide the classification of dense 3D points: even if there are many errors in the original dense point-image incidence matrix, the probabilistic model can still obtain robust results. At the same time, the hierarchical method is also efficient for building segmentation. In general, the number of dense points is considerably larger than that of sparse points, so performing the region growing step only on the sparse points vastly reduces the computational time.
4 Visibility Reasoning of Buildings
After we obtain independent buildings from the mixed 3D points, we need to reason about the existence of each building in the input images. Knowing the visibility of each building in each image is not sufficient to infer the temporal ordering of the images; we must further determine which buildings exist in each image. If a building is visible in an image, it must have existed at the time that image was taken, so we only need to consider the invisible buildings in each image. Pani and Bhattacharjee reviewed the important representational issues of temporal reasoning systems [8] and analyzed the most influential approaches to temporal reasoning in artificial intelligence. In this work, we adopt a method similar to that of Schindler et al. [11] to classify each invisible
building in each image as missing, out of view, or occluded. Missing means the building did not exist at the time the image was captured. If the projection of a building falls outside the field of view of the camera (as defined by the width and height of the corresponding image), the building is classified as out of view for that image. The projection b_kj of building B_k in image I_j is the set of projections of all points belonging to B_k, i.e., b_kj = {x_ij | X_i ∈ B_k}, x_ij = P_j X_i. If the projection of a building is within the field of view of the camera but it does not appear in the expected place because something blocks the view of it, the building is called occluded. Although it is not hard to determine whether a building is out of view, it is difficult to distinguish between missing and occluded buildings. For each image I_j, we first project all the buildings visible in I_j onto the image plane. Because a building consists of 3D points, the projection of a building is only a sparse sampling of the building's surface, so the 3D points alone may not suffice to determine whether a building is occluded by other buildings. A common way is to employ Delaunay triangulation to generate meshes [5]. In this paper, an image-space approach is used instead. We compute the convex hull of the projection of each building and render a binary image mask_i for each camera C_i as a masking image; the interior of each convex hull is white and everything else is black. The convex hull region in each masking image is called an occluder. A remaining issue is the unknown depth of the occluders: if the projection of a point falls within an occluder's region, we cannot tell whether the point is in front of the occluder or behind it, so we still cannot decide whether an object is occluded. Fortunately, we have the camera projection matrices as well as the 3D points after bundle adjustment. With the camera parameters, for each image we transform world coordinates to camera coordinates to compute the depth of each 3D point. For each camera, each building contributes an occluder if it exists in the scene at that time. However, apart from the projections of the 3D points, we do not know the depths of the other pixels in the interior of the convex hull. To estimate the depth value, we use the average depth dp_kj of all 3D points belonging to building B_k as the depth of occluder O_kj in image I_j, i.e., dp_kj = (1/m_k) Σ_{X_i ∈ B_k} depth(X_i), where m_k is the number of 3D points in building B_k. That is, the depths of all pixels in the region of O_kj are set to dp_kj. We save the depth of every pixel of a masking image to generate a depth map, where the depth of background pixels is set to infinity. Since we assume that no two buildings can exist in the same place at the same time, the occluders do not intersect or overlap with each other in the spatial domain, which means the occluders have different depths in the camera coordinate system. If the projections of two buildings at different depths overlap on the image plane, we take the smaller occluder depth as the final depth in the intersection region; elsewhere, each occluder keeps its original depth.
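The following sketch illustrates how such per-camera occluder masks and depth maps might be built from the building projections; the function name, the use of OpenCV's convex hull and polygon fill, and the treatment of the homogeneous scale as camera depth are assumptions made for illustration.

```python
import numpy as np
import cv2

def occluder_depth_map(buildings_3d, P, width, height):
    """Build the masking image and depth map of one camera.

    buildings_3d : dict {building_id: (m_k, 3) array of points of a building
                   visible in this image}
    P            : 3x4 projection matrix of the camera
    Returns a float depth map (background = +inf) and an int occluder label map.
    """
    depth_map = np.full((height, width), np.inf, dtype=np.float64)
    label_map = np.full((height, width), -1, dtype=np.int32)
    for bid, pts in buildings_3d.items():
        X_h = np.hstack([pts, np.ones((len(pts), 1))])       # homogeneous coords
        proj = (P @ X_h.T).T
        # assumes K has last row [0, 0, 1], so w equals camera-space depth
        depth = proj[:, 2]
        xy = proj[:, :2] / proj[:, 2:3]
        dp = depth.mean()                                     # average occluder depth
        hull = cv2.convexHull(xy.astype(np.float32)).astype(np.int32)
        mask = np.zeros((height, width), dtype=np.uint8)
        cv2.fillConvexPoly(mask, hull, 1)                     # occluder region
        # keep the closer occluder where regions overlap
        closer = (mask == 1) & (dp < depth_map)
        depth_map[closer] = dp
        label_map[closer] = bid
    return depth_map, label_map
```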
We now consider the problem of visibility, which is fundamental in computer graphics and virtual reality applications. A popular method for addressing visibility is the z-buffering approach, which decides which elements of a rendered scene are visible and which are hidden. When an object is rendered, the depth of each generated pixel is stored in a buffer. If another object of the scene must be rendered at the same pixel, the two depths are compared and the one closer to the observer is chosen; the chosen depth is then saved to the z-buffer, replacing the old one. Finally, all the inferred information is organized into a detailed visibility matrix (DVM), in a similar way to [11]. It indicates the status of building B_i in image I_j, which is

$$v_{ij} = \begin{cases} +1 & \text{if } B_i \text{ is observed} \\ -1 & \text{if } B_i \text{ is missing} \\ \;\;0 & \text{if } B_i \text{ is out of view or occluded in } I_j \end{cases} \qquad (7)$$
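A minimal sketch of how each building might be labeled for the DVM, assuming the depth map produced by the previous sketch is available; the 0.5 occlusion-ratio threshold and the function names are illustrative assumptions.

```python
import numpy as np

def dvm_entry(building_pts, P, width, height, depth_map, observed):
    """Return +1 (observed), -1 (missing) or 0 (out of view / occluded)
    for one building in one image, following Eq. (7)."""
    if observed:                          # building detected in this image
        return +1
    X_h = np.hstack([building_pts, np.ones((len(building_pts), 1))])
    proj = (P @ X_h.T).T
    depth = proj[:, 2]
    xy = (proj[:, :2] / proj[:, 2:3]).round().astype(int)
    inside = (xy[:, 0] >= 0) & (xy[:, 0] < width) & \
             (xy[:, 1] >= 0) & (xy[:, 1] < height) & (depth > 0)
    if not inside.any():                  # projection falls outside the image
        return 0                          # out of view
    u, v = xy[inside, 0], xy[inside, 1]
    # occluded if most projected points lie behind a closer occluder
    blocked = depth_map[v, u] < depth[inside]
    if blocked.mean() > 0.5:
        return 0                          # occluded
    return -1                             # unobstructed yet unseen: missing
```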
5 Temporal Order Recovering
After we have obtained the low-dimensional visibility matrix of buildings in images, the temporal ordering problem can be treated as a constraint satisfaction problem [11]. We formalize the constraint on the DVM as follows: on any given row of the DVM, we should not see a value of -1 between any two +1 values. That is, a building should never appear, then disappear, then reappear again, except in the case of occlusion or being outside the field of view of the camera. We use a local search algorithm to rearrange the columns of the DVM so that this constraint is not violated. First, we compute the number of constraints violated by the initial arrangement of columns, denoted N^(0). Then we evaluate all local moves; in this application, a local move is a swap of the positions of two images, i.e., a corresponding rearrangement of two columns of the matrix. We choose the move whose ordering violates the fewest constraints among all candidate local moves, and denote the resulting number of violated constraints N^(i). If N^(i) is smaller than N^(i-1), we repeat this step until an ordering of the columns is found that violates no constraints. If no move decreases the number of violated constraints, we re-initialize the search with a random ordering. As mentioned in Section 1, our objective is to model urban scenes from large photo collections and to infer the temporal ordering of these 3D models, which we refer to as a 4D model. After we obtain the temporal ordering of the photos, we can place all 3D models in the spatial-temporal space. We define the 4D model M^4D as a set of 3D models ordered in time. Each 3D model M^3D is represented as a set of buildings {B_k | k ∈ 1, 2, ..., N} that existed in the same period of time. The boundary between two 3D models is the time when some discrete change happens to the buildings, such as a building appearing or disappearing. In order to build a convincing 4D model of a city, we need to segment the images with valid temporal order into several periods of time, somewhat like temporal segmentation cutting a video sequence into "scenes" or "shots".
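The local search over column orderings might be sketched as follows; the NumPy representation of the DVM, the restart cap and the per-row way of counting violations are assumptions made for illustration.

```python
import numpy as np

def violations(dvm):
    """Count -1 entries that occur between two +1 entries in any row."""
    count = 0
    for row in dvm:
        seen = np.where(row == +1)[0]
        if len(seen) >= 2:
            inner = row[seen[0]:seen[-1] + 1]
            count += int(np.sum(inner == -1))
    return count

def order_columns(dvm, max_restarts=20, seed=0):
    """Local search: swap image columns until no ordering constraint is violated."""
    rng = np.random.default_rng(seed)
    n_cols = dvm.shape[1]
    order = np.arange(n_cols)
    for _ in range(max_restarts):
        best = violations(dvm[:, order])
        improved = True
        while improved and best > 0:
            improved = False
            for a in range(n_cols):
                for b in range(a + 1, n_cols):
                    trial = order.copy()
                    trial[[a, b]] = trial[[b, a]]     # local move: swap two images
                    v = violations(dvm[:, trial])
                    if v < best:
                        best, order, improved = v, trial, True
        if best == 0:
            return order
        order = rng.permutation(n_cols)               # random restart
    return order
```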
We first generate a new class T1 and put the first column of the DVM into it. From the second column onward, we decide for each column whether it is consistent with the previous class. The consistency between column C_i and class T_j is defined as:

$$\text{if } C_i^j = +1 \;\Rightarrow\; \nexists\, C_k \in T_j : C_k^j = -1 \qquad (8)$$
$$\text{if } C_i^j = -1 \;\Rightarrow\; \nexists\, C_k \in T_j : C_k^j = +1 \qquad (9)$$

If column C_i is consistent with class T_j, we put C_i into T_j; otherwise we generate a new class T_{j+1} and put C_i into it. When every column has its label, the temporal segmentation of the images is done, and the period of existence of each building is also obtained. Each class T_j represents one period of time during which there is no discrete change in the buildings. The buildings existing in that specific era, i.e., those visible in the images of class T_j, constitute a 3D model M_j^3D. All the 3D models together constitute the final 4D model, represented as M^4D = {M_1^3D, ..., M_T^3D}, where T is the number of time periods.
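A sketch of this grouping of temporally ordered DVM columns into classes (Eqs. 8-9); representing each class as a list of column vectors is an implementation choice made here.

```python
import numpy as np

def consistent(column, class_columns):
    """Eqs. (8)-(9): a column conflicts with a class if some building is
    observed (+1) in one and missing (-1) in the other."""
    for prev in class_columns:
        if np.any((column == +1) & (prev == -1)) or np.any((column == -1) & (prev == +1)):
            return False
    return True

def temporal_segmentation(dvm_ordered):
    """Split temporally ordered DVM columns into classes T_1, ..., T_T.
    Each class corresponds to one 3D model of the final 4D model."""
    classes = [[dvm_ordered[:, 0]]]
    labels = [0]
    for i in range(1, dvm_ordered.shape[1]):
        col = dvm_ordered[:, i]
        if consistent(col, classes[-1]):
            classes[-1].append(col)
        else:
            classes.append([col])
        labels.append(len(classes) - 1)
    return labels          # labels[i] = index of the time period of image i
```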
6 Experimental Results
We have performed various experiments to validate our time-varying urban scene modeling algorithm, using images from a laboratory setup and real collections of Xi'an City. In the lab setup, we used toy buildings to simulate the changes in a city: a new building being constructed is simulated by adding a toy building to the scene, and a building being demolished by taking a building away. We designed the entire change process in advance, and then took photos of the scene at different viewpoints and scales. Images were captured with a Sony T50 digital camera with fixed intrinsic parameters; 82 images of size 3072 × 2304 were taken around the scene. The SfM procedure jointly generated the projection matrices of the images and 23,798 sparse points. After multi-view stereo reconstruction we finally obtained 217,914 dense points. Fig. 3 shows the sparse and dense point clouds of the scene, from which we can observe that many buildings are mixed in the same places, for example, at the top-left, top-right and center of the simulated urban scene. After removing noisy points, 198,732 points were left, which were considered the clean data for further building segmentation. We then executed the hierarchical algorithm described in Section 3 to partition these 3D points into independent buildings. The segmentation result with 10 buildings is shown in Fig. 4. We can see that the additional correspondence information provided by the multi-view images is very useful for segmenting buildings that overlap in space. The visibility and existence of each building in each image were inferred by the z-buffer and convex hull based visibility algorithms. The final optimal visibility matrix is shown at the bottom of Fig. 5: red grid cells indicate value -1 (missing), blue cells indicate value +1 (observed), and green cells indicate value 0 (occluded or out of view). The local search costs at most 10 minutes. We also ran the method of [11], which uses points as the basic unit, for comparison.
Fig. 3. An illustration of 3D scene reconstruction from 82 images. Left: sparse points. Right: dense points. We can see that some buildings are overlapping in the space, because they exist in the same place but at different time.
Fig. 4. An illustration of ten buildings obtained from 3D points by hierarchical segmentation algorithm
On this data set, containing 198,732 points from 82 images of a scene, it failed to produce meaningful results after running for two days. This shows that treating buildings as the basic unit is more efficient than treating points as the basic unit. In the final stage, we performed 3D model segmentation along the time axis on the basis of the temporally ordered images and obtained 13 different scenes. Since we are more interested in new building constructions and have limited space, we show in the top of Fig. 5 the 10 scenes in which a new building appears, either in a new location or in the place where an old building was demolished. The other 3 scenes, with only old buildings disappearing, are not shown in Fig. 5 (denoted as (4), (6) and (9) in red). As mentioned in Section 5, each model, containing a set of buildings, corresponds to a specific period of time.
Fig. 5. An illustration of inferring the temporal ordering of 3D models
Fig. 6. The whole 3D reconstruction result of Xi’an city
For example, the first model lasts until the second building is constructed, at which point it transitions to the second model. The difference between models 12 and 13 is that two old buildings in the left part of the views are demolished. We can switch among these views to observe the appearance changes in the scene. We also performed experiments on a real collection of images of the main avenues of Xi'an City, collected by ourselves and from the Internet. Fig. 6 shows screen shots of the reconstruction results produced by our system from 9,504 images; the two shots are observed from different viewpoints. Fig. 7 shows the urban scene changing over time. The scene is located on the North Avenue of Xi'an City, as highlighted by the red dashed rounded rectangle in Fig. 6. The three recovered scenes in Fig. 7(a) correspond roughly to the 1990s, 2000s and 2010s of the eastern side of the North Avenue of Xi'an City, respectively. There are initially 311,620 sparse and 1,855,518 dense points. After noise filtering, we have 1,569,406 clean points of mixed buildings, from which the system obtains 9 buildings in total. These buildings are then taken as the input of the visibility reasoning and temporal order inference steps. From the time-varying 3D models recovered in the spatial-temporal space, we can observe the changes of buildings from the 1990s and 2000s to the 2010s. For example, the building highlighted by the yellow ellipse existed until the 2000s and then disappeared, and it was replaced with two new building models in the same place in the period from the 2000s to the 2010s. In addition, there was no building in the red dashed ellipse in the 1990s, and a new building was built in the period from the 1990s to the 2000s. In Fig. 7(b), we can see that the western side of the North Avenue of Xi'an City changed greatly from the 1990s and 2000s to the 2010s. The total numbers of 3D sparse and dense points are 287,052 and 1,882,531, respectively. After removing noise, 1,618,040 points were left, from which 9 different buildings were segmented by our algorithm. Looking at the area highlighted by the yellow ellipse in Fig. 7(b), there was no building in the 1990s; in the 2000s, a new building was constructed in that place; however, another new building replaced it in the 2010s. Furthermore, we can infer that the two buildings in the red dashed ellipses were built after the 2000s.
Fig. 7. 4D urban scene changes of the North Avenue in Xi'an City in different periods of time
7 Conclusions
In this paper we have presented a technique to model 3D urban scenes in the spatial-temporal space using a collection of photos. Our main contribution is the use of a middle-level representation, the building, to characterize significant structural changes in the scene. By abstracting points to buildings, we greatly reduce the computational complexity of inferring the temporal ordering of images, and we are able to handle large-scale urban scene reconstruction in the spatial-temporal space. We employ SfM techniques to obtain 3D point clouds from historical images, and then segment these 3D points into independent buildings with a hierarchical segmentation algorithm in space and time, consisting of coarse clustering and fine classification. We infer the visibility of each building in each image using a z-buffering based reasoning method. Finally, we cast the temporal ordering problem as a constraint satisfaction problem and solve it with local search. We have demonstrated the feasibility of the approach using toy building images from a lab and real urban scene images. Compared to the state-of-the-art approach, our method is better suited to large-scale time-varying urban scene modeling. The approach can be used in many applications such as virtual city tourism, and can help with city planning.
Acknowledgement. This work was supported by NSFC Fund (60873085) and National Hi-Tech Development Programs under grants No. 2007AA01Z314 and 2009AA01Z332, P. R. China. The third author was partially supported by National Science Foundation of USA.
References 1. Agarwal, S., Snavely, N., Simon, I., Seitz, S., Szeliski, R.: Building rome in a day. In: Proc. ICCV, pp. 72–79 (2009) 2. Akhter, I., Sheikh, Y.A., Khan, S., Kanade, T.: Nonrigid structure from motion in trajectory space. In: Proc. NIPS, vol. 1 (2008) 3. Chuang, Y.Y., Goldman, D., Zheng, K., Curless, B., Salesin, D., Szeliski, R.: Animating pictures with stochastic motion textures. In: Proc. ACM SIGGRAPH, pp. 853–860 (2005) 4. Dorninger, P., Nothegger, C.: 3d segmentation of unstructured point clouds for building modeling. In: Int. Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences (2007) 5. Faugeras, O., Bras-Mehlman, E.L., Boissonnat, J.D.: Representing stereo data with the delaunay triangulation. Artificial Intelligence 44, 41–87 (1990) 6. Filin, S., Pfeifer, N.: Segmentation of airborne laser scanning data using a slope adaptive neighborhood. ISPRS Journal of Photogrammetry and Remote Sensing 60, 71–80 (2006) 7. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. IEEE Trans. on PAMI 32, 1362–1376 (2010) 8. Pani, A., Bhattacharjee, G.: Temporal representation and reasoning in artificial intelligence a review. Mathematical and Computer Modeling 34, 55–80 (2001) 9. Pollefeys, M., Nister, D., Frahm, J.M., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S.J., Merrell, P., Salmi, C., Sinha, S., Talton, B., Wang, L., Yang, Q., Stewenius, H., Yang, R., Welch, G., Towles, H.: Detailed real-time urban 3d reconstruction from video. Int. Journal of Computer Vision 78, 143–167 (2008) 10. Snavely, N., Seitz, S., Szeliski, R.: Modeling the world from internet photo collections. Int. Journal of Computer Vision 80, 189–210 (2008) 11. Schindler, G., Dellaert, F., Kang, S.: Inferring temporal order of images from 3d structure. In: Proc. CVPR (2007) 12. Schindler, G., Dellaert, F.: Probabilistic temporal inference on reconstructed 3d scenes. In: Proc. CVPR (2010) 13. Simon, I., Seitz, S.M.: Scene segmentation using the wisdom of crowds. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 541– 553. Springer, Heidelberg (2008) 14. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring image collections in 3d. ACM Transactions on Graphics 25, 835–846 (2006) 15. Xiao, J., Wang, J., Quan, L.: Joint affinity propagation for multiple view segmentation. In: Proc. ICCV (2007) 16. Xiao, J., Fang, T., Zhao, P., Lhuillier, M., Quan, L.: Image-based street-side city modeling. ACM Transaction on Graphics 28, 114 (2009) 17. Xu, X., Wan, L., Liu, X., Wong, T.T., Wang, L., Leung, C.S.: Animating animal motion from still. ACM Transactions on Graphics 27, 117:1–117:8 (2008)
Family Facial Patch Resemblance Extraction
M. Ghahramani¹, W.Y. Yau², and E.K. Teoh¹
¹ School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore 639798 {moha0116,eekteoh}@ntu.edu.sg
² Institute for Infocomm Research I2R, Singapore 138632 [email protected]
Abstract. Family members bear close facial resemblance to one another, especially in certain specific parts of the face, but which parts resemble each other differs from family to family. Humans have no difficulty identifying such facial resemblances to guess family relationships. This paper attempts to give computers a similar capability by measuring the resemblance of each facial patch in order to classify family members. To achieve this goal, family datasets are collected. A modified Golden Ratio Mask is implemented to guide the splitting of facial patches. Features of each facial patch are selected and analyzed by an individual classifier, and the importance of each patch is extracted to find the set of most informative patches. To evaluate the performance, various scenarios in which different members of the family are absent from training but present in testing are used to classify the family members. The results obtained show that we can achieve up to 98% average accuracy on the collected dataset.
1 Introduction
Soft biometrics, human characteristics that include information such as age, gender, ethnicity, etc., can be determined from face data [1]. Beyond the current soft biometric modalities, one can explore whether a query face belongs to a certain group of people having similarities in their appearance. The closest relations of a person are his family, and the ability to recognize whether a group of people are related is a useful trait, which we address as "Family Classification". In almost all face recognition applications, the individuals to be recognized have been enrolled a priori, so there is at least one sample in the training dataset. However, there are interesting scenarios where it is imperative to automatically detect possible candidates for a specific family member whose photo is not available. On average, 2,185 children are reported missing each day, for example at megamalls; it would be helpful to be able to find a missing child in a surveillance camera system using just a few samples of the parents and other children [2]. In this paper, we attempt to develop algorithms for family classification. In Section 2, a literature review on facial patch analysis is given, followed by recent work on family relationship processing. Section 3 describes the approach used for family
classification, followed by the selection of informative facial patches in Section 4. Section 5 describes the experimental results, followed by the conclusion and future work.
2 Background
Face recognition approaches can be divided into two main categories, holistic and feature-based approaches [3]. Recent works tend to prefer feature-based approaches due to their higher accuracy, while holistic approaches preserve information such as the distances and shapes of facial patches. One way to benefit from both is to separate facial patches with a face template and process each patch in a different manner. The earliest paper proposing a face mask template to split patches was [4], for detecting frontal faces under illumination variation. It was then used to develop a socially intelligent robot that uses natural cues to interact with human beings [5]. K. Anderson et al. modified the face mask template so that it fits the face region without including the background. They incorporated the golden ratio, a facial mask constructed by Marquardt [6] for reconstructive plastic surgery [7]. The template was mainly designed for face illumination changes under various lighting angles. F. Li et al. [8] split facial patches using Scale Invariant Feature Transform (SIFT) descriptors and employed a strangeness measurement to remove background and strange samples from the input data. In another work, facial patches were separated in a simple way such that each patch represents one of the 5 main parts of the face; the data dimension was reduced by PCA and an SVM determined the label [9]. In the area related to family classification, Das et al. [10] developed a system for automatic albuming of consumer photographs, providing the user with the option of selecting image groups based on the people present in them. However, similar faces of family members were incorrectly classified as the query person. Singla et al. [11] investigated Markov logic to "discover social relationships in consumer photo collections". In addition to the hard rules (e.g., parents are older than children) and soft rules (e.g., couples are of opposite gender) in social relationships, Markov logic is able to learn new rules from the data. Their work is mainly about family photos in which the mentioned rules are satisfied, not about determining generic family relationships. The problem of family classification is addressed directly using overall facial characteristics in [12]. They concluded that in unseen family classification cases, when a member is absent from the training set, SIFT and Speeded-Up Robust Features (SURF) fail to recognize the absent family member, and the family classification performance with other features is not satisfactory in unseen member classification tests either. Therefore, there is motivation to improve family classification further. Our work differs from all the earlier works in that it is based on the use of similar facial patches to perform family classification and find family facial resemblance.
Fig. 1. The overall framework of family classification using informative facial resemblance
3 Family Classification and Resemblance Extraction
Our strategy is to classify family members based on the measured resemblance of their facial patches and to remove the redundant information of patches, as shown in Fig. 1, regardless of the classifier and feature types. Features of the split facial patches (Data #i) are extracted and given to individual classifiers. The optimal decision making rule is applied to the classifiers' outputs. The final block identifies the redundant information and adjusts the weight of each patch using the training data. In this manner, the overall system is trained on the family database to classify family members based on their measured facial resemblance and to remove the redundant information. The overall approach is as follows:
1. Split the facial patches using the modified face template.
2. Extract the features of each patch and select the discriminative features of each patch's feature set by AdaBoost.
3. Select the most informative set of patches and remove the rest, giving a minimal set of patches for recognizing the family.
4. Based on the reliability of each patch, apply the optimal Chair-Varshney decision making algorithm for family classification.
A detailed explanation of the proposed algorithm is given in the following subsections.

3.1 Facial Patch Template
The distances between the interest points of faces have almost the same ratio, known as the Golden Ratio. This average ratio varies across ethnicities and also from face to face. Since family members resemble each other, we postulate that the ratio does not vary for facial patches if the size of the face image is normalized. The modified face template obtained by Marquardt [6] does not include facial similarities of family members such as the chin and face shape or the whole region of the eyes or nose. To maintain the ratios of the modified golden ratio mask while including similar family facial patches, the face template shown in Fig. 2(b) is proposed. Fig. 2(a) also shows members of family 3 with their split patches.
Fig. 2. (a) The proposed facial template and cropped facial patches (b) patch indexing
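Since the exact patch coordinates of the proposed template are not reproduced here, the following sketch only illustrates the mechanics of the splitting step: the template is expressed as fractional rectangles on a face normalized by the detected eye positions. All rectangle values, the normalized size and the function names are placeholder assumptions, not the authors' template.

```python
import numpy as np
import cv2

# Hypothetical template: each patch as (x0, y0, x1, y1) fractions of the
# normalized face box. The real template in the paper has 17 patches.
TEMPLATE = {
    5: (0.15, 0.30, 0.45, 0.45),   # left eye region (placeholder values)
    7: (0.55, 0.30, 0.85, 0.45),   # right eye region (placeholder values)
    # ... remaining patches
}

def split_patches(face_img, left_eye, right_eye, out_size=(80, 95)):
    """Align the face by the eye positions and crop the template patches."""
    dx, dy = np.subtract(right_eye, left_eye, dtype=float)
    angle = np.degrees(np.arctan2(dy, dx))
    center = tuple(np.mean([left_eye, right_eye], axis=0))
    # rotate so the eye line becomes horizontal, then resize to the canonical box
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(face_img, M, face_img.shape[1::-1])
    aligned = cv2.resize(aligned, out_size)
    w, h = out_size
    patches = {}
    for idx, (x0, y0, x1, y1) in TEMPLATE.items():
        patches[idx] = aligned[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
    return patches
```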
3.2 Features and Classifier
Based on the results obtained in [12], Gabor wavelets were selected, since they are highly robust to illumination changes and displacement.

Gabor Wavelets. A Gabor wavelet is defined as [13]

$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2}\, \exp\!\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right)\left[e^{i k_{\mu,\nu}\cdot z} - e^{-\sigma^2/2}\right], \qquad k_{\mu,\nu} = k_\nu e^{i\varphi_\mu}$$

where z = (x, y) are the coordinates of the pixel. The parameters μ and ν index the orientation and scale of these filters, with 8 orientations and 5 scales used here; σ is the standard deviation of the Gaussian window of the Gabor kernel, with k_ν = k_max/f^ν and φ_μ = πμ/8 when 8 different orientations are chosen. k_max is the maximum frequency, and f is the spatial frequency spacing factor between kernels in the frequency domain. Gabor features are therefore extracted by convolving the faces with this filter bank of Gabor wavelets with 5 scales and 8 orientations.

AdaBoost classifier. AdaBoost is a widely used feature selection and learning algorithm that combines many weak classifiers to form a strong classifier by adjusting the weights of the training samples given their classified labels at each iteration [14]. For our AdaBoost training, we employ a decision tree with 3 decision nodes as the weak classifier.
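A minimal sketch of extracting such a Gabor feature vector for one patch follows. The kernel size, σ = 2π, k_max = π/2, f = √2, and the use of downsampled response magnitudes concatenated into one vector are assumptions standing in for unreported implementation details.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_bank(ksize=31, sigma=2 * np.pi, kmax=np.pi / 2, f=np.sqrt(2),
               scales=5, orientations=8):
    """Build the complex Gabor kernels psi_{mu,nu} (5 scales x 8 orientations)."""
    half = ksize // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    kernels = []
    for nu in range(scales):
        for mu in range(orientations):
            k = (kmax / f**nu) * np.exp(1j * np.pi * mu / orientations)
            kx, ky = k.real, k.imag
            mag2 = kx**2 + ky**2
            wave = np.exp(1j * (kx * xs + ky * ys)) - np.exp(-sigma**2 / 2)
            kern = (mag2 / sigma**2) * np.exp(-mag2 * (xs**2 + ys**2) / (2 * sigma**2)) * wave
            kernels.append(kern)
    return kernels

def gabor_features(patch, kernels, step=4):
    """Concatenate downsampled magnitudes of all filter responses into one vector."""
    feats = [np.abs(fftconvolve(patch, k, mode='same'))[::step, ::step].ravel()
             for k in kernels]
    return np.concatenate(feats)
```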
3.3 Mixture of Experts
The main purpose of the Mixture of Experts (MOE) algorithm is to improve the confidence of the decision by taking the outcome of each individual expert and combining them to make the final decision [15]. In our case, each classifier is regarded as an expert. MOE is suitable for family classification primarily because the training set may not be representative of the future data, especially since there will be cases where the test data has not been seen during training. In addition, the amount of available data is small, which calls for a divide-and-conquer approach, and MOE can help find decision boundaries that lie outside the space of functions reachable with only one classifier model. In order to improve the performance of family classification and extract facial patch resemblance, we employ the MOE on the proposed golden mask template. Given the face images, the eye positions are detected using [16]. The proposed golden mask template is then aligned using the eye positions to split the image into
regions of interest. Subsequently, the Gabor features of each patch are extracted and sent to separate AdaBoost classifiers. The performance of each classifier in family classification on the training set is measured by the False Negative Ratio (FNR), the number of family members incorrectly classified as non-family members over the total number of family samples in the dataset, and the False Positive Ratio (FPR), the number of non-family members incorrectly classified as family members over the total number of non-family samples. Our goal is to minimize the FNR for family classification.

Optimum Decision Making Rule. To obtain the final decision from the outcomes of the various patches, the decisions of all the designed experts (patches) must be combined. Many decision making rules for such combination are available in the literature, such as Majority Voting (MV), Weighted Majority Voting (WMV), Borda Count, or algebraic combinations using the minimum, maximum, median, product or sum rule. The optimal solution for decision making based on a priori information is the Chair-Varshney rule proposed in [17]. According to this rule, the final decision f is given by

$$f = \begin{cases} +1 & \text{if } \log\frac{P_1}{P_0} + \sum_{i \in S_1}\log\frac{1-FNR_i}{FPR_i} + \sum_{i \in S_0}\log\frac{FNR_i}{1-FPR_i} > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (2)$$

where P1 is the a priori probability of a sample being a family member (denoted C1), P1 = p(C1), and P0 is the a priori probability of a sample being a non-family member (denoted C0), P0 = p(C0). There are N classifiers; here N = 17 is chosen since there are 17 patches, and the output of each patch is assumed to be independent of the others. The AdaBoost classifier output u_i, i = 1, ..., N, is +1 if the sample is classified as family (C1) and -1 if it is classified as non-family (C0). S1 denotes the set of classifiers whose output u_i decides C1, and S0 the set deciding C0. FPR_i and FNR_i are the i-th classifier's false positive and false negative error rates, respectively. In Eq. (2), all the parameters except log(P1/P0) can be calculated from the training data; log(P1/P0) is then set as the threshold in the decision making. Since the goal is to minimize the FNR, after training the classifiers we set a desired FPR, denoted FPR_D, as a constant value to obtain the threshold, following the Neyman-Pearson criterion. Hence, the optimum decision making rule with the least FNR is achieved using Eq. (2).
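A sketch of this fusion rule; the training-set estimation of FPR_i/FNR_i outside this function and the small epsilon used to avoid log(0) are implementation assumptions.

```python
import numpy as np

def chair_varshney(u, fpr, fnr, threshold, eps=1e-6):
    """Fuse N patch-level decisions u_i in {-1, +1} into a final label (Eq. 2).

    u         : (N,) array of per-patch AdaBoost outputs (+1 family, -1 non-family)
    fpr, fnr  : (N,) per-patch error rates estimated on the training set
    threshold : plays the role of log(P1/P0); tuned so that FPR = FPR_D
                on the training data (Neyman-Pearson criterion)
    """
    fpr = np.clip(fpr, eps, 1 - eps)
    fnr = np.clip(fnr, eps, 1 - eps)
    s = threshold
    s += np.sum(np.log((1 - fnr[u == +1]) / fpr[u == +1]))   # experts voting family
    s += np.sum(np.log(fnr[u == -1] / (1 - fpr[u == -1])))   # experts voting non-family
    return +1 if s > 0 else -1
```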
4 Selection of Important Patches
We can measure how reliable each patch is for classifying a family member by applying AdaBoost classification to the features extracted from the split patches of the family members' faces. This is possible since each patch has a different degree of similarity among the family members, and thus its value for family classification differs; the patch resemblance is thus family specific. After identifying the family members, we proceed to find the informative patches among the split patches of a family. The most informative set of t patches are the ones which bear close resemblance among the family members as identified by the classifier (t is the
number of patches chosen, t = 2, ..., N). The set of informative patches found should also generalize to members who are absent from the training set, as an indication of the ability of these informative patches to generalize. In MOE, the diversity of the classifiers is a key criterion for good performance, since the main idea of MOE is to have a pool of classifiers such that individual classifiers perform well on different parts of the data samples and their combination provides the best results for the entire data set. In other words, they should be as diverse as possible. There are several measurements of diversity in the literature, including the entropy measurement or the measure of difficulty. The most appropriate approach, capable of computing the diversity of n out of a total of N classifiers, is to calculate the diversity of any set of classifiers implemented on each patch with respect to all other patches, in order to evaluate the set of the most informative patches for a certain family. Based on this definition, the diversity of n out of N classifiers is given by ρ [18],
where N_f is the number of cases in which the n classifiers are all false, N_t is the number of cases in which the n classifiers are all true, N_T is the total number of cases (conditions), and n is the number of selected experts. In order to calculate all classifiers' diversities, or patch informativeness, we must consider all combinations of the N classifiers, which leads to N! permutations. Since diversity is a kind of correlation and corr(p, q) = corr(q, p) for real outputs (here, the absolute values of the Gabor features), the total number of permutations is given by:
The diversity of a set of classifiers must be minimal for the best combination of informative classifiers with the least error. Among every set of t patches with minimal diversity, the training error is evaluated by the optimum decision making rule, and the set of t patches with the least error is selected as the optimal set of t informative patches. Based on this, it is possible to find the minimum number of informative patches beyond which the performance does not differ much. In the proposed scheme, not only the confidence level of the patches is considered, but also the amount of diverse information each patch carries, through the diversity analysis.
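The selection step can be sketched as below; since the exact form of the diversity measure ρ is not reproduced here, it is passed in as a user-supplied function, and restricting the error evaluation to the least-diverse fraction of subsets is an illustrative simplification rather than the authors' procedure.

```python
from itertools import combinations
import numpy as np

def select_informative_patches(decisions, labels, t, diversity_fn, fuse_fn):
    """Pick the set of t patches with minimal diversity and least training error.

    decisions    : (N_patches, n_samples) array of per-patch decisions (+1/-1)
    labels       : (n_samples,) ground-truth labels (+1 family, -1 non-family)
    diversity_fn : callable mapping a (t, n_samples) decision block to a diversity value
    fuse_fn      : callable fusing a decision block into (n_samples,) final labels
    """
    n_patches = decisions.shape[0]
    candidates = sorted(combinations(range(n_patches), t),
                        key=lambda s: diversity_fn(decisions[list(s)]))
    best, best_err = None, np.inf
    # evaluate the least-diverse candidate subsets by their fused training error
    for subset in candidates[:max(1, len(candidates) // 10)]:
        fused = fuse_fn(decisions[list(subset)])
        err = np.mean(fused != labels)
        if err < best_err:
            best, best_err = subset, err
    return best, best_err
```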
5 Experimental Results and Discussion

5.1 Family Datasets
For our experiments, we were not able to find any publicly available datasets, so we constructed our own, comprising a published family photo dataset [19], family photo datasets of famous people, social websites, and relatives. In the chosen dataset, we tried to minimize common face variations such as pose and illumination changes.
Table 1. Specifications of the family dataset collected
The collected family datasets are chosen to be diverse enough in terms of the parents' aging effects, having parents and children in different age groups, and ethnicities. Most of the available family datasets have more samples of children than of parents. On average, we collected about 25 samples from every parent and 40 samples from every child, giving more than 1600 samples for 15 families in total. Table 1 shows the properties of the database.

5.2 Family Classification
The following experiments are conducted to cover real applications. Test 1: the family members are present in both the training and test sets. Test 2: a member is omitted from the training set, i.e., a missing child or an unknown father. Two different scenarios are defined: Test 2.a, in which the child with the maximum number of samples is omitted from the training set, and Test 2.b, in which the parent with the maximum number of samples is omitted. In Test 1, 67% of the family members' samples and 67% of 1000 faces from the FERET dataset [20], taken as non-family members, are used for training, and the rest of the data for testing. In Test 2, 67% of all the family members' samples, except the omitted member, are used for training together with 67% of the non-family members; the test set includes only the omitted member with the remaining non-family members. Faces in each dataset are detected, aligned and cropped to a size of 80×95 based on the eye coordinates. The chart in Fig. 3 shows the family classification performance on the collected family datasets in the different test cases with FPR_D = 0.02. The x-axis is the ascending family index. Since parents have less resemblance, Test 2.b is less accurate than the other tests for all families except family 9, whose children are quadruplets. Apparently, the parents' aging effect is the main cause of the low performance for families 3, 10, 12 and 14.
Fig. 3. Family classification performance for family datasets in defined test scenarios
Using Eq. (5), we could achieve 98% accuracy (1-Err) to classify family members.
Conventional face recognition employs a single classifier on the extracted features. We apply the conventional scheme [21], with an AdaBoost classifier and Gabor features trained for each member; the final decision of being a family member is made if any of the classifiers trained for the individual members recognizes the query sample as that member. The experiments show that the proposed algorithm outperforms the conventional scheme, particularly in Test 2, since a missing member can be found through the facial similarities of the whole family, which are evaluated in the proposed method. On average, the total error decreases by 5%, 7% and 8% for Tests 1, 2.a and 2.b respectively with the proposed method, and the FNR decreases on average by 10%, 12% and 13% for Tests 1, 2.a and 2.b compared to the conventional method. Note that family classification performance is enhanced by minimizing the FNR, which is the objective considered in this work.

5.3 Family Facial Patch Resemblance
The family datasets described in Table 1 are used to extract the family facial patch resemblance with the proposed algorithm explained in Sections 3 and 4. Table 2 shows a sample of each member from families 1-5 and the corresponding normalized facial patch resemblance, using the patch reliability measurement color map in the last column, for Test 1. The results for the other families and tests are not shown due to limited space. The color maps of the patch reliabilities for 10 families show that:
- the eyes, patches 5 and 7, are the most reliable facial patches;
- patches 2 and 3 are less reliable in families with a strong aging effect;
- the reliability is almost symmetrical horizontally.
Note that the resemblance symmetry can be distorted by hair style, face shape or inaccurate face alignment caused by unequal facial patch distances.
Table 2. Family facial patch resemblance using reliability measurement color map
However, the resemblance of the other patches, including the eyes and cheeks, is symmetrical for family 5. Moreover, the face ratio of the father of family 5 is very different from the other members. The next step is to find the most informative t patches as described in Section 4. In the literature, feature subset selection methods assume that the importance of each feature is independent of the other features and that the kernels should be located at fiducial points [22]. However, these two constraints make the subset selection sub-optimal. In the proposed method, the importance of each patch is evaluated with dependency on the rest of the patches, achieving an inter-patch analysis without constraints on the feature subsets. As shown in Fig. 4, the informative patches found are not only around the fiducial points. The family classification results obtained for family 3, Test 1, are tabulated for both the training and testing sets in Table 3 with respect to the number of patches (t). The right-most column of Table 3 shows the total error (Err), obtained as the number of samples classified incorrectly over the total number of samples; the total error provides an indication of the informativeness of the patches. FPR_D is set to 0.02 to obtain the threshold value as per Eq. (2) after 25 iterations of training the classifier. The results show that even with small t, the total error is only 0.02 to 0.03 higher than when using all patches. For example, even seven patches (t = 7) are enough to achieve almost the same performance as using all patches for family 3 in Test 1.
Table 3. Determination of t informative patches for family 3 classification, test 1, with FNR error in both training and testing sets and (Err) of testing set
Table 4. Minimum number of patches needed to achieve almost the accuracy of all patches for family classification of families (stated in Table 2) and test 1
The trend of decreasing FNR with an increasing number of patches is similar in both the training and test sets. Again, the tabulations for the other families and test cases are not included due to limited space. The summarized results on the minimum number of patches needed to achieve almost the same accuracy as all 17 patches are stated in Table 4. The patch indexes and the FNR (test set) achieved with t patches are also tabulated, to compare with the FNR (test set) of all patches, for the classification of families 1-5 and Test 1. Note that the patches are sorted by index order in Tables 3 and 4. The histogram of the patches selected as the most informative ones in Table 3 for family 3 is shown in Fig. 4(c), giving the importance of each patch in family classification using the color map shown in Fig. 4(f). The same analysis is applied to families 1, 2, 4 and 5 to extract the informativeness measurement of the facial patches, with results shown in Fig. 4(a), 4(b), 4(d) and 4(e), respectively. This histogram measures each patch's informativeness for family classification in Test 1. However, it does not mean that the unimportant patches are unreliable, only that they add no information to the most informative patches selected. This supports the human intuition of recognizing family relationships by judging the similarity of certain parts of the face; by finding such parts, family classification performance can be optimized.
Fig. 4. The importance histogram color map for family a) 1, b) 2, c) 3, d) 4, e) 5 and test 1; f) color map indexing
6 Conclusion and Future Work
In this paper, we proposed a family classification approach using facial patch analysis. The facial patches are split using a modified Golden Ratio mask. For each patch, Gabor wavelet features are extracted and an AdaBoost classifier is used to classify the patch and measure the facial patch resemblance in the family. The optimal decision making rule is applied to the classification results to obtain the final decision for each sample. From the training stage, a predetermined number of the most informative patches are selected, with the remaining patches regarded as redundant. To optimize the patch selection, the redundant patches have to be removed while keeping the achievable accuracy as high as possible; a constraint-free inter-patch analysis of the facial patches is proposed to select the t most informative patches. The results obtained show that such a selection can provide results close to using all the patches. Depending on the family, using as few as 4 patches provides almost the same performance as using all 17 patches. This allows the performance and computational time to be optimized, supporting the human intuition that family membership can be judged by finding the similar facial parts. We also showed that the proposed approach generalizes well even when some of the family members have not been seen in the training set. However, the family classification performance is poor when one of the parents is absent, especially for small datasets. This needs to be improved by using features that are capable of extracting more subtle similarities in a family, which is our future work.
References 1. Geng, X., Zhou, Z.H., Smith-Miles, K.: Automatic Age Estimation Based on Facial Aging Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12), 2234–2240 (2007) 2. Sedlak, A.J., Finkelhor, D., Hammer, H., Schultz., D.J.: National Estimates of Missing Children: An Overview. Office of Juvenile Justice and Delinquency Prevention, Office of Justice Programs, 5 (2002) 3. Li, S., Jain, A.: Handbook of Face Recognition. Springer, Heidelberg (2005) 4. Sinha, P.: Perceiving and Recognizing Three-Dimensional Forms. PhD Dissertation, Massachusetts Institute of Technology (1996)
5. Scassellati, B.: Eye finding via face detection for a foveated, active vision system. In: The Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, pp. 969–976 (1998) 6. Marquardt, S.R.: Marquardt Beauty Analysis, www.beautyanalysis.com 7. Anderson, K., McOwan, P.W.: Robust real-time face tracker for cluttered environments. Comput. Vis. Image Underst. 95, 184–200 (2004) 8. Fayin, F.L., Wechsler, H., Tistarelli, M.: Robust fusion using boosting and transduction for component-based face recognition. In: 10th International Conference on Control, Automation, Robotics and Vision, ICARCV 2008, pp. 434–439 (2008) 9. Andreu, Y., Mollineda, R.A.: The Role of Face Parts in Gender Recognition. In: The 5th International Conference on Image Analysis and Recognition, pp. 945–954 (2008) 10. Das, M., Loui, A.C.: Automatic face-based image grouping for albuming. In: IEEE International Conf. on Systems, Man and Cybernetics, vol. 3724, pp. 3726–3731 (2003) 11. Singla, P., Kautz, H., Jiebo, L., Gallagher, A.: Discovery of social relationships in consumer photo collections using Markov Logic. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2008, pp. 1–7 (2008) 12. Ghahramani, M., Wang, H.L., Yau, W.Y., Teoh, E.K.: Feature study and analysis for unseen family classification. In: The Proc. of SPIE, pp. 7546–75460Q (2010) 13. Chengjun, L.: Gabor-based kernel PCA with fractional power polynomial models for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5), 572–581 (2004) 14. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: The Second European Conference on Computational Learning Theory (1995) 15. Polikar, R.: Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6, 21–45 (2006) 16. Stephen, M., Fred, N.: Locating Facial Features with an Extended Active Shape Model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 504–513. Springer, Heidelberg (2008) 17. Chair, Z., Varshney, P.K.: Optimal Data Fusion in Multiple Sensor Detection Systems. IEEE Transactions on Aerospace and Electronic Systems AES-22(1), 98–101 (1986) 18. Goebel, K., Yan, W.: Choosing Classifiers for Decision Fusion. In: The Seventh International Conference on Information Fusion 2004, pp. 563–568 (2004) 19. Gallagher, A.C., Tsuhan, C.: Clothing cosegmentation for recognizing people. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8 (2008) 20. Phillips, P.J., Hyeonjoon, M., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000) 21. Mian, Z., Hong, W.: Face Verification Using GaborWavelets and AdaBoost. In: The 18th International Conference on Pattern Recognition (2006) 22. Gokberk, B., Irfanoglu, M.O., Akarun, L., Alpaydm, E.: Optimal Gabor kernel location selection for face recognition. In: International Conference on Image Processing, ICIP 2003, pp. I-677–I-680 (2003)
3D Line Segment Detection for Unorganized Point Clouds from Multi-view Stereo Tingwang Chen and Qing Wang School of Computer Science and Engineering Northwestern Polytechnical University, Xi’an 710072, P.R. China
Abstract. This paper presents a fast and reliable approach for detecting 3D line segments on unorganized point clouds from multi-view stereo. The core idea is to discover weak matchings of line segments by re-projecting 3D points onto the 2D image planes and to infer 3D line segments from spatial constraints. On the basis of a 2D line segment detector and multi-view stereo, the proposed algorithm first re-projects the spatial point cloud onto the image planes using the different camera matrices, then finds the best re-projection line from the tentatively matched points. Finally, 3D line segments are produced by back-projection after outlier removal. In order to remove the matching errors caused by re-projection, a plane clustering method is implemented. Experimental results show that the approach obtains visually satisfactory 3D line detection as well as high computational efficiency. The proposed fast line detection can be extended to the application of 3D sketching of large-scale scenes from multiple images.
1 Introduction
In recent years, image-based rendering of large-scale scenes has become a hot topic in the fields of computer vision and virtual reality. There are mainly three kinds of rendering models in 3D scene visualization: point cloud based models, line segment based models and texture based models. A point cloud based model is efficient to render, but it lacks the ability to describe the overall structure of the scene. A texture based model gives a strong sense of reality, but the number of texture images is huge, and for larger scenes the rendering speed often cannot support fast roaming. Thus, the line segment based model becomes the best choice for fast rendering due to its simple structure and high performance. Line matching across views is known to be difficult when performing 3D line segment detection. Although various types of prior information have been used in previous approaches [8, 10, 11, 17, 18], the task of line matching remains challenging. Because previous techniques are based on the idea of searching for line correspondences in 2D planar space, the complexity of line matching is roughly exponential in the number of images. Meanwhile, traditional methods improve the accuracy of 3D line segments only by constraining the matching accuracy of 2D line segments, e.g., multiple match pairs are linked into a connected component verified for consensus in at least four views [12-14], which cannot successfully remove false
3D lines in complicated scenes. To address these problems, we propose an efficient computational model for line matching and an effective mechanism based on spatial plane clustering to remove the false line segments. Compared to previous work, this paper makes the following contributions:
- It proposes a new line matching model, called weak matching, which reduces the time complexity from an exponential level to a multiplicative level.
- It presents a spatial coplanarity constraint to eliminate false matchings and remove false 3D line segments.

1.1 Traditional Line Matching Algorithm
Traditional 3D line segment detection consists of two steps: (1) match the 2D line segments between images and find the inter-image homography; (2) triangulate the matching 2D line segments to obtain a 3D line segment. The time complexity of the traditional algorithm is exponential in the number of images: given N images and approximately M lines in each image, the total complexity of line matching is roughly O(M^N). In order to reduce the amount of matching, Schmid et al. [10, 11] and Heuel et al. [8] introduced ways of using geometric constraints to restrict the search area to the epipolar beams. Woo et al. [17, 18] used disparity data among images as prior information to find lines with similar disparity distributions. Such methods can improve matching efficiency in some cases, but they need to know the epipolar geometry or disparity in advance, which limits them to small scenes and prevents extension to large-scale scene applications. Sinha et al. [12] and Snavely et al. [13, 14] derived sets of candidate 2D line matches by comparing line segments between pairs of nearby images, which is suitable for large-scale scenes. However, these methods use intensity values around the lines for similarity measurement, which may not be effective for wide-baseline images. Bay et al. [2] proposed to measure similarity between images using a color histogram as the feature descriptor to reduce computational complexity, but this method fails when the illumination changes dramatically. More recently, Fan et al. [4] introduced an affine invariant to match lines via point correspondences; it needs to find corresponding points coplanar with the lines, since those can be regarded as related by the same affine transformation, and it may fail when dealing with images with a sharp disparity distribution. Wang et al. [16] proposed to cluster line segments according to their relative positions to form a more robust feature representation; however, the issue of combinatorial expansion still remains. Eden et al. [3] proposed to compute 3D ROIs (regions of interest) by dividing the space into smaller cubes and solving the matching problem for the line segments that lie inside each cube. Assuming that the lines are distributed homogeneously over the cubes, the estimated number of lines inside each cube is M/C and the algorithmic complexity is O(M^N/C^N), where C is the total number of cubes. However, the time complexity becomes unacceptable when N is greater than 100. In real-world applications of multi-view
Fig. 1. An overview of 3D line segment detection from multi-view stereo
stereo towards Internet scale [1, 6], the number of images can exceed tens or even hundreds of thousands, so we need new, effective and efficient ways to solve the 2D line matching problem. In addition, as the number of images grows, more and more matching errors occur, which inevitably produces false 3D line segments. We need an outlier screening criterion to remove false line segments, but this issue has not been fully addressed by state-of-the-art approaches.
1.2 The Approach
We propose a new matching model, called weak matching, to collect correspondences between 3D points and 2D lines. A fast line segment detector (LSD) [7] is applied to extract 2D line segments. Moreover, we employ plane clustering and check whether the endpoints of a line segment belong to the same plane in order to screen out false matches caused by re-projection ambiguity. Fig. 1 depicts an overview of the proposed approach. The left of Fig. 1 shows the input photo collection, e.g., the benchmark data of Hall, and the right shows the output 3D line segment model. In the middle of Fig. 1, the process flow shows the detailed steps of 3D line segment detection. From top to bottom, the LSD detector first extracts 2D line segments in each image. Then a dense 3D point cloud is reconstructed by SfM methods. We re-project each 3D point to obtain a weak matching of 2D lines and search each image for the best re-projected line, i.e., the one containing the maximum number of re-projection points. Finally, we obtain 3D line segments through back-projection. All lines are further verified by the coplanarity constraint that the two endpoints of a 3D line should lie in the same plane; false 3D line segments are removed by a plane clustering algorithm. The highlighted steps of Fig. 1 are the main contributions of this paper. In the following sections, we address three key issues: 2D line segment matching, 3D line segment generation and false matching removal.
2 A New Matching Model for 2D Line Segments
Suppose we have K 3D points {X_k} and N camera matrices {P_i} reconstructed from N images {I_i} by SfM methods [5, 13], and M_i lines extracted by LSD [7] in the i-th image. Given a spatial point X, the re-projection of X into each image is x_i = P_i X, where P_i is the projection matrix of the i-th camera. Before introducing the matching model, we make one assumption: the re-projection point x_i falls into image I_i in one of two ways, (1) x_i does not belong to any line; (2) x_i belongs to one line only. Case 2 happens when x_i is a true feature point and can be observed in that image. We define the weak matching between line segments in two images, l_i^{m_1} ∼ l_j^{m_2}, if and only if the two line segments are associated with the same spatial point X, i.e.,

$$l_i^{m_1} \sim l_j^{m_2} \iff X = P_i^{-1} x_i,\ X = P_j^{-1} x_j,\ x_i \in l_i^{m_1},\ x_j \in l_j^{m_2} \quad (1)$$

where l_i^{m_1} is the m_1-th line (m_1 ∈ [1, M_i]) in the i-th image and l_j^{m_2} is the m_2-th line (m_2 ∈ [1, M_j]) in the j-th image, respectively. We collect the 2D line segments hit by x_i in all possible views in a list out_X by Eqn. (2); the line segments in out_X then match each other weakly.

$$out_X = \{ l_i^m \mid x_i = P_i X,\ x_i \in l_i^m,\ i = 1, 2, \ldots, N \} \quad (2)$$

Assume that a 3D line segment includes X. Accordingly, we can build a point list out_{l_i^m} by Eqn. (3) to collect the re-projection points in the i-th view, which is used to reconstruct the 3D line.

$$out_{l_i^m} = \{ x_i \mid x_i = P_i X_k,\ x_i \in l_i^m,\ k = 1, 2, \ldots, K \} \quad (3)$$
Fig. 2 shows an example of weak matching among 2D lines within three images. The 3D point X is re-projected to the 2D points x_1, x_2 and x_3 in the three views, highlighted in green. The red line segments l_1^1, l_2^1 and l_3^1 are associated with the same spatial point X. The same holds for the blue line segments l_2^2 and l_3^2, but the spatial point Y is not associated with any line in the first view. So the two corresponding line sets of X and Y are out_X = {l_1^1, l_2^1, l_3^1} and out_Y = {l_2^2, l_3^2}, respectively.
Fig. 2. An illustration of weak matching
Since 2D line segments have width, the computation becomes very complex if lines are matched between images by checking the geometric relationships among them. To improve efficiency, we use Algorithm 1 to compute weak matching, where out_X is the line set corresponding to the point X, and out_{l_i^m} is the set of re-projection points falling onto the line l_i^m. {Î_i} is a set of N label images corresponding to the 2D line segments of the N images. A label image is obtained by a flood fill algorithm applied to the 2D line segments of one image. The label of x_i is defined in Eqn. (4):

$$\hat{I}(x_i) = \begin{cases} m & \text{if } x_i \in l_i^m \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

For any point X, if the label of x_i = P_i X equals m, we can deduce x_i ∈ l_i^m. Pixels on the same line have the same label, i.e., the line's label. As can be seen from Algorithm 1, the matching process completes after a double loop. Given K spatial points and N label images, the time complexity is bounded by O(NK). The larger a line set is, the more probable it is that the spatial point corresponds to a 3D line segment. When the size of a line set is smaller than 2, which means a spatial point is re-projected into fewer than 2 images, we call such a line a pseudo contour. As shown in Fig. 3, the red lines ab and cd are detected in two different images, but they should not correspond to any 3D line segment because they are not real contours of the object. In order to remove pseudo contours, the size of a meaningful line set is restricted to at least 2 in our experiments (see more details in Section 4).
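As a rough illustration of how such a label image can be produced from the detected 2D segments, consider the following Python sketch. It is not the authors' implementation: the paper builds label images with a flood fill, whereas here each segment is simply stamped into an integer image with a small assumed half-width, and the segment representation (pairs of endpoints) is our own choice.

import numpy as np

def label_image(shape, segments, half_width=1):
    """Rasterize detected 2D segments into a label image (cf. Eqn. (4)): pixels
    covered by the m-th segment get the value m, all remaining pixels stay 0.
    Where segments overlap, the later one overwrites the earlier one."""
    labels = np.zeros(shape, dtype=np.int32)
    for m, ((r0, c0), (r1, c1)) in enumerate(segments, start=1):
        steps = int(max(abs(r1 - r0), abs(c1 - c0))) + 1
        for t in np.linspace(0.0, 1.0, steps):
            r = int(round(r0 + t * (r1 - r0)))
            c = int(round(c0 + t * (c1 - c0)))
            labels[max(r - half_width, 0):r + half_width + 1,
                   max(c - half_width, 0):c + half_width + 1] = m
    return labels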
3 3D Line Segment Generation
Given a 3D line segment L, suppose X_k is a point on L and x_{ik} is the re-projection of X_k into the i-th image. Considering every 2D line segment that contains x_{ik}, a line l can be the best re-projection line of L only if it contains more re-projection points than any other line. The endpoints of L can then be deduced from the re-projection set on l. Given a re-projection point x and a 2D line segment l, the indicator function E is defined as

$$E(x, l) = \begin{cases} 1 & \text{if } x \in l \\ 0 & \text{otherwise} \end{cases} \quad (5)$$
Fig. 3. An illustration of pseudo contour in planar images
Algorithm 1. Weak matching
Input: 3D points {X_k}, projection matrices {P_i}, line segment label images {Î_i}.
Output: list out_{l_i^m} of attached points for each l_i^m; list out_X of attached lines for each point X.
1: foreach point X in {X_k} do
2:   foreach label image Î_i in {Î_i} do
3:     x_i ← P_i X;
4:     if Î_i(x_i) == m then
5:       add x_i to out_{l_i^m};
6:       add l_i^m to out_X;
7:     endif
8:   endfor
9: endfor
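To make the double loop of Algorithm 1 concrete, a minimal Python sketch is given below. The data layout (dictionaries out_X and out_l keyed by point index and by (image, line) indices), the project helper and the rounding to integer pixels are assumptions of this sketch, not part of the paper.

import numpy as np

def project(P, X):
    """Project a 3D point X (length-3 array) with a 3x4 camera matrix P and
    return the integer pixel position as (row, col)."""
    x = P @ np.append(X, 1.0)
    return int(round(x[1] / x[2])), int(round(x[0] / x[2]))

def weak_matching(points, cameras, label_images):
    """points: list of 3D points; cameras: list of 3x4 projection matrices;
    label_images: list of integer arrays where value m > 0 marks line l_i^m.
    Returns out_X (lines hit by each point) and out_l (points attached to each line)."""
    out_X = {k: [] for k in range(len(points))}
    out_l = {}
    for k, X in enumerate(points):                                    # loop over K spatial points
        for i, (P, labels) in enumerate(zip(cameras, label_images)):  # loop over N views
            r, c = project(P, X)
            if 0 <= r < labels.shape[0] and 0 <= c < labels.shape[1]:
                m = labels[r, c]
                if m != 0:                                            # x_i falls on line l_i^m
                    out_l.setdefault((i, m), []).append((k, (r, c)))
                    out_X[k].append((i, m))
    # discard pseudo contours: points whose line set has fewer than 2 entries
    out_X = {k: v for k, v in out_X.items() if len(v) >= 2}
    return out_X, out_l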
Let A(i, m) be the statistical term counting the re-projections of all points onto the m-th line segment of the i-th image,

$$A(i, m) = \sum_{k=1}^{K} E(x_{ik}, l_i^m) \quad (6)$$
where K is the number of 3D points, x_{ik} is the re-projection of spatial point X_k into the i-th image, and l_i^m is the m-th line segment in the i-th image. A 3D point can be re-projected into N different images, each of which has M_i lines. The best re-projection line is found by

$$(i^*, m^*) = \operatorname*{argmax}_{1 \le i \le N,\ 1 \le m \le M_i} A(i, m) \quad (7)$$
Once the best re-projection line l_{i*}^{m*} is determined by Eqn. (7), we take the re-projection points nearest to the endpoints of l_{i*}^{m*} as the endpoints of the 3D line; the 3D line segment is then obtained by back-projection. Let {x} be the re-projection set on the best line segment p_1p_2, i.e., l_{i*}^{m*}, and let q_1, q_2 be the endpoints of the projection of the 3D line segment Q_1Q_2 in the best view. The coordinates of q_1 and q_2 are then

$$q_1 = \operatorname*{argmin}_{x_k \in \{x\}} \operatorname{dist}(x_k, p_1), \qquad q_2 = \operatorname*{argmax}_{x_k \in \{x\}} \operatorname{dist}(x_k, p_1) \quad (8)$$

where dist(·, ·) is the Euclidean distance between two points. The coordinates of the endpoints of the 3D line are obtained by back-projection, i.e., Q_1 = P_{i*}^{-1} q_1 and Q_2 = P_{i*}^{-1} q_2. Then a 3D line segment Q_1Q_2 is created.
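Continuing the sketch above, the selection of the best re-projection line (Eqns. (6)-(7)) and the recovery of the 2D endpoints (Eqn. (8)) could be written as follows. This is only an illustration: it reuses the assumed out_X/out_l structures, and the endpoint p1 of the detected 2D segment is passed in by the caller, since the LSD segment geometry is not part of the sketch.

def best_reprojection_line(k, out_X, out_l):
    """Eqns. (6)-(7): among the lines weakly matched to 3D point k, pick the one
    hit by the largest number of re-projected points, i.e. argmax A(i, m)."""
    return max(out_X[k], key=lambda im: len(out_l[im]))

def segment_endpoints(pts, p1):
    """Eqn. (8): among the re-projection points pts on the best line, take the one
    nearest to the 2D segment endpoint p1 as q1 and the farthest as q2."""
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    q1 = min(pts, key=lambda x: dist(x, p1))
    q2 = max(pts, key=lambda x: dist(x, p1))
    return q1, q2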
4 False Matching Removal
The new matching model permits partial matching between line segments. Such a model works well for the simple planar case but may fail in complex scenes.
Fig. 4. An illustration of re-projection ambiguity
False matching occurs when occlusions or large disparities exist, so that non-coplanar points may fall onto the same line segment after re-projection. Fig. 4 shows an example of re-projection ambiguity. The red line segments ab in the left image and cd in the right image may match partially. The green points x_1 falling on ab and x_2 falling on cd are re-projections of the spatial point X in two different images. With the algorithms discussed above, ab is selected as the best re-projection line segment because it maximizes the statistical term of re-projection points. However, ab corresponds to a false 3D line when back-projected to 3D space; the real best re-projection line segment is cd. In order to eliminate the ambiguity caused by re-projection, this paper uses a plane clustering algorithm [9] to extend the dimension. The discrete spatial points are fitted to a series of local planes, where a curved surface is approximated by several small planes. After the clustering process, each spatial point is assigned a plane index, and a re-projection point inherits the index of the spatial point it is projected from. A new constraint is added to the line segment detection process: all re-projection points on the best line must be coplanar with the reference point. The statistical term is modified as

$$\tilde{A}(i, m) = \sum_{k=1}^{K} E(x_{ik}^{p = p_X}, l_i^m) \quad (9)$$

where p and p_X are the plane indexes of the 3D point P_i^{-1} x_{ik} and of the reference point X, respectively. Algorithm 2 shows pseudo-code of the complete algorithm. We first search every line segment associated with the spatial point X for the best re-projection line, then obtain the two re-projection points that are nearest to the line's endpoints and coplanar with X. The 3D line segment is generated by back-projection. Once a best line is found and the re-projection endpoints q_1 and q_2 are determined, the other points that fall on the line segment l_{i*}^{m*} are represented by the line q_1q_2. After that, the status of the 3D points that fall into the interval [P_{i*}^{-1} q_1, P_{i*}^{-1} q_2] is changed to used.
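In the same spirit, the coplanarity-constrained score of Eqn. (9) could be computed as sketched below; point_plane is an assumed array holding the plane index that the clustering step [9] assigns to each 3D point.

def coplanar_best_line(k, point_plane, out_X, out_l):
    """Eqn. (9): count only re-projection points whose originating 3D point lies
    on the same fitted plane as the reference point k, then pick the argmax."""
    p_X = point_plane[k]
    scores = {}
    for (i, m) in out_X[k]:
        scores[(i, m)] = sum(1 for kk, _ in out_l[(i, m)] if point_plane[kk] == p_X)
    return max(scores, key=scores.get) if scores else None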
5 Experimental Results and Analysis
To evaluate the performance of the proposed algorithm, we use three image databases, which contain a planar model, an occluded model and a curved model. Bundler [13] and PMVS (Patch-based Multi-view Stereo) [5] are used to obtain both the camera projection matrices and the 3D point clouds.
Algorithm 2. 3D line segment detection
Input: 3D points {X_k}, projection matrices {P_i}, line segment label images {Î_i}.
Output: result list out of 3D line segments.
1: foreach point X in {X_k} do
2:   if status(X) == notUsed then
3:     x_i = P_i X;
4:     foreach line l_i^m in out_X do
5:       Ã(i, m) ← Sum(E(x_ik, l_i^m));
6:     endfor
7:     (i*, m*) ← argmax(Ã(i, m));
8:     if all points of out_{l_{i*}^{m*}} are coplanar with X then
9:       find (q_1, q_2) nearest to the line's endpoints;
10:      Q_1 = P_{i*}^{-1} q_1, Q_2 = P_{i*}^{-1} q_2;
11:      add line segment Q_1Q_2 to out;
12:      status of points in [P_{i*}^{-1} q_1, P_{i*}^{-1} q_2] ← Used;
13:    endif
14:  endif
15: endfor
The LSD detector is used for 2D line segment detection; it has been shown to be highly efficient and precise [7]. Label images are created by a flood fill algorithm, and all 2D line segments in one image correspond to one label image. When searching the 3D point cloud for the best re-projection line, we do not need to investigate images that do not contain the reference spatial point. The homography between points and images (stored in the patch file) is used to decrease the number of images searched. All experiments are executed on a PC with an Intel(R) Core(TM) 2.0 GHz processor and 2 GB memory.
5.1 Simple Planar Structure Data
We take the benchmark data set of Wadham College (5 images of 1024 × 768) for testing. The experimental result is shown in Fig. 5. Fig. 5(a) is one of the original images and Fig. 5(b) is the flood fill result of the 2D line segments. The white lines are the results of LSD and the red points are the re-projections of the spatial points that are visible in this image. Each line contains at least one point; a line segment corresponds to a 3D line segment under the condition that it contains at least two points. The result of 3D line segment detection is shown in Fig. 5(c), where the contour of the building can be seen clearly, although some 2D line segments in Fig. 5(b) may not correspond to any 3D line. The number of final 3D lines is 564 and the total running time is 0.03 s. Fig. 5(d) is the sketch rendering view drawn with both the spatial points and the 3D line segments.
5.2 Complex Building with Occlusions
The effect of plane clustering for false matching removal is verified on two sets of images with occlusions, as shown in Fig. 6.
Fig. 5. The result of simple planar building
The first row is the benchmark of Hall (31 images of 3008 × 2000), and the second row is Korea Town (40 images of 2240 × 1488) captured by ourselves. Both have occlusions or large disparities. The first column of Fig. 6 shows one of the original images, and the second column shows the result of 3D line segment detection without the coplanarity constraint. The re-projection ambiguity causes a number of false matches and produces some wrong lines, e.g., non-coplanar lines near the windows of the Hall model and at the bottom of the Korea Town model. The third column shows the final result with the coplanarity constraint, in which most false lines have been removed and the contours of the buildings are clear. We provide a non-photorealistic rendering mode where point coloring is used to give an impression of the scene appearance and the detailed geometry is drawn by 3D line segments. The final result of the Hall model on 61 images is shown in Fig. 7. The upper row shows the sketch rendering model from 3 different sides. The lower part depicts partial enlarged details of the building.
5.3 More Challenging Data
The following experiment tests models with curves. As the curved surfaces are fitted by many small planes in the clustering process, the coplanarity constraint is weakened there: long lines on the surface are cut into many short ones. Once a short line segment is found to be coplanar, the corresponding 3D line segment is created.
Fig. 6. The result of buildings with occlusions
Fig. 7. The enlarged details of the result of Hall with occlusions
Fig. 8. The result of buildings with curves and complex surfaces
Fig. 8 shows the results for buildings with curves and complex surfaces. The first row is the Press Building (100 images of 2128 × 1416). The right side of the building consists of a large curved surface, while the left side is a planar structure. The contour of the building is clear although the long lines are cut. The second row is the Bell Tower, which contains 137 images of the same resolution. The main structural information is detected and few false lines remain in the model. Such a sketch effect is good enough to be acceptable.
5.4 Analysis
To detect 2D line segments in hundreds of images, we use the fast 2D line segment detector, which takes less than 1 second to process an image of 512 × 512 [7].
Table 1. Details of line segment detection and computational cost (s)

Data set        Ni    Np       Nl      Tm    Tl    Ttotal
Wadham College  5     6,409    564     0.03  0.00  0.03
Hall            31    212,272  6,094   1.16  0.02  1.18
Korea Town      40    59,297   7,129   1.47  0.03  1.50
Bell Tower      137   119,536  9,672   3.02  0.03  3.05
Press Building  100   196,607  23,844  3.42  0.06  3.48
Compared to traditional algorithms, our algorithm reduces the time complexity from O(M^N) to O(NK), where N is the number of images, K is the number of 3D points and M is the approximate number of lines per image. According to [4, 15], a geometric constraint can be introduced to restrict the search area to the epipolar beams, which reduces the number of lines M per image, e.g., the search space of lines in one image decreases by 5.55 [3]. When matching lines between images, a set of other images, e.g., the 32 closest cameras, is used for the computation; this set contains images whose cameras are close to each other, which reduces the number of images N. Similarly, we only consider points that can be viewed in more than 4 images. Once a match is found, the points re-projected onto the matched lines are eliminated. Instead of calculating the intensity coherence as in previous methods, we re-project spatial points onto the 2D image planes and find the matched line segments corresponding to the same 3D point. This is accomplished by a labeling process, which turns out to be simple and highly efficient. Table 1 shows the details of the experimental results. N_i, N_p and N_l are the numbers of input images, 3D points and output 3D line segments, respectively. T_m, T_l and T_total are the costs (in seconds) of 2D line segment matching, 3D line segment generation and the total running time, respectively. One can see from Algorithm 2 that 3D line segment generation is implemented in a double loop. All costs of 3D line segment generation are close to 0, and 2D line matching over 100 images costs less than 4 seconds.
6 Conclusion
2D line segment matching across multiple views is very time-consuming due to the combinatorial expansion of matching. In this paper we propose a re-projection based 2D line matching and 3D line detection algorithm using a weak matching model between 2D lines and 3D points. The algorithm first re-projects the 3D spatial point cloud onto each camera plane to find the best re-projection line, and then derives a 3D line segment by back-projection. In addition, a plane clustering algorithm is employed to remove matching errors caused by re-projection ambiguity. Experimental results have demonstrated the efficiency and effectiveness of our algorithm for 3D line detection. It obtains satisfactory results for planar models and performs robustly on curved and occluded models. In future work, we plan to apply our method to image-based modeling and rendering systems for large-scale scenes.
Acknowledgement. This work was supported by NSFC fund (60873085) and National Hi-Tech Development Programs under grants No. 2007AA01Z314 and No. 2009AA01Z332, P. R. China.
References 1. Agarwal, S., Snavely, N., Simon, I., Seitz, S., Szeliski, R.: Building rome in a day. In: Proc. ICCV, pp. 72–79 (2009) 2. Bay, H., Ferrari, V., Gool, L.: Scene segmentation using the wisdom of crowds. In: Proc. CVPR, pp. 329–336 (2005) 3. Eden, I., Cooper, D.: Inferring temporal order of images from 3d structure. In: Proc. ECCV, pp. 172–185 (2008) 4. Fan, B., Wu, F., Hu, Z.: Line matching leveraged by point correspondences. In: Proc. CVPR, pp. 390–397 (2010) 5. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. IEEE Tran. on PAMI 32, 1362–1376 (2010) 6. Furukawa, Y., Curless, B., Seitz, S., Szeliski, R.: Towards internet-scale multi-view stereo. In: Proc. CVPR, pp. 1434–1441 (2010) 7. Gioi, R., Jakubowicz, J., Morel, J., Randall, G.: Lsd: A fast line segment detector with a false detection control. IEEE Tran. on PAMI 32, 722–732 (2010) 8. Heuel, S., Forstner, W.: Matching, reconstructing and grouping 3d lines from multiple views using uncertain projective geometry. In: Proc. CVPR, pp. 517–524 (2001) 9. Ning, X., Zhang, X., Wang, Y., Jaeger, M.: Segmentation of architecture shape information from 3d point cloud. In: Int. Conf. on Virtual Reality Continuum and its Applications in Industry, pp. 127–132 (2009) 10. Schmid, C., Zisserman, A.: Automatic line matching across views. In: Proc. CVPR, pp. 666–671 (1997) 11. Schmid, C., Zisserman, A.: The geometry and matching of lines and curves over multiple views. Int. Journal of Computer Vision 40, 199–233 (2000) 12. Sinha, N.S., Steedly, D., Szeliski, R.: Piecewise planar stereo for image-based rendering. In: Proc. ICCV, pp. 1881–1888 (2009) 13. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: exploring photo collections in 3d. ACM Transactions on Graphics 25, 835–846 (2006) 14. Snavely, N., Seitz, S., Szeliski, R.: Modeling the world from internet photo collections. Int. Journal of Computer Vision 80, 189–210 (2008) 15. Snavely, N.: Scene reconstruction and visualization from internet photo collections. PhD thesis, University of Washington (2008) 16. Wang, L., Neumann, U., You, S.: Wide-baseline image matching using line signatures. In: Proc. ICCV, pp. 1311–1318 (2009) 17. Woo, D., Han, S., Park, D., Nguyen, Q.: Extraction of 3d line segment using digital elevation data. Int. Congress on Image and Signal Processing, 734–738 (2008) 18. Woo, D., Park, D.: 3d line segment detection based on disparity data of area-based stereo. In: Proc. of WRI Global Congress on Intelligent Systems, vol. 4, pp. 219–223 (2009)
Multi-View Stereo Reconstruction with High Dynamic Range Texture Feng Lu, Xiangyang Ji, Qionghai Dai, and Guihua Er TNList and Department of Automation, Tsinghua University, Beijing, China
Abstract. In traditional 3D model reconstruction, the texture information is captured in a certain dynamic range, which is usually insufficient for rendering under new environmental light. This paper proposes a novel approach for multi-view stereo (MVS) reconstruction of models with high dynamic range (HDR) texture. In the proposed approach, multi-view images are first taken with different exposure times simultaneously. Corresponding pixels in adjacent viewpoints are then extracted using a multi-projection method to robustly recover the response function of the camera. With the response function, pixel values in the differently exposed images can be converted to the desired relative radiance values. Subsequently, geometry reconstruction and HDR texture recovery can be achieved using these values. Experimental results demonstrate that our method can recover the HDR texture for the 3D model efficiently while keeping high geometric precision. With our reconstructed HDR texture model, high-quality scene re-lighting is exemplarily exhibited.
1 Introduction
A 3D model is a mathematical representation of the 3D surface of objects in a scene, created via specialized methodologies. It can be displayed with adequate appearance information of virtual objects through 3D rendering. Therefore, 3D models are widely used in applications such as industrial design, 3D video games, digital film-making, and advanced education. With the rapid development of computer vision and computer graphics, many novel technologies for reconstructing 3D models of real-world objects are coming into sight, among which multi-view stereo (MVS) has drawn more and more attention from researchers in recent years [17]. MVS reconstructs a watertight 3D model from multiple images captured at different viewpoints. Existing MVS solutions only capture multi-view images in a fixed low dynamic range (LDR). These images cannot represent the complete scene radiance due to the limited dynamic range determined by the exposure settings. Thus, the 3D model recovered from these images only has LDR texture, which is not applicable in certain applications, e.g., those that require rendering under varying environmental light. High dynamic range (HDR) imaging can be achieved with specially designed cameras [16]. However, such cameras are too expensive to be widely used in practice. As a result, most research works have focused on fusing temporally captured
multi-exposure images to generate HDR images [5, 7]. In these approaches, pixel alignment for moving objects is difficult; as a result, they are often inefficient for response function recovery and HDR image fusion. Besides, these methods cannot be directly applied to generate HDR texture in an MVS scenario. A multi-camera setup throws fresh light on HDR texture recovery based on inter-view correspondence. However, so far, no research work on HDR texture based MVS has been reported. This paper proposes a novel MVS algorithm allowing HDR texture recovery from multi-view LDR images. The approach mainly goes through three phases: multi-view multi-exposure image capture, film response recovery and image calibration, and HDR texture 3D model reconstruction, as shown in Fig. 1. As a result, without any additional hardware cost compared to traditional multi-camera based MVS, an HDR texture containing more information about the original scene radiance can be generated.
Fig. 1. Our proposed approach: the three phases to reconstruct 3D model with HDR texture
The remainder of the paper is organized as follows: Section 2 provides the background of HDR image recovery. In Section 3, the HDR texture 3D model reconstruction approach is described in detail. Section 4 presents experimental results and Section 5 concludes the paper.
2 Background
This section gives an overview of the imaging principle of conventional LDR imaging systems. Then several related works on radiance map recovery and HDR image/video generation are discussed.
2.1 Scene Radiance and Camera Response Function
The “scene brightness” in a real-world scene refers to the radiance L of the object surface patches in the scene, whereas what a typical imaging system really
measures is the sensor irradiance E, which indicates the "image brightness" and is given in [4]:

$$E = L \,\frac{\pi}{4} \left(\frac{d}{f}\right)^2 \cos^4 \alpha \quad (1)$$

where d is the diameter of the aperture, f is the focal length, and α is the angle between the optical axis and the ray from the object patch to the center of the lens. With the reasonable assumption that the terms cos^4 α and (d/f)^2 do not vary across pixels, the image irradiance E is actually proportional to the scene radiance L. During filming, within a particular time referred to as the exposure time Δt, the total energy collected by the sensor for a pixel is:

$$X = E \Delta t = L \,\frac{\pi}{4} \left(\frac{d}{f}\right)^2 \cos^4 \alpha \, \Delta t = k L \Delta t \quad (2)$$

Thus X is the exposure amount. Imaging systems record the exposure amount and convert it into an image value Z, as Z = f(X). Here, the function f is called the response function and is designed to be non-linear. It is reasonable to assume that f is monotonically increasing, so f^{-1} can be defined. Hence, knowing the exposure time Δt, X can be calculated from the image value Z by X = f^{-1}(Z). And since X/Δt is proportional to the scene radiance L, f^{-1}(Z)/Δt can be regarded as the relative scene radiance.
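Expressed in code, this relation is a one-line lookup; the sketch below assumes the inverse response f^{-1} is available as a 256-entry table, which is not something the paper specifies.

def relative_radiance(z, dt, f_inv):
    """X = f^{-1}(Z) is the exposure amount; X / dt is proportional to the scene radiance L."""
    return f_inv[int(z)] / dt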
2.2 HDR Recovering
The increasing demand for HDR images and videos has inspired many solutions in the past years. For a typical digital system, the available number of values per pixel is limited, e.g., 256 levels in an 8-bit system, and thus the dynamic range of the captured scene radiance is narrowed. The most straightforward idea is to develop novel sensors with built-in HDR functions instead of conventional cameras [14, 15]. However, these special sensors are expensive and not widely available. As a result, most research works focus on how to convert pixel values to relative scene radiance values, using the camera response function obtained from multiple images of the same scene captured with different exposures. Mann and Picard model the response function by Z = α + βX^γ [12]. Their solution is simple and feasible, but the recovery precision is unsatisfactory because of the highly restricted function form. Debevec and Malik propose a method that only requires a smoothness assumption [2]. With known exposure times for each image, their method achieves the most accurate results. Mitsunaga and Nayar consider the situation where the exposure times are unobtainable and model the response functions with high-order polynomials, which can approximate the response functions accurately enough [13]. These methods require highly accurate pixel alignment across different images, which usually cannot be guaranteed in dynamic scene capture because
of object motion. Jacobs et al. propose an LDR image alignment method to deal with both object and camera movements [5]. Nevertheless, they have to assume that the movements are small and introduce only a Euclidean transformation. Kang et al. recover HDR video using carefully implemented motion estimation [7]. However, artifacts such as the ghost phenomenon still exist in fast-moving scenes. Recently, Lin et al. presented a novel method to generate HDR images for stereo vision [9]. Motion estimation is avoided in this method, but a similar pixel alignment problem between different viewpoint images remains, which makes it difficult to handle multi-view images with large disparities.
3 The Approach
This section presents our MVS approach for recovering an HDR texture 3D model from multi-view LDR images. The approach is mainly composed of three stages (see Fig. 1): multi-view multi-exposure image capture, film response recovery and image calibration, and HDR texture 3D model reconstruction.
3.1 Multi-view Multi-exposure Image Capture
In our approach, a multi-camera system is used to capture the multi-view images for MVS reconstruction, and the exposure times are set to be different for adjacent cameras in order to jointly recover the HDR texture of the scene objects.

Capture system. To support the multi-view stereo approach, we have developed a multi-camera multi-light dome for multi-view capture. Inside the dome, 20 cameras are available, located evenly on a ring as shown in Fig. 2(a). Multi-view images/videos with a spatial resolution of 1024 × 768 can be captured simultaneously. Moreover, the exposure time of each camera can be configured individually, which meets the requirement of HDR texture recovery.
Fig. 2. Multi-view multi-exposure image capture: (a) dome system, cameras in the near-side are marked with circles; (b) camera setups: “A” and “B” represent cameras with two different exposure times; (c) captured multi-view multi-exposure images
Camera setup. All 20 cameras are divided into 2 groups and used for multi-view capture. Cameras in the first group have exposure time Δt1, while those in the second group have exposure time Δt2. The values of Δt1 and Δt2 are chosen to capture as large a contiguous dynamic range of the scene radiance as possible. Selecting the cameras for each group is another key point. As shown in Fig. 2(b), every 4 neighboring cameras starting from position 2k − 1, k = 1, 2, . . ., constitute a subgroup, which contains three cameras with the same exposure time and one camera with the other exposure time. For instance, the first 3 subgroups in Fig. 2(b) are "AABA", "BABB" and "BBAB". Notice that the camera with the unique exposure time in a subgroup never appears at the subgroup border, and the total numbers of cameras in the two groups remain the same. Examples of captured multi-view multi-exposure images are shown in Fig. 2(c): images from 4 adjacent viewpoints of the image set "plaster figure", captured under the described camera setup.
3.2 Film Response Recovery and Image Calibration
Under our camera setup, multi-view multi-exposure images are captured. In different images, the corresponding pixels that record the same scene point differ greatly in value and image position. A multi-projection method is designed to accurately extract corresponding pixels from the different multi-view multi-exposure images; film response recovery and image exposure calibration can then be performed. Here we assume a controlled environment in which the cameras can be easily geometrically calibrated, and in this case the camera response functions could also be recovered by other pre-calculation methods. However, in an uncontrolled environment such as outdoor scene capture, our technique becomes essential together with other techniques such as self-calibration to achieve the goal.

Pixel correspondence extraction. Response function recovery requires highly accurate pixel alignment. For single-viewpoint HDR imaging, the positions of corresponding pixels in different images are the same. However, for multi-view imaging, the image positions of the same scene point vary across different images. For our captured multi-view multi-exposed images, take one subgroup defined in Section 3.1 as an example. As shown in Fig. 3, images A, B, and D have the same exposure time, which is different from that of image C. For any specified pixel p_b in image B, an optical ray starting from p_b and passing through the camera center can be calculated using the camera parameters. The real-world scene point x recorded by p_b is located on this ray. The only unknown variable needed to determine the real-world coordinate of x is the distance d from the camera center to x along the optical ray. To enable an effective search for the value of d, the visual hull [3] of the target object is first computed to restrict the search range. Since generating the visual hull only requires silhouette information of the object, the multi-exposed images can be used directly before exposure calibration.
Fig. 3. Correspondence extraction in multi-exposed images
Within the restricted range along the optical ray, possible 3D points are evenly sampled first. Each candidate x′ is projected back to pixels p_a and p_d in images A and D. Since p_a, p_b and p_d are filmed with the same exposure time, pixel consistency can be easily evaluated. Here, the zero mean normalized cross correlation (ZNCC) is used as the evaluation metric. For two square windows of N × N centered on pixels p and q in two images, the ZNCC is computed by:

$$ZNCC(p, q) = \frac{\sum_{i=1}^{N^2} (z_p(i) - \bar{z}_p) \cdot (z_q(i) - \bar{z}_q)}{\sqrt{\sum_{i=1}^{N^2} (z_p(i) - \bar{z}_p)^2 \cdot \sum_{i=1}^{N^2} (z_q(i) - \bar{z}_q)^2}} \quad (3)$$
where z_p(i) and z_q(i) are the values of the i-th pixels inside the N × N regions centered on p and q, while $\bar{z}_p = (\sum_{j=1}^{N^2} z_p(j))/N^2$ and $\bar{z}_q = (\sum_{j=1}^{N^2} z_q(j))/N^2$ denote the average pixel values in these regions. In our case, two ZNCC curves are calculated over the different candidates x′: one for the areas centered on p_b and p_a, and the other for the areas centered on p_b and p_d. According to [19], a local maximum method is used to find the best corresponding solution from the ZNCC curves. Finally, with the best x′, the corresponding pixel values of p_b and p_c are stored as a corresponding value pair (z_b, z_c). Note that under our camera setup, three images with the same exposure time can be used to determine the scene point position, which increases the robustness of the algorithm. Also note that only the pixel values of p_b and p_c are finally kept; this is because images B and C are the most correlated and closest to the optical ray, and thus they are most useful for film response recovery. Fig. 4(a) illustrates the corresponding pixels extracted by our method for multi-view multi-exposure images. Only a small portion (1%) of the total pixel correspondences is drawn to make them easy to identify. Fig. 4(b) gives another result obtained with the SIFT descriptor [8], which is used to match pixels of gray-scale stereoscopic images in [9]. It is obvious that our method achieves much higher precision and that the number of effective points is more than sufficient. On the other hand, although the incorrect matches in the SIFT-based method can be partially removed by adding other constraints, this would result in an inadequate number of corresponding points for film response function recovery. This result proves the effectiveness of our geometry-constraint based method.
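For reference, the ZNCC evaluation and the depth scan along the optical ray could look like the following sketch; the sample_point and project_* helpers, the plain argmax in place of the local-maximum selection of [19], and the fixed window size are assumptions made only for illustration.

import numpy as np

def zncc(patch_p, patch_q):
    """Eqn. (3): zero mean normalized cross correlation of two equally sized patches."""
    if patch_p.shape != patch_q.shape:
        return -1.0
    zp = patch_p - patch_p.mean()
    zq = patch_q - patch_q.mean()
    denom = np.sqrt((zp ** 2).sum() * (zq ** 2).sum())
    return float((zp * zq).sum() / denom) if denom > 0 else 0.0

def best_depth(depths, sample_point, project_a, project_d, img_a, img_b, img_d, p_b, n=5):
    """Scan candidate depths d along the optical ray of pixel p_b in image B and keep
    the one whose back-projections into the equally exposed images A and D agree best
    with the patch around p_b."""
    def patch(img, rc):
        r, c = rc
        return img[r - n:r + n + 1, c - n:c + n + 1].astype(float)
    ref = patch(img_b, p_b)
    scores = [zncc(ref, patch(img_a, project_a(sample_point(d)))) +
              zncc(ref, patch(img_d, project_d(sample_point(d)))) for d in depths]
    return depths[int(np.argmax(scores))]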
Fig. 4. Correspondence matching results: (a) our method and (b) SIFT-based method
Film response recovery. To recover the camera response function, we use the method proposed by Debevec and Malik [2], since the exposure times in our system are available. The extracted corresponding pixels offer a large number of corresponding value pairs, denoted as Z = {(z_Δt1, z_Δt2)}. It is usually not necessary to use all the values for recovering the response function. On the other hand, due to noise and mismatches, not all value pairs are valid. In order to remove inaccurate value pairs, outliers are detected by examining whether z_Δt1 or z_Δt2 deviates beyond certain thresholds. Considering the filming characteristics and the saturation phenomenon, for exposure times Δt1 > Δt2, reasonable thresholds may be:

$$threshold(z_{\Delta t1}) = \begin{cases} c & \text{for } 0 < z_{\Delta t1} < 200 \\ c + (z_{\Delta t1} - 200) & \text{for } z_{\Delta t1} > 200 \end{cases} \qquad threshold(z_{\Delta t2}) = \begin{cases} c & \text{for } z_{\Delta t2} > 50 \\ c + (50 - z_{\Delta t2}) & \text{for } 0 < z_{\Delta t2} < 50 \end{cases} \quad (4)$$

where c is a small constant that indicates the basic tolerance of inaccuracy. After the inaccuracy removal, N value pairs are randomly selected from Z to form a new set Z* = {(z*_Δt1(i), z*_Δt2(i))}, i = 1, . . . , N, with the only constraint that they should be well distributed. Using the data in Z*, the camera response function g(z) can be computed by solving the problem:

$$\min \left\{ \sum_{i=1}^{N} \sum_{j=1}^{2} \left[ g(z^*_{\Delta t_j}(i)) - \ln(E_i) - \ln(\Delta t_j) \right]^2 + \lambda \sum_{z=1}^{254} \left[ w(z)\left( g(z-1) - 2g(z) + g(z+1) \right) \right]^2 \right\} \quad (5)$$

where w(z) is the weight function introduced in [2]. Here we have assumed that all cameras in our capture system have the same response function g(z). This assumption is reasonable in our case because all cameras in our dome system are of the same type and color calibration is done before capturing.
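A compact least-squares version of this recovery (in the spirit of [2]) is sketched below; the hat-shaped weight, the gauge-fixing constraint g(128) = 0 and the matrix layout are our own simplifications, not the exact procedure of the paper.

import numpy as np

def solve_response(z1, z2, dt1, dt2, lam=10.0, n_levels=256):
    """Estimate the log response g(z) from corresponding 8-bit values z1, z2 of the
    same scene points observed with exposure times dt1, dt2 (cf. Eqn. (5))."""
    w = lambda z: min(z, (n_levels - 1) - z) + 1.0        # hat-shaped weight
    n = len(z1)
    A = np.zeros((2 * n + n_levels - 2 + 1, n_levels + n))
    b = np.zeros(A.shape[0])
    row = 0
    for i, (za, zb) in enumerate(zip(z1, z2)):
        for z, dt in ((int(za), dt1), (int(zb), dt2)):
            wij = w(z)
            A[row, z] = wij                # coefficient of g(z)
            A[row, n_levels + i] = -wij    # coefficient of ln(E_i)
            b[row] = wij * np.log(dt)
            row += 1
    for z in range(1, n_levels - 1):       # smoothness term on g'' weighted by w(z)
        A[row, z - 1:z + 2] = lam * w(z) * np.array([1.0, -2.0, 1.0])
        row += 1
    A[row, n_levels // 2] = 1.0            # fix g(128) = 0 to remove the free offset
    solution = np.linalg.lstsq(A, b, rcond=None)[0]
    return solution[:n_levels]             # g(z) for z = 0 .. 255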
Image calibration. With the recovered response function g(z), exposure calibration is easy to perform. For an image with exposure time Δt, the relative radiance E of any pixel value z can be estimated by ln(E) = g(z) − ln(Δt). Similarly, all differently exposed images can be converted to relative radiance maps. In practice, to ensure compatibility with the following steps, we simply convert images from exposure time Δt1 to Δt2 by modifying each pixel value by z_{Δt2} = g^{-1}(ln(Δt2) + g(z_{Δt1}) − ln(Δt1)). It is also feasible to convert all images to a new exposure time Δt3. The exposure calibration results are included in Section 4.1; they demonstrate that exposure-calibrated multi-view images can be obtained satisfactorily, so that the effectiveness of 3D model reconstruction via MVS can be guaranteed.
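In code, this per-pixel conversion is again a small lookup; g_inv below is an assumed inverse mapping (e.g. realized by searching the monotonic response table), which the paper does not detail.

import math

def calibrate_pixel(z, dt_src, dt_dst, g, g_inv):
    """Convert a pixel value recorded at exposure dt_src to the value it would have
    at exposure dt_dst: z' = g^{-1}(ln(dt_dst) + g(z) - ln(dt_src))."""
    return g_inv(math.log(dt_dst) + g[int(z)] - math.log(dt_src))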
3.3 HDR Texture 3D Model Reconstruction
In this step, the exposure-calibrated multi-view multi-exposure images are used to reconstruct the 3D model of the real-world object with an MVS algorithm. Then, the HDR texture of the model is generated by comprehensively utilizing the reconstructed geometry, the recovered camera response function, and the color information from the original LDR images.

Geometry reconstruction. We adopt the point cloud based multi-view stereo algorithm (PCMVS) [10] to reconstruct the 3D model. PCMVS belongs to the multi-stage local processing class of MVS and produces outstanding performance among current MVS algorithms under sparse viewpoint setups. PCMVS consists of three main steps: point cloud extraction, merging and meshing. The decoupled design of these steps allows modifying and using any of the modules independently, which is another advantage of adopting PCMVS in our approach. Exposure-calibrated images usually contain "textureless" and noisy areas, which may cause MVS algorithms to fail. This is similar to the problems caused by shading effects or lack of texture that traditional MVS methods face. In PCMVS, robust methods are designed to solve this kind of problem; we simply extend the functional range of these methods to cover the new problem. Results shown in Section 4.2 demonstrate the effectiveness of the adopted PCMVS algorithm for geometry reconstruction from exposure-calibrated images.

HDR texture recovery. Once the 3D model is reconstructed, it can be used to generate the HDR texture. More precisely, we recover the color information for each vertex v on the 3D mesh. The color value of a vertex v is calculated by fusing its corresponding pixel colors in multiple images. Note that it is not known beforehand which images contain the corresponding pixels of v, so the visibility of v in each image has to be computed first. Then, knowing that v is visible in the images {p_j} and that the corresponding pixel values are {z_j}, the relative
radiance value of v is computed as the weighted average of the relative radiance values of each p_j:

$$\ln(E_v) = \frac{\sum_j w(z_j)\,\big(g(z_j) - \ln(\Delta t(p_j))\big)}{\sum_j w(z_j)} \quad (6)$$
where w(z_j) is the same weight function used in film response recovery. In our implementation, two pre-processing stages improve the recovery quality. First, the images {p_j} in which v is visible are sorted by the view angle between the vertex normal of v and the optical ray from v to the corresponding camera center. Images with smaller view angles are moved closer to the top of the queue, because a small view angle is more likely to guarantee color consistency. In practice, the 2-4 images with the smallest view angles are used to calculate the relative radiance value. The second stage deals with the ghost phenomenon. The ghost phenomenon is one of the most common problems in HDR image/video generation and is frequently discussed [5, 6]. It is basically caused by the misalignment of pixels in multiple images due to object motion. In our approach all images are captured simultaneously, so motion is not a problem. Besides, multiple candidates for corresponding pixels are available, which allows robust color recovery by selecting valid pixel values and discarding outliers from {z_j}. The misaligned pixels are thus excluded and the ghost phenomenon is efficiently removed. Our method computes the relative radiance value for each vertex v properly. The relative radiance map is used as the HDR texture of the 3D model, either directly or after further processing. To generate LDR textures under a desired illuminance, one can simply look up pixel values in the camera response curve with hypothetical exposure times or refer to existing tone-mapping methods [1, 11].
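A minimal rendering of Eqn. (6) for a single vertex is given below; the visibility testing, view-angle sorting and outlier rejection described above are assumed to have been performed by the caller.

import math

def vertex_log_radiance(zs, dts, g, w):
    """Weighted average of g(z_j) - ln(dt_j) over the observations of one vertex,
    with g the recovered log response table and w the weight function."""
    num = sum(w(z) * (g[int(z)] - math.log(dt)) for z, dt in zip(zs, dts))
    den = sum(w(z) for z in zs)
    return num / den if den > 0 else 0.0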
4 Results
In this section, the performance of our proposed approach for HDR texture model reconstruction is tested and the results are presented. The experiments use multi-view multi-exposure images captured for static scenes, because comparative experiments are hard to carry out on dynamic scenes, which are basically unrepeatable. However, our approach can handle dynamic scenes directly, since each temporal model is constructed independently.
4.1 Recovery of Camera Response Curve
The response function is first recovered from our captured images. The 20 images are divided into 2 groups by their exposure times, and the algorithm introduced in Section 3.2 is applied to recover the camera response function.
Fig. 5. Camera response curves: (a) our result; (b) average curve of 20 cameras; (c) curve of view 1; (d) curve of view 2
The multi-view multi-exposure image set "plaster figure" (Fig. 2(c)), captured with two exposure times of 30 ms and 10 ms, is used as an example. The recovered average response curve for each color channel (R, G, and B) of the 20 cameras is illustrated in Fig. 5(a). To verify the feasibility of the recovered response curve, contrast experimental results are provided as references: response functions for each camera are computed separately by taking multiple exposed images with a single camera, so that precise results can be obtained. Fig. 5(b) shows the average of the 20 referential camera response curves. Although there are slight differences, the curves in Fig. 5(a) and Fig. 5(b) indicate that our method achieves a fairly satisfactory result in camera response recovery without requiring temporal multi-exposure capturing. Furthermore, Fig. 5(c) and (d) show the referential camera response curves of viewpoint 1 and viewpoint 2. In fact, each individual camera response curve looks similar to the others, so they can be substituted by the average curve. Therefore, using one recovered curve for all cameras in our approach is reasonable and feasible. With the recovered response curve, image exposure calibration can be done according to Section 3.2. Fig. 6 illustrates the calibration results of "plaster figure". Image (a) is captured with an exposure time of 30 ms, while (b) and (d) are captured with 10 ms. Images (c) and (e) are the results of exposure calibration; they now have the same exposure intensity as (a).
Fig. 6. Exposure calibration: (a), (b) and (d): original captured images in two different exposure times; (c) and (e): exposure calibration results of (b) and (d)
Clearly, the exposure calibration is accurate enough for 3D reconstruction. However, it should still be pointed out that, because of the low dynamic range, images that have only undergone exposure calibration cannot recover all the color information, and the texture in some regions, especially saturated areas, appears artificial and noisy. Generating HDR texture for 3D models aims to solve this problem and is precisely the motivation of our work.
4.2 Reconstruction of Geometry
Fig. 7 compares the geometry reconstruction results of visual hull [3], PCMVS [10] and our method. Our method uses the exposure-calibrated images as input, while both visual hull and PCMVS use images captured with a carefully determined single exposure time to ensure reconstruction quality. Obviously, visual hull performs the worst among these methods, while for PCMVS and our results it is hard to tell the difference even with careful observation. This proves that our method achieves the same geometry reconstruction accuracy as PCMVS while using multi-exposed images as input.
Fig. 7. Geometry reconstruction results. From left to right: original captured images; visual hull results; PCMVS results; our results.
Note that the geometry reconstruction module in our approach is totally decoupled from the other modules; hence our approach is flexible and can adopt any better MVS algorithm (e.g. [18]) to improve the reconstruction result.
4.3 Recovery of HDR Texture
The method for generating HDR texture was introduced in Section 3.3; the recovered relative radiance map is used as the HDR texture after taking the logarithm of its values. Fig. 8 illustrates the models of "plaster figure" with the HDR texture (remapped to 0-255) generated by our method and the LDR texture generated by the single-exposure method. It is obvious that the HDR texture contains more color information in both bright and dark regions.
Fig. 8. Results of HDR texture recovery and rendering for “plaster figure”. For the rendering results, the top and middle re-map the bright region, the bottom re-map the dark region.
Fig. 9. Another result of HDR texture recovery and rendering for “article objects”. Ellipse regions in the middle image show the local rendering results from HDRT model.
Luminance rendering results are further compared in Fig. 8. For the HDR texture, different value ranges of the logarithmic relative radiance map are remapped to 0-255 to form the new textures; for the LDR texture, the same ranges of the original 8-bit color values are scaled to 0-255 as well. As shown in Fig. 8, the first two rendering settings aim to extract the details in the bright areas. Clearly, the HDR texture rendered results reveal plenty of detail while the LDR texture fails. This is because the color information contained in the LDR texture is highly limited and depends on the exposure setting; bright areas with mostly saturated pixel values cannot offer any extra information for re-lighting. The last rendering results in Fig. 8 represent the textures under relatively strong illuminance. This
time the LDR texture rendered result is quite good, since the original LDR images were captured with a relatively long exposure time (30 ms). On the other hand, the HDR rendered result is still satisfactory. Fig. 9 shows contrast experimental results with the reconstructed model "article objects". The models with HDR and LDR texture are shown on the left and right, and the HDR texture rendered results can be seen in the three ellipses in the middle image. Notice the regions around the box and the book: most pixel values there are saturated in the LDR texture, which means that no details can be recovered. If a shorter exposure time were used to capture these details, the bag at the bottom would be totally invisible. On the contrary, the HDR texture keeps details in both the bright and dark regions within its high dynamic value range.
5 Conclusion
In this paper, we present an efficient approach for reconstructing 3D models with high dynamic range texture (HDRT) via multi-view stereo (MVS). Although both multi-view stereo and high dynamic range imaging attract a lot of attention from researchers, no joint research has been reported so far. Our approach includes three stages: multi-view multi-exposure image capture; high-precision camera response recovery and image calibration; and finally 3D model reconstruction with HDR texture. Experimental results have demonstrated the effectiveness of our approach, and also the superiority of using these HDR texture models in illuminance rendering. Future work will include enhancing the geometry reconstruction precision by adopting new MVS algorithms or even utilizing our multi-exposure images to improve the existing MVS algorithms.

Acknowledgement. This work has been supported by the National Basic Research Project of China (973 Program No. 2010CB731800), the Key Program of NSFC (No. 60933006), and the Program of NSFC (No. 60972013).
References 1. Drago, F., Martens, W.L., Myszkowski, K., Chiba, N., Rogowitz, B., Pappas, T.: Design of a tone mapping operator for high dynamic range image based upon psychophysical evaluation and preference mapping. In: HVEI 2003, pp. 321–331 (2003) 2. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: ACM SIGGRAPH 1997, pp. 369–378 (1997) 3. Franco, J.S., Lapierre, M., Boyer, E.: Visual shapes of silhouette sets. In: 3DPVT 2006, pp. 397–404 (2006) 4. Horn, B.K.: Robot Vision. McGraw-Hill Higher Education, New York (1986) 5. Jacobs, K., Loscos, C., Ward, G.: Automatic high-dynamic range image generation for dynamic scenes. IEEE Computer Graphics and Applications 28, 84–93 (2008) 6. Khan, E.A., Akyuz, A.O., Reinhard, E.: Ghost removal in high dynamic range images. In: ICIP 2006, pp. 2005–2008 (2006)
7. Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High dynamic range video. ACM Transactions on Graphics 22, 319–325 (2003) 8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 9. Lin, H.Y., Chang, W.Z.: High dynamic range imaging for stereoscopic scene representation. In: ICIP 2009, pp. 4305–4308 (2009) 10. Liu, Y., Dai, Q., Xu, W.: A point cloud based multi-view stereo algorithm for freeviewpoint video. IEEE Transactions on Visualization and Computer Graphics 16, 407–418 (2009) 11. Mantiuk, R., Daly, S., Kerofsky, L.: Display adaptive tone mapping. ACM Transactions on Graphics 27, 1–10 (2008) 12. Mann, S., Picard, R.W.: Being ‘undigital’ with digital cameras: extending dynamic range by combining differently exposed pictures. In: IST’s 48th Annual Confrence, pp. 422–428 (1995) 13. Mitsunaga, T., Nayar, S.K.: Radiometric self calibration. In: CVPR 1999, pp. 1374– 1380 (1999) 14. Nayar, S., Branzoi, V.: Adaptive dynamic range imaging: optical control of pixel exposures over space and time. In: Proc. ICCV 2003, pp. 1168–1175 (2003) 15. Nayar, S., Branzoi, V., Boult, T.: Programmable imaging using a digital micromirror array. In: CVPR 2004, pp. 436–443 (2004) 16. Nayar, S., Mitsunaga, T.: High dynamic range imaging: spatially varying pixel exposures. In: Proc. CVPR 2000, pp. 472–479 (2000) 17. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR 2006, pp. 519–528 (2006) 18. Vlasic, D., Peers, P., Baran, I., Debevec, P., Popovi´c, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. ACM Transactions on Graphics 28, 1–11 (2009) 19. Vogiatzis, G., Hern´ andez, C., Torr, P., Cipolla, R.: Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 2241–2246 (2007)
Feature Quarrels: The Dempster-Shafer Evidence Theory for Image Segmentation Using a Variational Framework Björn Scheuermann and Bodo Rosenhahn Institut für Informationsverarbeitung Leibniz Universität Hannover, Germany {scheuermann,rosenhahn}@tnt.uni-hannover.de
Abstract. Image segmentation is the process of partitioning an image into at least two regions. Usually, active contours or level set based image segmentation methods combine different feature channels, arising from the color distribution, texture or scale information, in an energy minimization approach. In this paper, we integrate the Dempster-Shafer evidence theory in level set based image segmentation to fuse the information (and resolve conflicts) arising from different feature channels. They are further combined with a smoothing term and applied to the signed distance function of an evolving contour. In several experiments we demonstrate the properties and advantages of using the Dempster-Shafer evidence theory in level set based image segmentation.
1 Introduction
Image segmentation is a popular problem in the field of computer vision. It is the first step in many fundamental tasks that pattern recognition and computer vision have to deal with. The problem has been formalized by Mumford and Shah as the minimization of a functional [1]. Level set representations or active contours [2–4] are a very efficient way to find minimizers of such a functional. As shown by many papers and textbooks on segmentation [2–8], a lot of progress has been made using these variational frameworks. However, image segmentation still faces several difficulties, which are usually caused by violations of model assumptions. For example, homogeneous [2] or smooth [1] object regions are usually assumed. Due to noise, occlusion, texture or shading, this is often not appropriate to delineate object regions. Statistical modeling of regions [5] and the supplement of additional information such as texture [6], motion [7] or shape priors [8] increase the number of scenes where image segmentation using variational frameworks succeeds. Another reason for failed segmentations is an unsuitable or false initialization, which is necessary to find a local minimizer of a segmentation functional. Earlier works on image segmentation using the Dempster-Shafer theory of evidence have been presented in [9–12] and in [13]. These works combined the evidence theory with either a simple thresholding [9], a decisional procedure
(a) Bayes   (b) Dempster-Shafer   (c) Bayes   (d) Dempster-Shafer
Fig. 1. Segmentation results using either the probabilistic Bayesian model or Dempster-Shafer evidence theory to fuse information
[10], a fuzzy clustering algorithm [11], a region merging algorithm [12] or a k-means clustering algorithm [13]. All the cited works use the idea of Dempster-Shafer to fuse the information arising from three color channels. In contrast to these works, we propose to combine the evidence theory with a variational segmentation framework, including statistical modeling of regions, the respective Euler-Lagrange equations and a smoothness term.
1.1 Contribution
In this paper we propose to use the Dempster-Shafer theory of evidence [14, 15] to combine several information sources in a variational segmentation framework. This theory of evidence is an elegant way to fuse information from different feature channels, e.g., color channels and texture information, and it offers an alternative to traditional probabilistic theory. Instead of using the probabilistic Bayesian model to fuse separate probability densities, we use the Dempster-Shafer evidence theory, which allows representing inaccuracy and uncertainty at the same time. A prerequisite for the Dempster-Shafer theory of evidence is a mass function. We show how to define these functions and combine them to derive the so-called belief function. This belief function represents the total belief in a hypothesis and can be integrated into a variational segmentation framework to assign more support to high probabilities (see Figure 1). Note that image segmentation is just our example application, but we strongly believe that the Dempster-Shafer evidence theory can be advantageous for other applications in computer vision or machine learning as well. Therefore, this theory is of general interest for our research community.
Paper Organization
In Section 2 we continue with a review of the variational approach for image segmentation, which is the basis for our segmentation framework, and a presentation of the Dempster-Shafer theory of evidence. Section 3 introduces the
proposed segmentation method which combines the variational segmentation method and the Dempster-Shafer theory of evidence. Experimental results in Section 4 demonstrate the advantages of the proposed method. The paper finishes with a short conclusion.
2 Background

2.1 Image Segmentation Using a Variational Framework
Our variational segmentation framework is based on the works of [2, 16]. Using the level set formulation for the general problem of image segmentation has several well-known advantages, e.g., the naturally given possibility to handle topological changes of the 1D boundary curve. This is especially important if the object is partially occluded by another object or if the object consists of multiple parts. In case of a two-phase segmentation, the level set function ϕ : Ω → R splits the image domain Ω into two regions Ω1, Ω2 ⊆ Ω with

$$\varphi(x) \begin{cases} \geq 0, & \text{if } x \in \Omega_1 \\ < 0, & \text{if } x \in \Omega_2 \end{cases} \qquad (1)$$

The zero-level line of the function ϕ(x) represents the boundary between the object, which is sought to be extracted, and the background. A successful remedy to extend the number of situations in which image segmentation can succeed is the use of additional constraints, for instance the restriction to a certain object shape [17, 18]. However, in this work we focus on segmentation without using shape priors. The basis for our segmentation framework is the so-called Chan-Vese energy functional [2]:

$$E(\varphi) = -\int_\Omega \big( H(\varphi)\log p_1 + (1-H(\varphi))\log p_2 \big)\, d\Omega + \nu \int_\Omega |\nabla H(\varphi)|\, d\Omega \,, \qquad (2)$$
where ν ≥ 0 is a weighting parameter between the given constraints, pi are probability densities and H(s) is a regularized Heaviside function. The regularized Heaviside function is needed to build the Euler-Lagrange equation and to make it possible to indicate at each iteration step to which region a pixel belongs. Given the two probability densities p1 and p2 of Ω1 and Ω2, the total a-posteriori probability is maximized by minimizing the first term, i.e., pixels are assigned to the most probable region according to the Bayes rule. The second term penalizes the length of the contour and acts as a smoothing term. Minimization of the Chan-Vese energy functional (2) can easily be performed by solving the corresponding Euler-Lagrange equation with respect to ϕ. This leads to the following partial differential equation
$$\frac{\partial \varphi}{\partial t} = \delta(\varphi)\left[\log\frac{p_1}{p_2} + \nu\,\operatorname{div}\left(\frac{\nabla \varphi}{|\nabla \varphi|}\right)\right], \qquad (3)$$
where δ(s) describes the derivative of H(s) with respect to its argument. Starting with some initial contour ϕ0 and given the probability densities p1 and p2, one has to solve the following initial value problem:

$$\begin{cases} \varphi(x, 0) = \varphi^0 & \text{for } x \in \Omega \\[4pt] \dfrac{\partial \varphi}{\partial t} = \delta(\varphi)\left[\log\dfrac{p_1}{p_2} + \nu\,\operatorname{div}\left(\dfrac{\nabla \varphi}{|\nabla \varphi|}\right)\right] \end{cases} \qquad (4)$$

Thus, the quality of the segmentation process is limited by the initial contour and the way the foreground and background probability densities p1 and p2 are modeled. In this paper we used nonparametric Parzen estimates [6], a well-known histogram-based method. This model was chosen since, compared to multivariate Gaussian Mixture Models (GMM), it leads to comparable results without estimating as many parameters. Other possibilities to model the probability densities given the image cues are, e.g., a Gaussian density with fixed standard deviation [2] or a generalized Laplacian [19]. For the proposed method it is necessary to extend this segmentation framework from gray scale images to feature vector images I = (I1, . . . , Im). This extension is straightforward and has been applied to the Chan-Vese model [20]. For this extension it is assumed that the channels Ij are independent, so that the total a-posteriori probability density pi(I) of region Ωi is the product of the separated probability densities pi,j(Ij). Thus the Chan-Vese model (2) for vector images reads:

$$E(\varphi) = -\int_\Omega H(\varphi) \sum_{j=1}^m \log p_{1,j}\, d\Omega - \int_\Omega (1 - H(\varphi)) \sum_{j=1}^m \log p_{2,j}\, d\Omega + \nu \int_\Omega |\nabla H(\varphi)|\, d\Omega \qquad (5)$$
which yields the following Euler-Lagrange equation:

$$\frac{\partial \varphi}{\partial t} = \delta(\varphi)\left[\sum_{j=1}^m \log\frac{p_{1,j}}{p_{2,j}} + \nu\,\operatorname{div}\left(\frac{\nabla \varphi}{|\nabla \varphi|}\right)\right]. \qquad (6)$$
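As an illustration of this data term, the following NumPy sketch (our own, not the authors' implementation) estimates Parzen-like histogram densities per channel and accumulates the per-pixel sum of log-likelihood ratios appearing in Eq. (6); the function names, bin count and value range are illustrative assumptions.

```python
import numpy as np

def histogram_density(channel, mask, bins=64, value_range=(0.0, 1.0), eps=1e-8):
    """Histogram-based (Parzen-like) density estimate of one feature channel
    inside the region given by a boolean mask."""
    hist, edges = np.histogram(channel[mask], bins=bins, range=value_range, density=True)
    idx = np.clip(np.digitize(channel, edges) - 1, 0, bins - 1)
    return hist[idx] + eps          # per-pixel density values p_{i,j}(I_j(x))

def bayesian_data_term(channels, fg_mask):
    """Per-pixel sum over channels of log(p_{1,j}/p_{2,j}) as in Eq. (6),
    assuming independent feature channels."""
    bg_mask = ~fg_mask
    term = np.zeros(channels[0].shape)
    for ch in channels:
        p1 = histogram_density(ch, fg_mask)
        p2 = histogram_density(ch, bg_mask)
        term += np.log(p1) - np.log(p2)
    return term                      # positive values favour the foreground region
```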
Because of their independence, the pi,j can be estimated for each region i and each channel j separately. In other words, for each pixel in the image the foreground probability $\sum_{j=1}^{m} \log p_{1,j}$ and the background probability $\sum_{j=1}^{m} \log p_{2,j}$ are computed over all feature channels j. Possible image features to be incorporated by means of this model are color, texture [6, 21] and motion [7].

2.2 Dempster-Shafer Evidence Theory
The Dempster-Shafer evidence theory, also called evidence theory, was first introduced in the late 60s by A.P. Dempster [14], and formalized in 1976 by G. Shafer [15].
This theory is often described as a generalization of the Bayesian theory to represent inaccuracy and uncertainty information at the same time. The basic idea of the evidence theory is to define a so-called mass function on a hypotheses set Ω, also called the frame of discernment. Let the hypotheses set Ω be composed of n single mutually exclusive subsets Ωi, symbolized by:

$$\Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_n\}. \qquad (7)$$

In order to express a degree of confidence for each element A of the power set ℘(Ω), it is possible to associate an elementary mass function m(A) which indicates all confidence assigned to this proposition. The mass function m is defined by:

$$m : \wp(\Omega) \to [0, 1] \qquad (8)$$

For all hypotheses, m must fulfill the following conditions:

$$\text{(i)}\;\; m(\emptyset) = 0, \qquad \text{(ii)}\;\; \sum_{A_n \subseteq \Omega} m(A_n) = 1. \qquad (9)$$
In this context, the quantity m(A) can be interpreted as the belief strictly placed on hypothesis A. This quantity differs from a Bayesian probability function in that the totality of the belief is distributed not only over the simple classes, but also over the composed classes. This modeling expresses the impossibility of dissociating several hypotheses. This characteristic is the principal advantage of the evidence theory, but it also represents the principal difficulty of this method. If m(A) > 0, hypothesis A is a so-called focal element. The union of all the focal elements of a mass function is called the core N of the mass function:

$$N = \{A \in \wp(\Omega) \mid m(A) > 0\}. \qquad (10)$$
From the basic belief assignment m, a belief function Bel : ℘(Ω) → [0, 1] and a plausibility function Pl : ℘(Ω) → [0, 1] can be defined as

$$Bel(A) = \sum_{A_n \subseteq A} m(A_n), \qquad Pl(A) = \sum_{A \cap A_n \neq \emptyset} m(A_n), \qquad (11)$$

with An ∈ ℘(Ω). Bel(A) is interpreted as the total belief committed to hypothesis A, that is, the mass of A itself plus the mass attached to all subsets of A. Bel(A) then is the total positive effect the body of evidence has on a value being in A. It quantifies the minimal degree of belief of the hypothesis A. A particular characteristic of the Dempster-Shafer evidence theory (one which makes it different from probability theory) is that if Bel(A) < 1, then the remaining evidence 1 − Bel(A) need not necessarily refute A (i.e., support its negation Ā). That is, we do not have the so-called additivity rule Bel(A) + Bel(Ā) = 1. Some of the remaining evidence may be assigned to propositions which are not disjoint from A, and hence could be plausibly transferable to A in the light of new information. This is formally represented by the plausibility function Pl (see Eq. (11)). Pl(A) is the mass of hypothesis A and the
mass of all sets which intersect with A, i.e., those sets which might transfer their mass to A. Pl(A) is the extent to which the available evidence fails to refute A. It quantifies the maximal degree of belief of hypothesis A. The relation between mass function, belief function and plausibility function is described by:

$$m(A) \leq Bel(A) \leq Pl(A) \qquad \forall A \in \wp(\Omega) \qquad (12)$$

Dempster's rule of combination. The Dempster-Shafer theory has an important operation, Dempster's rule of combination, for pooling evidence from a variety of features. This rule aggregates two independent bodies of evidence defined within the same frame of discernment into one body of evidence. Let m1 and m2 be two mass functions associated with two independent bodies of evidence defined in the same frame of discernment. The new body of evidence is defined by

$$m(A) = m_1(A) \otimes m_2(A) = \frac{\displaystyle\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - \displaystyle\sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)}. \qquad (13)$$
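As a concrete reference for Eqs. (11) and (13), here is a minimal Python sketch of belief, plausibility, and Dempster's rule for mass functions stored as dictionaries over frozenset hypotheses; the representation and function names are our own illustrative choices, not part of the paper.

```python
from collections import defaultdict

def belief(m, A):
    """Bel(A): sum of m(A_n) over all subsets A_n of A (Eq. 11)."""
    return sum(mass for A_n, mass in m.items() if A_n <= A)

def plausibility(m, A):
    """Pl(A): sum of m(A_n) over all A_n intersecting A (Eq. 11)."""
    return sum(mass for A_n, mass in m.items() if A_n & A)

def combine(m1, m2):
    """Dempster's rule of combination (Eq. 13) for two mass functions
    on the same frame of discernment (assumes they are not totally conflicting)."""
    fused, conflict = defaultdict(float), 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            if B & C:
                fused[B & C] += mB * mC
            else:
                conflict += mB * mC          # mass falling on the empty set
    return {A: mass / (1.0 - conflict) for A, mass in fused.items()}

# Example on the two-phase frame {Omega1, Omega2}:
O1, O2 = frozenset({'fg'}), frozenset({'bg'})
m1 = {O1: 0.5, O2: 0.2, O1 | O2: 0.3}
m2 = {O1: 0.6, O2: 0.1, O1 | O2: 0.3}
m = combine(m1, m2)
print(belief(m, O1), plausibility(m, O1))    # Bel(Omega1) <= Pl(Omega1), cf. Eq. (12)
```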
Dempster’s rule of combination computes a measure of agreement between two bodies of evidence concerning various propositions discerned from a common frame of discernment. Since Dempster’s rule of combination is associative, we can combine information coming from more than two feature channels.
3 Dempster-Shafer Evidence Theory for Variational Image Segmentation
To introduce the proposed method, which uses the Dempster-Shafer evidence theory to fuse information arising from different feature channels, we consider the following example. Let I = (I1, I2) be a vector image with two feature channels and ϕt with t ≥ 0 be a contour dividing the image into foreground and background. To minimize the energy functional (5), foreground and background probabilities need to be computed over both feature channels. Using the Bayesian model and disregarding the smoothing term, a pixel is defined as foreground if

$$\sum_{i=1}^{2} \log \frac{p_{1,i}}{p_{2,i}} > 0. \qquad (14)$$

Our proposed method uses the Dempster-Shafer theory of evidence to fuse the information coming from the feature channels. Therefore, we have to define the mass function m : ℘(Ω) → [0, 1]. In case of a two-phase segmentation, the frame of discernment becomes Ω = (Ω1, Ω2), where Ω1 denotes the object or foreground and Ω2 the background. We propose to define the mass function mj for all hypotheses as:

$$m_j(\Omega_1) = p_{1,j}(I(x)), \quad m_j(\Omega_2) = p_{2,j}(I(x)), \quad m_j(\emptyset) = 0, \quad m_j(\Omega) = 1 - \big(p_{1,j}(I(x)) + p_{2,j}(I(x))\big), \qquad (15)$$
Fig. 2. Proposed image segmentation paradigm based on combining the variational framework and the Dempster-Shafer theory of evidence: for each image channel j ∈ {1, . . . , m} the densities p1,j and p2,j are computed, the mass functions mi(A) are derived for A ∈ 2Ω, the information is fused according to Dempster's rule of combination, and ϕ(tn+1) is computed.
for j ∈ {1, 2}. In practice, the defined mass functions fulfill Eq. (9); if p1,j + p2,j > 1, a straightforward normalization ensures the condition. Dempster's rule of combination (Eq. (13)) fuses the two bodies of evidence to compute a measure of agreement between both mass functions. We obtain the new mass function m = m1 ⊗ m2 with Dempster's rule of combination (13). In case of a two-phase segmentation, the total belief committed to a focal element Ωi is equal to the belief strictly placed on Ωi. Thus we obtain

$$Bel(\Omega_i) = m(\Omega_i) \;\text{ for } i \in \{1, 2\} \quad\text{and}\quad Bel(\Omega) = 1. \qquad (16)$$
If we use the total belief committed to a focal element and disregard the smoothing term, a pixel is defined as foreground if

$$\log \frac{Bel(\Omega_1)}{Bel(\Omega_2)} > 0. \qquad (17)$$

The plausibility function Pl(Ωi), which quantifies the maximal degree of belief of the hypothesis Ωi, defines a pixel as foreground if

$$\log \frac{Pl(\Omega_1)}{Pl(\Omega_2)} = \log \frac{m(\Omega_1) + m(\Omega)}{m(\Omega_2) + m(\Omega)} > 0. \qquad (18)$$

Our proposed method uses the total belief committed to the foreground or background region. Figure 2 shows the general segmentation paradigm to evolve a contour combining the variational framework and the Dempster-Shafer evidence theory.
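To make the per-pixel fusion of Eqs. (15)–(17) concrete, the following sketch (our own illustration, reusing the combine function from the sketch above) builds one mass function per feature channel from its foreground/background density values and returns the log-ratio of the fused beliefs; all names are hypothetical.

```python
import numpy as np
from functools import reduce

O1, O2 = frozenset({'fg'}), frozenset({'bg'})
OMEGA = O1 | O2

def channel_mass(p1, p2):
    """Mass function of one feature channel at one pixel (Eq. 15),
    normalized if p1 + p2 > 1."""
    s = p1 + p2
    if s > 1.0:
        p1, p2 = p1 / s, p2 / s
    return {O1: p1, O2: p2, OMEGA: 1.0 - (p1 + p2)}

def belief_log_ratio(channel_densities, eps=1e-12):
    """log(Bel(Omega1)/Bel(Omega2)) at one pixel (Eq. 17), given pairs
    (p_{1,j}, p_{2,j}), one per feature channel."""
    masses = [channel_mass(p1, p2) for p1, p2 in channel_densities]
    m = reduce(combine, masses)               # Dempster's rule, Eq. (13)
    # In the two-phase case Bel(Omega_i) = m(Omega_i), Eq. (16).
    return np.log(m.get(O1, 0.0) + eps) - np.log(m.get(O2, 0.0) + eps)
```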
Fig. 3. (a) Illustration of two foreground probability densities (solid red line) and two background probability densities (dashed blue line), (b) product of the two separated densities (Bayesian model), (c) mass functions (given by Dempster’s rule of combination (13)); Using the evidence theory to fuse the densities enlarges the area of the supported foreground
In general, let I = (I1, . . . , Ik) be a vector image with k feature channels. Assuming independent feature channels and using the Bayesian model, the total a-posteriori probability of a region is the product of the separated probability densities. With the Dempster-Shafer theory of evidence, the information arising from the k feature channels is fused using Dempster's rule of combination, resulting in:

$$m = m_1 \otimes m_2 \otimes \ldots \otimes m_k, \qquad (19)$$

where mj is defined by Equation (15). Now, the total belief committed to a region is used as the a-posteriori probability of that region. Thus, the new energy functional of our proposed method is given by:

$$E(\varphi) = -\int_\Omega H(\varphi) \log Bel(\Omega_1)\, d\Omega - \int_\Omega (1 - H(\varphi)) \log Bel(\Omega_2)\, d\Omega + \nu \int_\Omega |\nabla H(\varphi)|\, d\Omega \qquad (20)$$
which yields the following Euler-Lagrange equation:

$$\frac{\partial \varphi}{\partial t} = \delta(\varphi)\left[\log\frac{Bel(\Omega_1)}{Bel(\Omega_2)} + \nu\,\operatorname{div}\left(\frac{\nabla \varphi}{|\nabla \varphi|}\right)\right]. \qquad (21)$$
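A possible discretization of Eq. (21) is the explicit gradient-descent step sketched below (our own minimal version; the regularized delta function, the finite-difference curvature and the step sizes are illustrative assumptions, not the authors' exact scheme).

```python
import numpy as np

def curvature(phi, eps=1e-8):
    """div(grad(phi)/|grad(phi)|) via central differences."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx**2 + gy**2) + eps
    return np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

def evolve(phi, log_belief_ratio, nu=1.0, dt=0.5, steps=200, sigma=1.5):
    """Explicit gradient descent on Eq. (21); the data term is the per-pixel
    log(Bel(Omega1)/Bel(Omega2)) map."""
    for _ in range(steps):
        delta = sigma / (np.pi * (sigma**2 + phi**2))   # one common regularized delta
        phi = phi + dt * delta * (log_belief_ratio + nu * curvature(phi))
    return phi                       # foreground = {x : phi(x) >= 0}
```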
Figure 3 shows the differences between both models and clarifies the effect of the new method. Given the two foreground and two background probability densities in Figure 3(a), the product of the separated probability densities is shown in Figure 3(b) and the total belief function coming from Dempster's rule of combination in Figure 3(c). Obviously, the effect of the foreground probability in Figure 3(c) is larger than in Figure 3(b), since the Dempster-Shafer theory of evidence supports high probabilities whereas the Bayesian model supports small probabilities. Intuitively this means that the Bayesian model searches for feature channels not supporting a region, whereas the Dempster-Shafer theory searches for feature channels that support a specific region. Properties of the Bayesian model, e.g., to include shape priors [22], can be transferred straightforwardly to the proposed framework by adding terms to the new energy functional (20).
4 Experimental Results
We introduced a segmentation method which integrates the Dempster-Shafer theory of evidence into a variational segmentation framework. To show the effect of the new approach, we present some results of our proposed method and compare them to segmentation results using the Bayesian framework to fuse information arising from different feature channels. We evaluate our method on real images taken from the Berkeley segmentation dataset [23], our own real images, as well as on synthetic textured images from the Prague texture segmentation data-generator and benchmark [24]. In most of our experiments we used the CIELAB color channels and the smoothing term. This implies that we do not integrate additional information such as texture or shape priors in most of our experiments. To demonstrate the advantages of our framework we integrated texture features for the experiments on the synthetic textured images. A situation in which the advantages of the proposed method become apparent is the synthetic example shown in Figure 4. While the segmentation using the Bayesian model computes different segmentation boundaries for the two initializations, the proposed method converges to almost identical results for both initializations. The reason for the different segmentation results with the initialization in Figure 4d is an inadequate foreground statistic for one of the two feature channels (red and green). For the upper part of the object that has to be segmented, this inadequate feature channel (green) supports the foreground with very small probabilities. More formally:

$$\exists j \in \{1, 2\} \;|\; p_{1,j}(x) \approx 0 \quad \forall x \in A, \qquad (22)$$
where A is the upper part of the object. Using the Bayesian model we obtain

$$p_1(x) = \prod_j p_{1,j}(x) \approx 0. \qquad (23)$$
Fig. 4. (a) and (d): two different initializations for the segmentation methods; (b) and (e): final segmentations using the Bayesian model; (c) and (f): final segmentations using the Dempster-Shafer evidence theory. While the proposed method converges to almost identical results, the Bayesian model gets stuck in two different local minima, due to one inadequate foreground histogram. Evaluating the energy of the results shows that the energy of (e) is bigger than that of the other ones.
Fig. 5. Precision-Recall diagram. Circles mark the performance using the Dempster-Shafer theory of evidence, x marks the performance using Bayes. The black marks indicate the mean performance (Dempster-Shafer: 0.93, 0.83; Bayes: 0.81, 0.82).
Because of the noisy background, these statistics have a larger standard deviation than the foreground statistics (see also Figure 3). For our example we obtain

$$p_{1,1}(x) \in [0.024; 0.040], \quad p_{1,2}(x) \approx 0, \quad p_{2,1}(x) \in [0.006; 0.004], \quad p_{2,2}(x) \in [0.007; 0.006], \qquad (24)$$
for x ∈ A, which results in

$$p_2(x) = \prod_j p_{2,j}(x) > p_1(x) \quad \forall x \in A. \qquad (25)$$
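As a quick numeric check of this example (using representative values chosen by us from the intervals in Eq. (24)), the sketch below, which reuses channel_mass and combine from the earlier sketches, reproduces the behaviour discussed next: the Bayesian product favours the background, while the fused Dempster-Shafer masses favour the foreground.

```python
# Representative per-channel densities at a pixel x in A (cf. Eq. (24)):
p11, p12 = 0.03, 1e-4      # foreground densities of channels 1 and 2
p21, p22 = 0.005, 0.006    # background densities of channels 1 and 2

# Bayesian fusion (product of channel densities, Eqs. (23)/(25)):
print(p11 * p12 < p21 * p22)                          # True -> x assigned to the background

# Dempster-Shafer fusion (Eqs. (15), (13), (16)):
m = combine(channel_mass(p11, p21), channel_mass(p12, p22))
print(m[frozenset({'fg'})] > m[frozenset({'bg'})])    # True -> x assigned to the foreground
```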
Apparently the first feature channel (red) is not considered, even though it supports the foreground with the highest probability. With the Dempster-Shafer evidence theory this problem is solved elegantly, since in the aforementioned case we have

$$Bel(\Omega_1) > Bel(\Omega_2) \quad \forall x \in A. \qquad (26)$$

This may be explained by the fact that the Dempster-Shafer evidence theory assigns more support to high probabilities, whereas the Bayesian model more strongly supports small probabilities (see also Figure 3). The same effect of the evidence theory can be observed in real images, where the Bayesian model does not segment parts of the object because one feature channel is approximately zero (e.g., Figure 1, or the tent in Figure 7). Conversely, the proposed method converges to a good segmentation in these regions because the two other channels strongly support the foreground. To evaluate and demonstrate the impact of the new method, we chose 47 images from the Berkeley segmentation dataset [23]. This subset was chosen for practical reasons, because most of the images are not suited for variational image
segmentation. To measure the performance we calculated precision and recall for each of the images, which are defined by:

$$Precision = \frac{|G \cap R|}{|G \cap R| + |R \setminus G|}, \qquad Recall = \frac{|G \cap R|}{|G \cap R| + |G \setminus R|}, \qquad (27)$$
where G is the ground-truth foreground object and the region R is the object segment derived from the segmentation. As ground-truth foreground we used the manually segmented ground truth from [25], which is publicly available. For both frameworks we used the same manually selected initialization for each image, which was typically a rectangle inside the object. We decided to use manual initializations, since both methods find local minimizers of the given energy functional. The regularization parameter was chosen slightly differently between the frameworks but remained the same for all images. The probability densities were modeled as nonparametric Parzen estimates. The performance on all images is shown in Figure 5. Using the Bayesian framework to fuse the information leads to a mean precision of 0.81 and a mean recall of 0.82, while the mean precision using the proposed Dempster-Shafer theory is 0.92 and the mean recall is 0.83. Figures 7 and 1 show some example segmentations on images from the Berkeley segmentation dataset. Using our proposed method to fuse the different information helps to segment and separate the semantically interesting and different regions much better by searching for features that support the foreground region or the background region. Analyzing the lower right example of Figure 7 shows that the Bayesian model segments parts of the swamp because one channel does not support the background. Sometimes the proposed model leads to slightly worse segmentations; e.g., in the flower image, some parts of the blossoms are not segmented. The reason why the Bayesian model segments these parts is one channel that does not support the background. Example segmentations integrating texture features are given in Figure 6. Again the proposed method separates the interesting regions much better than the Bayesian approach.
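For reference, Eq. (27) can be evaluated on boolean masks as in the short sketch below (our own helper, assuming non-empty ground-truth and result masks).

```python
import numpy as np

def precision_recall(ground_truth, result):
    """Precision and recall of Eq. (27) for boolean foreground masks G and R."""
    G, R = ground_truth.astype(bool), result.astype(bool)
    tp = np.count_nonzero(G & R)                        # |G ∩ R|
    precision = tp / (tp + np.count_nonzero(R & ~G))    # denominator |G ∩ R| + |R \ G|
    recall = tp / (tp + np.count_nonzero(G & ~R))       # denominator |G ∩ R| + |G \ R|
    return precision, recall
```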
Fig. 6. Segmentation results using either the probabilistic Bayesian model (a, c) or the Dempster-Shafer evidence theory (b, d) to fuse color and texture information
Fig. 7. Images taken from the Berkeley dataset [23]. The left image shows the segmentation results using the Bayesian model to fuse the CIELAB color channels; the right image shows the results using the Dempster-Shafer theory of evidence to fuse the information. Each example shows that the evidence theory can help to improve the segmentation.
5 Conclusion
In this paper we proposed the Dempster-Shafer evidence theory as an extension of the Bayesian model for the task of level set based image segmentation. The main property of this theory is to combine information arising from different feature
channels by modeling inaccuracy and uncertainty at the same time. It therefore allows us to fuse this information and resolve conflicts. In several experiments on real and synthetic images we demonstrated the properties and advantages of using the Dempster-Shafer evidence theory for image segmentation. We strongly believe that this theory is of high interest for other applications in the fields of Computer Vision or Machine Learning as well. Acknowledgement. This work is partially funded by the German Research Foundation (RO 2497/6-1).
References

1. Mumford, D., Shah, J.: Boundary detection by minimizing functionals. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 22–26. IEEE Computer Society Press, Springer (1985)
2. Chan, T., Vese, L.: Active contours without edges. IEEE Transactions on Image Processing 10, 266–277 (2001)
3. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988)
4. Osher, S., Paragios, N.: Geometric level set method in imaging, vision, and graphics. Springer, Heidelberg (2003)
5. Zhu, S.C., Yuille, A.: Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 884–900 (1996)
6. Rousson, M., Brox, T., Deriche, R.: Active unsupervised texture segmentation on a diffusion based feature space. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, pp. 699–704 (2003)
7. Cremers, D., Soatto, S.: Motion competition: A variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision 62, 249–265 (2004)
8. Malladi, R., Sethian, J.A., Vemuri, B.C.: Shape modeling with front propagation: A level set approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 158–175 (1995)
9. Mena, J.B., Malpica, J.A.: Color image segmentation using the Dempster-Shafer theory of evidence for the fusion. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (34) 3/W8, pp. 139–144 (2003)
10. Chaabane, S., Sayadi, M., Fnaiech, F., Brassart, E.: Color image segmentation based on Dempster-Shafer evidence theory. In: 14th IEEE Mediterranean Electrotechnical Conference, pp. 862–866 (2008)
11. Chaabane, S.B., Sayadi, M., Fnaiech, F., Brassart, E.: Dempster-Shafer evidence theory for image segmentation: application in cells images. International Journal of Signal Processing (2009)
12. Adamek, T., O'Connor, N.E.: Using Dempster-Shafer theory to fuse multiple information sources in region-based segmentation. In: 2007 IEEE International Conference on Image Processing, pp. 269–272. IEEE, Los Alamitos (2007)
13. Rombaut, M., Zhu, Y.: Study of Dempster-Shafer theory for image segmentation applications. Image and Vision Computing 20, 15–23 (2002)
14. Dempster, A.P.: A generalization of Bayesian inference. Journal of the Royal Statistical Society, Series B (Methodological) 30, 205–247 (1968)
15. Shafer, G.: A mathematical theory of evidence. Princeton University Press, Princeton (1976)
16. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. International Journal of Computer Vision 72, 195–215 (2007)
17. Cremers, D., Schmidt, F.R., Barthel, F.: Shape priors in variational image segmentation: Convexity, Lipschitz continuity and globally optimal solutions. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2008)
18. Paragios, N., Rousson, M., Ramesh, V.: Matching distance functions: A shape-to-area variational approach for global-to-local registration. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 775–789. Springer, Heidelberg (2002)
19. Heiler, M., Schnörr, C.: Natural image statistics for natural image segmentation. International Journal of Computer Vision 63, 5–19 (2005)
20. Chan, T., Sandberg, B.Y., Vese, L.: Active contours without edges for vector-valued images. Journal of Visual Communication and Image Representation 11, 130–141 (2000)
21. Brox, T., Weickert, J.: A TV flow based local scale estimate and its application to texture discrimination. Journal of Visual Communication and Image Representation 17, 1053–1073 (2006)
22. Rousson, M., Paragios, N.: Shape priors for level set representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 78–92. Springer, Heidelberg (2002)
23. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int'l Conf. Computer Vision, vol. 2, pp. 416–423 (2001)
24. Haindl, M., Mikeš, S.: Texture segmentation benchmark. In: Proceedings of the 19th International Conference on Pattern Recognition, ICPR, pp. 1–4 (2008)
25. McGuinness, K., O'Connor, N.E.: A comparative evaluation of interactive segmentation algorithms. Pattern Recognition 43, 434–444 (2010)
Gait Analysis of Gender and Age Using a Large-Scale Multi-view Gait Database

Yasushi Makihara, Hidetoshi Mannami, and Yasushi Yagi

Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
Abstract. This paper describes video-based gait feature analysis for gender and age classification using a large-scale multi-view gait database. First, we constructed a large-scale multi-view gait database in terms of the number of subjects (168 people), the diversity of gender and age (88 males and 80 females between 4 and 75 years old), and the number of observed views (25 views) using a multi-view synchronous gait capturing system. Next, classification experiments with four classes, namely children, adult males, adult females, and the elderly were conducted to clarify view impact on classification performance. Finally, we analyzed the uniqueness of the gait features for each class for several typical views to acquire insight into gait differences among genders and age classes from a computer-vision point of view. In addition to insights consistent with previous works, we also obtained novel insights into view-dependent gait feature differences among gender and age classes as a result of the analysis.
1 Introduction
Gait is a basic and well-trained human behavior that includes both static (e.g., height, body shape) and dynamic (e.g., stride, arm swing, walking posture) components and has attracted much interest in terms of its application in, for example, person identification[1][2][3][4][5][6], prognosis of surgery[7], and diagnosis of brain disorders[8]. In addition, the classification of gender and age has raised expectations of the potential application of gait to areas such as detection of wandering elderly in a care center and collection of customer information at shopping malls. Some studies have addressed gender classification[9][10][11] and estimation of age [12][13]. Although these studies reached reasonable conclusions, the databases used considered only one aspect, either gender or age, and included images captured from only a few views. To deal with the challenges involved, a multi-view analysis of a database that includes a wide range of genders, ages, and views is required. Using such a database, analysis of both gender and age, which has not been undertaken in previous works, would also be possible. In this paper, we address video-based multi-view classification and analysis of gender and age classes using a large-scale multi-view gait database. Our main contributions are as follows.
The large-scale multi-view gait database. A large population of subjects (168 people), including both genders (88 males and 80 females) and a wide range of ages (between 4 and 75 years old), enables us to study and evaluate applications that require a diverse group of subjects. Moreover, with images from 25 different views (2 heights × 12 azimuth directions, plus an overhead view), multi-view applications can be supported. Feature analysis for gender and age classes. The uniqueness of gait features for four typical classes, namely children, adult males, adult females, and the elderly, is analyzed with respect to view, and static and dynamic components. Frequency-domain features[14], representing average silhouette, and asymmetric and symmetric motion are exploited for analysis from a computer-vision point of view, since these capture effectively the 2D spatial information of the dynamic, as well as static components of gait in a more compact representation than a raw silhouette sequence[5]. We show that some insights from our feature analysis of gender and age classes are consistent with those from other gait research studies, and that some new insights are also obtained as a result of our large-scale and multi-view feature analysis. The remainder of this paper is organized as follows. Section 2 summarizes related works. Section 3 describes the gait capturing system and the constructed gait database. Gait features used in this analysis are explained in Section 4. Section 5 presents the classification experiments for gender and age classes. Differences in the classes are analyzed in Section 6, while Section 7 compares these differences to the results of the experiments. Finally, Section 8 concludes the paper.
2 Related Work
Gait-based gender and age classification. In gender classification, Kozlowski et al.[15] introduced the possibility of automatic gender classification using subjective tests. Li et al.[10] confirmed the possibility of gender classification through experiments with the HumanID dataset[5], the subjects of which were mainly in their 20’s. Various studies have also investigated the possibility of age classification. Davis[13] presented a classification of children (3-5 years old) and adults (30-52 years old), while Begg et al.[12] presented a similar classification of adults (avg. 28.4, SD 6.4) and the elderly (avg. 69.2, SD 5.1). These two studies used comparatively small samples of gender and age, but more importantly, they considered only one aspect, either gender or age. Gait features. With respect to gait features, Kozlowski et al.[15] and Davis[13] used light points attached to a subject’s body, while Begg[12] proposed the Minimum Foot Clearance, defined as the distance between the ground surface and the foot. Since all these methods require subjects to wear various devices, they are not suitable for real applications. On the other hand, various methods using gait images as contact-less information, have also been proposed. These can
roughly be divided into model-based and appearance-based methods. In model-based methods, body parts are represented as various shapes (e.g., the ellipse model[16] and stick model[17]) and gait features are extracted as parameters of these shapes. These model-based methods often run into difficulties with model fitting, and their computational costs are comparatively high. Of the appearance-based methods, the width vector[3], frieze pattern[18], and self similarity plot[1] discard 2D spatial information, while the averaged silhouette[19][20][10] degrades dynamic information. On the other hand, a raw silhouette sequence[5] contains both 2D spatial and dynamic information, but lacks compactness of representation. Although most of the appearance-based methods treat only side-view or near side-view gait images, gait features captured from other views are also informative (e.g., gait-based person identification[21] and action recognition[22]). Huang et al.[9] proposed a multi-view gender classification, but the number of views was insufficient for an extensive analysis of multi-view effects. Gait image databases. With respect to gait image databases, there are three large databases (the HumanID dataset[5], the Soton database[23], and the CASIA dataset[24]) that include over 100 subjects each. These are widely used in studies on gait-based person identification and gender classification. Each database, however, has its own particular limitations making it unsuitable for our purposes. The HumanID dataset[5] has relatively small view changes (about 30 deg.) and includes neither children nor the elderly. The Soton database[25] contains only single-view images, while most of the subjects included in the CASIA dataset[24] are in their 20's or 30's.
3 The Gait Database

3.1 Multi-view Synchronous Gait Capturing System
Our image capturing system consists mainly of a treadmill, 25 synchronous cameras (2 layers of 12 encircling cameras and an overhead camera with a mirror), and six screens surrounding the treadmill, as shown in Fig. 1. The frame rate and resolution of each camera are set to 60 fps and VGA, respectively. The surrounding screens are used as a chroma-key background. Sample images captured in the system are also shown in Fig. 1.

3.2 Large-Scale Gait Database for Gender and Age
For analyzing gender and age effects on gait, the database used must include subjects with a wide range of ages for each gender. We aimed to collect at least 10 subjects of each gender for every 10 year age group. Subjects were obtained by open recruitment and signed a statement of consent regarding the use of their images for research purposes. As a result, our database includes 168 subjects (88 male and 80 female), with ages ranging from 4 to 75 years old, (see Fig. 2 for the age distribution). A portion of the database has already been made available to the public[26]. The number of subjects exceeds that of the previous
Fig. 1. Overview of multi-view synchronous gait capturing system
Fig. 2. Distribution of subjects’ gender and age
largest database and is still growing in the light of further research. Due to the reported difference between walking on the ground and on a treadmill [27], we attempted to minimize any potential difference by providing enough time for each subject to practice walking on the treadmill. After the practice sessions, subjects were asked to walk at 4 km/h or slower if necessary for children and the elderly. Subjects wore standard clothing (long-sleeved shirts and long pants, or their own casual clothes).
4 Gait Feature Extraction
In this section, we describe the gait feature extraction from the captured images. Here we define gait as a cycle of walking that includes a pair of left and right heel-strikes.
Fig. 3. Samples of frequency-domain features captured by each camera: (a) left view, (b) frontal view, (c) overhead view, each shown for the 0-, 1- and 2-times frequencies. In this paper, frequencies between 0 and 2 are used, where each frequency in the side view, for example, denotes the following: 0-times represents the average silhouette, 1-time represents the asymmetry of the left and right motion, and 2-times represents the symmetry thereof.
Since our purpose is the analysis of visual appearance, it is preferable to use a gait feature that captures 2D spatial information for both static and dynamic features. Consequently, we selected a frequency-domain feature[14] because 1) it contains both 2D spatial and dynamic information, and 2) it can be expressed in a more compact form than a raw silhouette sequence, as shown in the three frequency-channel images in Fig. 3. The procedure for frequency-domain feature extraction comprises four steps:

1. Construction of the Gait Silhouette Volume (GSV) by size normalization and registration of silhouette images (64 by 44 pixels in this paper).
2. Gait period detection by maximizing the normalized autocorrelation of the GSV along the temporal axis.
3. Amplitude spectra calculation through a one-dimensional Discrete Fourier Transformation (DFT) along the temporal axis, with the detected gait period used as the base period for the DFT.
4. Application of Singular Value Decomposition (SVD) for dimension reduction of the frequency-domain feature.

A more detailed description of this process can be found in [14]. While the actual height information is obviously important for discriminating children from other age groups, the estimation of the actual height needs camera calibration in advance. Because not all CCTV cameras in the world are calibrated, requiring the actual height may be an unreasonable assumption. We therefore address the more challenging problem of classifying gender and age using an uncalibrated camera, with height-normalized gait features.
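To illustrate steps 2 and 3 above, the following NumPy sketch (our own simplified version, not the authors' code) detects the gait period by normalized autocorrelation and takes the per-pixel DFT amplitudes at the 0-, 1- and 2-times frequencies; the search range for the period (in frames at 60 fps) and the omission of the SVD step are our simplifying assumptions.

```python
import numpy as np

def gait_period(gsv, min_lag=20, max_lag=70):
    """Gait period as the temporal lag maximizing the normalized
    autocorrelation of the silhouette volume (step 2). gsv: (T, H, W)."""
    x = gsv.reshape(gsv.shape[0], -1).astype(float)
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, min(max_lag, x.shape[0] - 1) + 1):
        a, b = x[:-lag], x[lag:]
        score = np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def frequency_feature(gsv, period, freqs=(0, 1, 2)):
    """Per-pixel DFT amplitude spectra over one gait period (step 3)."""
    spectrum = np.fft.fft(gsv[:period].astype(float), axis=0)   # 1D DFT along time
    return np.stack([np.abs(spectrum[k]) for k in freqs])       # shape (3, H, W)
```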
5 Preliminary Experiments on Gender and Age Classification

5.1 Method
We carried out a classification experiment with four typical gender and age classes, namely children (younger than 15 years old), adult males and adult females (males and females between 15 and 65 years old, respectively), and the elderly (aged 65
years and older). Classification performance is given for each view. For each classification experiment, 20 subjects of each class were randomly chosen for both the gallery and test (probe) images. The k-Nearest-Neighbors (k-NN) algorithm was used as the classifier in LDA space[28], with the parameter of k-NN set to k = 3 based on preliminary experiments. Finally, the Correct Classification Rate (CCR) was calculated from 20 combinations of trials, and the averaged values were used in the evaluation.
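A minimal sketch of this evaluation protocol is given below, using scikit-learn's standard LDA and k-NN (the original work cites Otsu's discriminant analysis [28], so this is an approximation with assumed library calls rather than the authors' implementation).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def correct_classification_rate(X_gallery, y_gallery, X_probe, y_probe, k=3):
    """CCR of a k-NN classifier (k = 3) applied in LDA space."""
    lda = LinearDiscriminantAnalysis().fit(X_gallery, y_gallery)
    knn = KNeighborsClassifier(n_neighbors=k).fit(lda.transform(X_gallery), y_gallery)
    return np.mean(knn.predict(lda.transform(X_probe)) == y_probe)

# Averaging this CCR over 20 random gallery/probe draws (20 subjects per class)
# corresponds to the evaluation described above.
```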
5.2 Classification of the Four Classes
Classification performance for the four classes was evaluated independently for each view. According to the results shown in Fig. 4(a), the performance for different elevation angles shows little difference, while the performance of the overhead view is similar to the average performance of the other views. In cases where weak perspective can be assumed, opposite-view cameras give horizontally inverted silhouette images[29], resulting in the same CCR. On the other hand, this assumption does not hold for our gait measurement system, as the distance from the subject to the camera is not long enough compared to the width of the subject. Therefore, opposite-view cameras do not always achieve the same CCR, as shown in Fig. 4(a).

5.3 Classification of Specific Combinations of Classes
The classification performance of each view feature was evaluated with respect to specific combinations of classes, namely children and adults (C-A), adult
Fig. 4. Performance of gender and age classification for each view: (a) CCR for each camera for the four classes; (b) CCR for each camera for specific combinations of classes. The horizontal axis denotes the azimuth view from the subject, where front corresponds to 0 deg., right side to 90 deg., and so on. The vertical axis denotes CCR. In (a), the separate graphs depict the cameras in layer 1, layer 2, and the overhead camera. In (b), results of layer 1 and the overhead view are shown.
males and adult females (AM-AF), and adults and the elderly (A-E). Here the number of gallery and probe images was set to 10 for each. Figure 4(b) shows the classification results for each view. We can see that view effects on the performance are significant, and therefore multi-view analyses of gait features are meaningful.
6 Feature Analysis of Gender and Age Classes
In this section, we analyze the uniqueness of gait for each class. Figure 5 illustrates the average gait features for each class for typical views (side, front, right-back, and overhead), where views from layer 1 cameras only are shown, since the elevation angle causes little difference as discussed in the previous section. Figure 6 shows a comparison of these features for combinations of classes used in the classification experiments depicted in Fig. 4(b) for the same views as in Fig. 5. According to the above, the following trends are observed for each class. Children Posture, arm swing, and relative size of the head are specific features for children. Regarding posture, children tend to walk with their bodies tilted forwards, because they are often looking down at the ground, and this trend is distinctive in the side and overhead views. The captured area is larger in the overhead view compared with that of a straight posture, while the tilted angle can easily be observed in the side view. The arm swings of children tend to be smaller than those of adults since walking in children is less mature.
Fig. 5. Average gait features for each class (rows: C, AM, AF, E; columns: (a) left, (b) front, (c) right-back, (d) overhead views). The features are shown with their 1- and 2-times frequency multiplied 3 times for highlighting.
Fig. 6. Differences in average features (rows: C-A, AM-AF, A-E; columns: (a) left, (b) front, (c) right-back, (d) overhead views). Color is used to denote which class's feature appears more strongly. Red indicates that the feature of the leftmost class (e.g., C of C-A) appears more strongly, while green depicts the opposite. The features are shown with their 1- and 2-times frequency multiplied 3 times for highlighting.
This feature is observed in the side and overhead views. Since the relative head size of children is larger than that of adults, children have larger heads and smaller bodies as a volume percentage against the whole body. These features are observed especially in the frontal views; arms appear lower, the size of the head area is larger and the body shape is wider. Adult males and females The main differences between the genders are observed in the body frame and stride. Regarding body frame, males have wider shoulders while females have more rounded bodies. These features are particularly noticeable in the frontal and side views. Differences are observed in the shoulder area in the frontal view, and in the breast and hair area in side views. Since males are taller than females on average, the walking stride of a male appears narrower in a side view when the images showing individuals walking at constant speed are normalized with respect to height. Related to physical strength, while the walking motion of adults is well balanced and more symmetrical, that of children and the elderly tends to be more asymmetrical. Therefore, adults have much lower values in the 1-time frequency in the side view, which represents asymmetry of gait, compared with children or the elderly. This asymmetry plays an important role in age classification. The elderly Body width, walking posture, and arm swings are distinctive features of the elderly. Due to middle-age spread, they have wider bodies than adults, which is clearly observed in the frontal view. This feature is, however, different from the similar feature in children. The elderly have wider bodies, but the size of their head is almost the same as in adults. Posture is discriminative in side views, which show a stooped posture. It is also observed that the elderly have larger arm swings. The reason seems to be that elderly subjects consciously
swing their arms in the same way as brisk walkers exercising. In fact, the interviews confirmed that most of the elderly subjects do some form of daily exercise, notably walking. These trends also explain the classification results. According to Fig. 4(b), the performance of the overhead view for AM-AF and A-E is not as high as that of most other views. The reason is obvious from Fig. 5 in that gait features such as leg movement and posture, are not apparent in this view. However, in the children-adults classification, the relative performance of the overhead view to other views is better compared with the other classifications, as the dominant differences are the tilting posture and size of head. This indicates that the effect of the view on classification performance may differ depending on the class.
7 Discussion
Comparison with observations from previous works. Observations from previous works include results showing significant differences in the stride and breast area between genders[10], and that the difference in stride contributes to gender classification[17]. These observations are also evident in our research, as shown in Fig. 6. Moreover, quite similar observations of gait feature differences between genders are reported in [11]. Yu et al.[11] analyzed the gender difference in the Gait Energy Image (GEI)[20], corresponding to the 0-times component of our frequency-domain feature, and calculated the ANOVA F-statistic value between genders as shown in Fig. 7(a). The ANOVA F-statistic[30] is a measure for evaluating the discriminative capability of different features. The greater the F value, the better the discriminative capability. Because the difference at 0-times frequency (Fig. 7(b)) has similar properties to the F value of the GEI, a similar part of the image dominates in both approaches, that is, hair, back, breast area, and legs, which enhances the validity of our results. Note that our frequency-domain feature provides additional information on the dynamic component (1-time and 2-times frequency), as well as the static component, as shown in Fig. 7(b). In addition, we obtained further insights, such as view dependencies of classification performance and observed gait features, which previous works did not analyze. Consequently, our analyses are more comprehensive than those of previous works with respect to gender and age, dynamic components, and view. Classification performance. The classification results show that the more diverse the subjects are in terms of gender and age, the more difficult the classification becomes. Though our classification results were 80% for AM-AF and 74% for C-A, the accuracy improves to 91% for AM-AF and 94% for C-A if the adult age range is limited to between 25 and 34 years. This improved accuracy correlates with that obtained for other state-of-the-art systems, 90% for AM-AF[9], 95.97% for AM-AF[11], and 93-95% for C-A[13], although the detailed distribution and the number of subjects differ. This indicates that current methods could be improved to realize some feasible applications such as detecting lost children in airports and collecting customer statistics for stores.
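For readers who want to reproduce this kind of per-pixel analysis, a small sketch of the two-group one-way ANOVA F-statistic over pixel-wise gait features is given below (our own helper on hypothetical arrays, not the exact computation of [11] or [30]).

```python
import numpy as np

def fstat_map(group_a, group_b, eps=1e-12):
    """Per-pixel one-way ANOVA F-statistic between two groups of gait
    features, each array shaped (num_subjects, H, W)."""
    na, nb = group_a.shape[0], group_b.shape[0]
    mean_a, mean_b = group_a.mean(axis=0), group_b.mean(axis=0)
    grand = (na * mean_a + nb * mean_b) / (na + nb)
    ss_between = na * (mean_a - grand) ** 2 + nb * (mean_b - grand) ** 2     # df = 1
    ss_within = ((group_a - mean_a) ** 2).sum(axis=0) + ((group_b - mean_b) ** 2).sum(axis=0)
    return ss_between / (ss_within / (na + nb - 2) + eps)   # larger -> more discriminative
```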
Fig. 7. Comparison of differences between genders: (a) F-statistic for GEI [11]; (b) difference in frequency-domain features (ours).
Application of view dependency. Based on our research, CCRs obtained with individual single views differ from each other, and CCRs obtained with a whole-view feature are superior to the average CCR obtained with an individual single view. Thus, classification performance can be improved by selecting an optimal combination of views for real applications. For example, combining the overhead and side views will ensure good performance in the classification of children; moreover, the installation of overhead cameras seems to be feasible for surveillance purposes, making this combination a good candidate for detecting lost children.
8 Conclusion
We have described video-based gait feature analysis for gender and age classification using a large-scale multi-view gait database. We started by constructing a large-scale multi-view gait database using a multi-view synchronous gait capturing system. The novelty of the database is not only its size, but also the fact that it includes a wide range of gender and age combinations (88 males and 80 females between 4 and 75 years old) and a large number of observed views (25 views). Gait features for four classes, namely children (younger than 15 years old), adult males (15 to 65 years old), adult females (15 to 65 years old) and the elderly (65 years and older), were analyzed in terms of views, and static and dynamic components, and insights regarding view changes and classification were acquired from a computer-vision point of view. A part of our gait database has already been published on a website[26], and new gait data will be added in the future. Future directions for this research are as follows:
– Modeling the aging effect to extend the application of gait, for example, to person identification across a number of years, and the synthesis of subjects' gait for any given age.
– Optimal combination of views for better gender and age classification, considering the obtained view dependency on the classification.
Acknowledgement. This work was supported by Grant-in-Aid for Scientific Research(S) 21220003.
References

1. BenAbdelkader, C., Culter, R., Nanda, H., Davis, L.: Eigengait: Motion-based recognition of people using image self-similarity. In: Proc. of Int. Conf. on Audio and Video-based Person Authentication, pp. 284–294 (2001)
2. Bobick, A., Johnson, A.: Gait recognition using static activity-specific parameters. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 423–430 (2001)
3. Cuntoor, N., Kale, A., Chellappa, R.: Combining multiple evidences for gait recognition. In: Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 3, pp. 33–36 (2003)
4. Horita, Y., Ito, S., Kanade, K., Nanri, T., Shimahata, Y., Taura, K., Otake, M., Sato, T., Otsu, N.: High precision gait recognition using a large-scale PC cluster. In: Proc. of the IFIP International Conference on Network and Parallel Computing, pp. 50–56 (2006)
5. Sarkar, S., Phillips, J., Liu, Z., Vega, I., Grother, P., Bowyer, K.: The HumanID gait challenge problem: Data sets, performance, and analysis. Trans. of Pattern Analysis and Machine Intelligence 27, 162–177 (2005)
6. Liu, Z., Sarkar, S.: Improved gait recognition by gait dynamics normalization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 863–876 (2006)
7. Su, F., Wu, W., Cheng, Y., Chou, Y.: Fuzzy clustering of gait patterns of patients after ankle arthrodesis based on kinematic parameters. Medical Engineering and Physics 23, 83–90 (2001)
8. Liston, R., Mickelborough, J., Bene, J., Tallis, R.: A new classification of higher level gait disorders in patients with cerebral multi-infarct states. Age and Ageing 32, 252–258 (2003)
9. Huang, G., Wang, Y.: Gender classification based on fusion of multi-view gait sequences. In: Proc. of the 8th Asian Conf. on Computer Vision, vol. 1, pp. 462–471 (2007)
10. Li, X., Maybank, S., Yan, S., Tao, D., Xu, D.: Gait components and their application to gender recognition. Trans. on Systems, Man, and Cybernetics, Part C 38, 145–155 (2008)
11. Yu, S., Tan, T., Huang, K., Jia, K., Wu, X.: A study on gait-based gender classification. IEEE Trans. on Image Processing 18, 1905–1910 (2009)
12. Begg, R.: Support vector machines for automated gait classification. IEEE Trans. on Biomedical Engineering 52, 828–838 (2005)
13. Davis, J.: Visual categorization of children and adult walking styles. In: Proc. of the Int. Conf. on Audio- and Video-based Biometric Person Authentication, pp. 295–300 (2001)
14. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Gait recognition using a view transformation model in the frequency domain. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 151–163. Springer, Heidelberg (2006)
15. Kozlowski, L., Cutting, J.: Recognizing the sex of a walker from a dynamic point-light display. Perception and Psychophysics 21, 575–580 (1977)
16. Lee, L., Grimson, W.: Gait analysis for recognition and classification. In: Proc. of the 5th IEEE Conf. on Face and Gesture Recognition, vol. 1, pp. 155–161 (2002)
17. Yoo, J., Hwang, D., Nixon, M.: Gender classification in human gait using support vector machine. In: Advanced Concepts for Intelligent Vision Systems, pp. 138–145 (2006)
18. Liu, Y., Collins, R., Tsin, Y.: Gait sequence analysis using frieze patterns. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 657–671. Springer, Heidelberg (2002)
19. Veres, G., Gordon, L., Carter, J., Nixon, M.: What image information is important in silhouette-based gait recognition? In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 776–782 (2004)
20. Han, J., Bhanu, B.: Individual recognition using gait energy image. Trans. on Pattern Analysis and Machine Intelligence 28, 316–322 (2006)
21. Sugiura, K., Makihara, Y., Yagi, Y.: Gait identification based on multi-view observations using omnidirectional camera. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 452–461. Springer, Heidelberg (2007)
22. Weinland, D., Ronfard, R., Boyer, E.: Automatic discovery of action taxonomies from multiple views. In: Proc. of the 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, New York, USA, vol. 2, pp. 1639–1645 (2006)
23. Shutler, J., Grant, M., Nixon, M., Carter, J.: On a large sequence-based human gait database. In: Proc. of the 4th Int. Conf. on Recent Advances in Soft Computing, Nottingham, UK, pp. 66–71 (2002)
24. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: Proc. of the 18th Int. Conf. on Pattern Recognition, Hong Kong, China, vol. 4, pp. 441–444 (2006)
25. Nixon, M., Carter, J., Shutler, J., Grant, M.: Experimental plan for automatic gait recognition. Technical report, Southampton (2001)
26. OU-ISIR gait database, http://www.am.sanken.osaka-u.ac.jp/GaitDB/index.html
27. Lee, S.J., Hidler, J.: Biomechanics of overground vs. treadmill walking in healthy individuals. Journal of Applied Physiology 104, 747–755 (2008)
28. Otsu, N.: Optimal linear and nonlinear solutions for least-square discriminant feature extraction. In: Proc. of the 6th Int. Conf. on Pattern Recognition, pp. 557–560 (1982)
29. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Which reference view is effective for gait identification using a view transformation model? In: Proc. of the IEEE Computer Society Workshop on Biometrics 2006, New York, USA (2006)
30. Brown, M.B., Forsythe, A.B.: The small sample behavior of some statistics which test the equality of several means. Technometrics 16, 129–132 (1974)
A System for Colorectal Tumor Classification in Magnifying Endoscopic NBI Images

Toru Tamaki, Junki Yoshimuta, Takahishi Takeda, Bisser Raytchev, Kazufumi Kaneda, Shigeto Yoshida, Yoshito Takemura, and Shinji Tanaka

Hiroshima University
Abstract. In this paper we propose a recognition system for classifying NBI images of colorectal tumors into three types (A, B, and C3) of structures of microvessels on the colorectal surface. These types have a strong correlation with histologic diagnosis: hyperplasias (HP), tubular adenomas (TA), and carcinomas with massive submucosal invasion (SM-m). Images are represented by a Bag-of-features of SIFT descriptors densely sampled on a grid, and then classified by an SVM with an RBF kernel. A dataset of 907 NBI images was used for experiments with 10-fold cross-validation, and a recognition rate of 94.1% was obtained.
1 Introduction
In this paper, we report results of our prototype system for automatic classification of colorectal cancer in images taken by a Narrow-band Imaging (NBI) system. Our system consists of a Bag-of-features representation of images with densely sampled multi-scale SIFT descriptors and an SVM classifier with an RBF kernel. In the following subsections, we describe the motivation and importance of the system and related work. Then, the details of the system and experiments are explained in Sections 2 and 3.
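As a rough illustration of such a pipeline (not the authors' implementation; the OpenCV/scikit-learn calls, grid step, descriptor scales and codebook size are our own assumptions), dense multi-scale SIFT descriptors can be pooled into bag-of-features histograms and fed to an RBF-kernel SVM as sketched below.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def dense_sift(gray, step=8, sizes=(8, 16, 24)):
    """SIFT descriptors sampled densely on a grid at several scales."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(s))
                 for s in sizes
                 for y in range(step, gray.shape[0] - step, step)
                 for x in range(step, gray.shape[1] - step, step)]
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors

def bag_of_features(descriptors, codebook):
    """Normalized histogram of visual words for one image."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# Training sketch: fit the codebook on descriptors pooled from training images,
# build one histogram per image, then train the classifier, e.g.:
# codebook = KMeans(n_clusters=500).fit(np.vstack(train_descriptors))
# clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(train_histograms, train_labels)
```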
Motivation
In 2008, colorectal cancer was the third-leading cause of cancer death in Japan, and the leading cause for death for Japanese women over the last 6 years [1]. In the United States, a report estimates that 49,920 people have died from colorectal cancer in 2009 [2]. WHO has released projections [3] in which the number of deaths in the world is estimated to be about 780,000 in 2015, and is expected to rise to 950,000 in 2030. Therefore, it is very important to develop computerized systems able to provide supporting diagnosis for this type of cancer. An early detection of colorectal tumors by colonoscopy, an endoscopic examination of the colon, is important and widely used in many hospitals. During colonoscopy, lesions of colorectal tumors are taken from the colon surface, and then histologically analyzed into the following groups [4–6]: hyperplasias (HP), tubular adenomas (TA), carcinomas with intramucosal invasion to scanty submucosal invasion (M-SM-s), carcinomas with massive submucosal invasion (SM-m). R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 452–463, 2011. Springer-Verlag Berlin Heidelberg 2011
A System for Colorectal Tumor Classification
A type
Microvessels are not observed or extremely opaque.
B type
Fine microvessels are observed around pits, and clear pits can be observed via the nest of microvessels.
C type
1
Microvessels comprise an irregular network, pits observed via the microvessels are slightly non-distinct, and vessel diameter or distribution is homogeneous.
2
Microvessels comprise an irregular network, pits observed via the microvessels are irregular, and vessel diameter or distribution is heterogeneous.
3
Pits via the microvessels are invisible, irregular vessel diameter is thick, or the vessel distribution is heterogeneous, and avascular areas are observed.
453
Fig. 1. Classification of NBI magnification for colorectal tumors [6]
HP are benign and not removed because of the risk of bleeding the colonal surface. TA are likely to develop into cancer and sometimes removed during the colonoscopy. M-SM-s are cancers and removed during colonoscopy, while SMm should be treated surgically because they are cancers that deeply invade the colonal surface, and it is difficult to remove them completely by colonoscopy. For minimizing the risk of bleeding the colon and leaving the cancers, it is desirable to determine if the tumor is a SM-m cancer or not before the lesions are taken. To identify histologic diagnosis by using visual appearance of colorectal tumors in endoscopic images, the pit-pattern classification scheme [14, 15] has been used. During a chromoendoscopy, indigo carmine dye spraying or crystal violet staining are used to enhance microscopic appearances of colorectal tumors. Pit patterns are textures of such an enhanced surface, categorized into types I to V. However, a chromoendoscopy requires a cost for dye spraying and prolongs the examination procedure, which leads to the following emerging device. NBI is a recently developed videoendoscopic system that uses three optical filters of red-green-blue sequential illuminations to narrow the bandwidth of the spectral transmittance. NBI provides a limited penetration of light to the mucosal surface and enhances the microvessels and their fine structure on the colorectal surface (see Figure 2). The advantage of NBI is that it allows an
454
T. Tamaki et al.
(a)
(b) Fig. 2. Examples of NBI images of types A (a), B (b), and C3 (c)
A System for Colorectal Tumor Classification
455
(c) Fig. 2. (Continued)
operator to quickly switch a normal image to an NBI image for examining tumors that have characteristic structures of microvessels, while a chromoendoscopy needs dye-spraying. Several categorizations of NBI magnification findings, diagnosis based on magnifying NBI endoscopic images, have been recently developed. Among them, in this paper we use the classification proposed by [6] that divides types of microvessel architectures in an NBI image into type A, B, and C (see Fig. 1). Type C is further divided into three subtypes (C1, C2, and C3) according to detailed texture. This classification has been shown to have a strong correlation with histologic diagnosis [6]. 80% of type A corresponded to HP, and 20% to TA. 79.7% of type B corresponded to TA, and 20.3% to M-SM-s. For type C3, on the other hand, 100% were of SM-m. Therefore, it is very important to detect type C3 among other types, not just as a two-class problem that divides samples into type A and B. 1.2
Related Work
Previously several research works have been reported on pit-pattern [7–13], and NBI images [16, 17]. Existing approaches for classifying chromoendoscopy images using the pitpattern classification include: texture analysis with wavelet transforms [7, 9], Gabor wavelets [8], histogram [10], color wavelets followed by LDA [11] and
456
T. Tamaki et al.
neural networks [12]. Despite the extensive development by these works, chromoendoscopy has gradually become less important because of the spread of NBI endoscopy. In contrast, only a few attempts have been made to classify NBI images of colorectal polyps [16, 17]. Stehle et al. [16] segmented blood vessel regions to extract shape features. Gross et al. [17] used Local Binary Patterns as vessel features. However, these works exhibit the following two problems. First, histologic diagnosis of removed lesions were used for supervised labels. These are usually difficult to collect in a large number of training samples, even though strict diagnosis results can be used for good training labels. Second, only HP and TA were classified as a two-class problem, although, as stated above, SM-m is the most important to be detected. In this paper, we examine a method for classifying NBI images of colorectal tumors into types A, B, and C31 . Unlike [16, 17], we do not extract vessel structures for analyzing NBI image features because blood vessel regions are usually very fine and difficult to be correctly segmented. Moreover, the low contrast of NBI endoscopy images and specular highlight regions on the mucosal surfaces may degrade the image quality and might have wavelet-based features instable. Hence we employ a more robust technique: densely sampled SIFT descriptors [18] used in a bag-of-features framework [19]. Images are represented by bag-offeatures and then classified by an SVM classifier. The classification process is described in more detail in the next section.
2
Feature Representation and Classification Method
In this section, we describe the image features representation and the classification method being used. 2.1
Image Representation
We employ the bag-of-features representation of images (see Figure 3). An image is represented by visual words, a histogram of local features, produced by a hierarchical k-means clustering [20]. Here we use the SIFT descriptor [18] as a local feature. The clustering is performed over all training images to generate K clusters (visual words). As suggested by Bosch et al. [21], SIFT descriptors are computed at points on a regular grid with spacing of M pixels, and also at four different scales (5, 7, 9, and 11) of the local patch centered at the point (see Figure 4). All descriptors of 128 dimensions are simply used by the clustering, unlike [21]. While using randomly selected points for computing the descriptors has been reported to show better result [22], we use a regular grid since the texture in NBI images fills the whole image. 1
We exclude type C1 and C2 because the classification is still an ongoing research area and then the difference between type B and C1, C2 is not so well defined.
A System for Colorectal Tumor Classification
Type B
Type A
«
«
Training
Type C3
« « « 66, 95, 47, , 85 « 6« « 䠖䠖 11, 82, 3, , 124 䠖 «
«
Test image
« « « 65, 33, 19, , 101 « 6« « 䠖䠖 52, 51, 32, , 89 䠖
84, 99, 40, , 121 5, 26, 91, , 150 «
«
ĭ
«
«
«
«
«
Į
Ĭ
Description of Local features
Vector quantization
«
« « « 27, 64, 25, , 87 « 6« « 䠖䠖 93, 41, 75, , 8 䠖
457
į
Vector quantization
İ
Feature space Type A
Type B
Type C3
Ĭ ĭ Įįİ
Ĭ ĭ Įįİ
Classifier Ĭ ĭ Įįİ
Histogram
Ĭ ĭ Įįİ
Classification result
Fig. 3. Bag-of-features representation of images
M: grid space
v1=(f1,1, f1,2, ..., f1,128) v2=(f2,1, f2,2, ..., f2,128) v3=(f3,1, f3,2, ..., f3,128) v4=(f4,1, f4,2, ..., f4,128)
Fig. 4. SIFT descriptors of different scales on a grid
2.2
Classifier
As a classifier, we use a support vector machine (SVM) with a radial basis function (RBF) kernel: k(x, x ) = exp(−γ||x − x ||2 ),
γ > 0.
(1)
458
T. Tamaki et al.
An SVM minimizes the objective function min C
N
1 ξn + ||w||2 , 2 n=1
(2)
where ξn is a slack variable for nth sample, C is the regularization parameter, and ||w||2 is the margin of a hyperplane w. An SVM classifier is trained for different values of C and the scaling parameter γ of the RBF kernel by using five fold cross validation. The ranges of the values are C = 2−5 , 2−3 , . . . , 215 and γ = 2−15 , 2−13 , . . . , 23 , then the values that give the best result are chosen. We use the L2 distance as the metric in the RBF kernel while Zhang et al. [23] reported that the χ2 distance is the best among others. Comparisons using different kernels is left for future investigation. To handle the three class problem, the one-against-one strategy is employed. Although the dataset used in the experiment (described later) is unbalanced, currently we do not use any weights for the SVM classifier.
3 3.1
Experimental Results Dataset
We have collected 907 NBI images (359 for type A, 461 for type B, and 87 for type C3) of colorectal tumors as a dataset for this experiment. Examples of images in the dataset are shown in Figure 2. Every image was collected by medical doctors during NBI colonoscopy examinations with approximately the same lighting conditions and image zooming. Therefore, the microvessel structures of the surface of colorectal tumors can be regarded as being approximately the same size in all images. The captured images were then cropped to a rectangle so that the cropped rectangle contains an area in which typical microvessel structures appear. Therefore, the size of the images is not fixed and varies from about 100 × 300 to 900 × 800. Image labels were given by at least two medical doctors who have good experiences on colorectal cancer diagnosis and are familiar with analysis of pit-pattern and NBI classifications. Note that each image corresponds to a colon polyp obtained from a different subject: no images share the same tumor. 3.2
Evaluation
We evaluate the proposed system by 10-fold cross-validation. The dataset of 907 images is randomly split into 10 groups (i.e., each time randomly 7 images are ignored), and 9 groups are used for training and the rest for test. A recognition rate is the ratio of the number of correctly classified images to the number of all images (900), and the rates of the 10 folds are averaged. Also, we provide recall rates specific to type C3. As we stated in the introduction, images of type C3 should be correctly identified because of the high
A System for Colorectal Tumor Classification
7\SH$·V all features
459
27, 64, 25, «, 87 93, 41, 75, «, 8
«
7\SH$·V)HDWXUHVSDFH
NBI Image Vector quantization
7\SH%·V all features 7\SH%·V)HDWXUHVSDFH
7\SH&·V all features 7\SH&·V)HDWXUHVSDFH
Class-wise concatenate histogram
Fig. 5. Visual words by concatenating histograms obtained for each class
correlation to SM-m cancer. This C3-specific recall rate is defined as the ratio of the number of correctly classified type C3 images to the number of all type C3 images. 3.3
Experimental Results
Figure 6(a) shows the classification results obtained when varying the parameters. The horizontal axis is the number of visual words, and each plot corresponds to different number of the grid spacing M for sampling the SIFT descriptors. This shows that smaller spacing M = 10 is better, which means more descriptors are used for generating visual words. Table 1 shows the confusion matrix of the best performance 93.5% when M = 10 with 1024 visual words. Figure 6(c) shows the C3-specific recall rate. Even though the recognition rate increases with the number of visual words, the C3-specific recall rate almost saturates for more than 64 visual words. Hence, using more visual words does not lead to better performance. Instead, decreasing M might be a good strategy. However, smaller M takes much more computational time and memory. Actually, the huge number of descriptors obtained for M = 5 are difficult to be clustered by a standard k-means procedure. Therefore, we have investigated another way of constructing the visual words as shown in Figure 5: a relatively small number of clusters are extracted for each class and then the histograms obtained for each class are concatenated to make a global visual word [23]. This requires much smaller computational cost, while the standard way described in section 2 needs a global clustering of all descriptors of all classes. Figure 7 shows the results obtained when using the concatenation of classwise histograms. Here we use 25, 125, and 625 clusters per class, hence the total number of visual words is three times larger than that. In this case, the result for M = 10 is comparable to that for M = 5, but still better results are obtained
460
T. Tamaki et al. 1 M=10 M=15
recognition rate
0.95
0.9
0.85
0.8
0.75
16
32
64
128
256
512
1024
2048
4096
1024
2048
4096
number of visual words
(a) 1 M=10 M=15
C3 recall rate
0.8
0.6
0.4
0.2
0
16
32
64
128
256
512
number of visual words
(b) Fig. 6. Recognition results. (a) recognition rate. (b) C3-specific recall rate. Table 1. Confusion matrix of the best performance obtained at M = 10 with 1024 visual words. Recognition rate is 93.5%, while C3-specific recall rate is 60.5%. Right table shows precision, recall, and F-measures for each class. A
B C3
A 347 8 3 B 8 443 5 C3 13 21 52
type A B C3
precision 94.3% 93.9% 86.7%
recall F-measure 96.9% 95.6% 97.2% 95.5% 60.5% 71.2%
Table 2. Confusion matrix of the best performance obtained at M = 5 with 1857 visual words (625 clusters per class). Recognition rate is 94.1%, while C3-specific recall rate is 73.6%. Right table shows precision, recall, and F-measures for each class. A
B C3
A 347 5 5 B 12 436 8 C3 8 15 64
type A B C3
precision 94.6% 95.6% 83.1%
recall F-measure 97.2% 95.9% 95.6% 95.6% 73.6% 78.0%
A System for Colorectal Tumor Classification
461
1
recognition rate
0.95
M=10 M=15 M=5
0.9
0.85
0.8
0.75
75
375
1875
number of visual words
(a) 1
C3 recall rate
0.8
M=10 M=15 M=5
0.6
0.4
0.2
0
75
375
1875
number of visual words
(b) Fig. 7. Recognition results with concatenation of class-wise histograms. (a) recognition rate. (b) C3-specific recall rate.
as the number of visual words increases and M decreases. Also, the C3-specific recall rates are improved. The best performance 94.1% was achieved when M = 5 at 1857 visual words, and C3-specific recall rate is 73.6%. The confusion matrix is shown in Table 2.
4
Conclusions
We have reported experimental results for a recognition system for classifying NBI images of colorectal tumors into type A, B, and C3. Using a dataset of 907 NBI images, a method combining densely sampled SIFT descriptors and Bag-of-features has been examined. As a result, the maximum recognition rate was 94.1%. In the experiments, we have shown that visual words constructed by concatenating class-specific descriptor histograms performed better, with lower computational cost than visual words with global clustering of all descriptors.
References 1. Vital Statistics, Ministry of Health, Labour and Welfare, Japan (2009), http://www.mhlw.go.jp/english/database/db-hw/vs01.html 2. Colon and Rectal Cancer, National Cancer Institute, U.S National Institutes of Health (2010), http://www.cancer.gov/cancertopics/types/colon-and-rectal
462
T. Tamaki et al.
3. Global Burden of Disease: 2004 update, Health statistics and informatics Department, World Health Organization (2008), http://www.who.int/evidence/bod 4. Hirata, M., Tanaka, S., Oka, S., Kaneko, I., Yoshida, S., Yoshihara, M., Chayama, K.: Evaluation of microvessels in colorectal tumors by narrow band imaging magnification. Gastrointestinal Endoscopy 66(5), 945–952 (2007) 5. Hirata, M., Tanaka, S., Oka, S., Kaneko, I., Yoshida, S., Yoshihara, M., Chayama, K.: Magnifying endoscopy with narrow band imaging for diagnosis of colorectal tumors. Gastrointestinal Endoscopy 65(7), 988–995 (2007) 6. Kanao, H., Tanaka, S., Oka, S., Hirata, M., Yoshida, S., Chayama, K.: Narrowband imaging magnification predicts the histology and invasion depth of colorectal tumors. Gastrointestinal Endoscopy 69(3), 631–636 (2009) 7. H¨ afner, M., Kwitt, R., Uhl, A., Wrba, F., Gangl, A., V´ecsei, A.: Computer-assisted pit-pattern classification in different wavelet domains for supporting dignity assessment of colonic polyps. Pattern Recognition 42(6), 1180–1191 (2009) 8. Kwitt, R., Uhl, A.: Multi-directional Multi-resolution Transforms for ZoomEndoscopy Image Classification. In: Computer Recognition Systems 2. Advances in Soft Computing, pp. 35–43. Springer, Heidelberg (2007) 9. Kwitt, R., Uhl, A.: Modeling the Marginal Distributions of Complex Wavelet Coefficient Magnitudes for the Classification of Zoom-Endoscopy Images. In: ICCV (2007) 10. H¨ afner, M., Kendlbacher, C., Mann, W., Taferl, W., Wrba, F., Gangl, A., Vecsei, A., Uhl, A.: Pit pattern classification of zoom-endoscopic colon images using histogram techniques. In: Proc. of the 7th Nordic Signal Processing Symposium (NORSIGf 2006), pp. 58–61 (2006) 11. Karkanis, S.A., Iakovidis, D.K., Maroulis, D.E., Karras, D.A., Tzivras, M.: Computer-aided tumor detection in endoscopic video using color wavelet features. IEEE Transactions on Information Technology in Biomedicine 7, 141–152 (2003) 12. Maroulis, D.E., Iakovidis, D.K., Karkanis, S.A., Karras, D.A.: CoLD: A versatile detection system for colorectal lesions in endoscopy video frames. Computer Methods and Programs in Biomedicine 70(2), 151–186 (2003) 13. Hirota, M., Tamaki, T., Kaneda, K., Yosida, S., Tanaka, S.: Feature extraction from images of endoscopic large intestine. In: Proc. of the 14th Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV 2008), pp. 94–99 (2008), http://ir.lib.hiroshima-u.ac.jp/00021053 14. Kudo, S., Hirota, S., Nakajima, T., Hosobe, S., Kusaka, H., Kobayashi, T., Himori, M., Yagyuu, A.: Colorectal tumours and pit pattern. Journal of Clinical Pathology 47(10), 880–885 (1994) 15. Kudo, S., Tamura, S., Nakajima, T., Yamano, H., Kusaka, H., Watanabe, H.: Diagnosis of colorectal tumorous lesions by magnifying endoscopy. Gastrointestinal Endoscopy 44(1), 8–14 (1996) 16. Stehle, T., Auer, R., Gross, S., Behrens, A., Wulff, J., Aach, T., Winograd, R., Trautwein, C., Tischendorf, J.: Classification of Colon Polyps in NBI Endoscopy Using Vascularization Features. In: Proc. of Medical Imaging 2009: ComputerAided Diagnosis, SPIE, vol. 7260, p. 72602S-1.12 (2009) 17. Gross, S., Stehle, T., Behrens, A., Auer, R., Aach, T., Winograd, R., Trautwein, C., Tischendorf, J.: A Comparison of Blood Vessel Features and Local Binary Patterns for Colorectal Polyp Classification. In: Proc. of Medical Imaging 2009: Computer-Aided Diagnosis, SPIE, vol. 7260, p. 72602Q-1.8 (2009) 18. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
A System for Colorectal Tumor Classification
463
19. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual Categorization with Bags of Keypoints. In: ECCV 2004 Workshop on Statistical Learning in Computer Vision (2004) 20. Vedaldi, A., Fulkerson, B.: VLFeat: An Open and Portable Library of Computer Vision Algorithms (2008), http://www.vlfeat.org/ 21. Bosch, A., Zisserman, A., Munoz, X.: Image Classification using Random Forests and Ferns. In: ICCV (2007) 22. Nowak, E., Jurie, F., Triggs, B.: Sampling Strategies for Bag-of-Features Image Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503. Springer, Heidelberg (2006) 23. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. IJCV 73(2), 213–238 (2007)
A Linear Solution to 1-Dimensional Subspace Fitting under Incomplete Data Hanno Ackermann and Bodo Rosenhahn Leibniz University, Hannover {ackermann,rosenhahn}@tnt.uni-hannover.de
Abstract. Computing a 1-dimensional linear subspace is an important problem in many computer vision algorithms. Its importance stems from the fact that maximizing a linear homogeneous equation system can be interpreted as subspace fitting problem. It is trivial to compute the solution if all coefficients of the equation system are known, yet for the case of incomplete data, only approximation methods based on variations of gradient descent have been developed. In this work, an algorithm is presented in which the data is embedded in projective spaces. We prove that the intersection of these projective spaces is identical to the desired subspace. Whereas other algorithms approximate this subspace iteratively, computing the intersection of projective spaces defines a linear problem. This solution is therefore not an approximation but exact in the absence of noise. We derive an upper boundary on the number of missing entries the algorithm can handle. Experiments with synthetic data confirm that the proposed algorithm successfully fits subspaces to data even if more than 90% of the data is missing. We demonstrate an example application with real image sequences.
1
Introduction
In this work, we consider the problem of estimating the linear subspace of a matrix of rank-1 if not all entries of this matrix are known. Since the maximization of a linear homogeneous equation system amounts to fitting a 1-dimensional subspace to the coefficient matrix, the mathematical community already considered the additional difficulty of unknown coefficients in the late 60s and 70s [8, 10, 17, 18, 22, 23]1 . In computer vision, subspace fitting with missing data received considerable attention after the introduction of the factorization algorithm by Tomasi and Kanade [20]. In the same work, the authors also proposed an algorithm which computes left and right subspaces starting with some part of the matrix where all entries are known. This solution is iteratively propagated until all unknown entries are estimated. However, it is likely that the cameras which gave rise to the initial submatrix are in degenerate configuration, hence subsequent estimates would be incorrect. 1
Please notice that for the Wiberg-algorithm, [22] is usually cited. However, this publication is missing, at least in the issue available online.
R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 464–476, 2011. Springer-Verlag Berlin Heidelberg 2011
A Linear Solution to 1D Subspace Fitting under Incomplete Data
465
Jacobs proposed to first estimate the intersections of all triplets of affine subspaces induced by points having unknown coordinates [13]. The subspace is taken to be identical with the space which contains all intersections, i.e. the intersection of intersections. In general, however, even in the absence of any noise, this intersection is empty. Many other approaches in the computer vision community are variants of the approach of Tomasi and Kanade, for instance by integrating statistical reliability measures [7] or optimizing a different cost function [3]. Modified EM-schemes [8, 23] were also proposed, e.g. considering some statistical properties [19], or a different cost [2]. Hartley and Schaffalitzky alternatingly estimate left and right subspaces [11]. Buchanon and Fitzgibbon [4] proposed to improve convergence speed by combining a Gauss-Newton approach [17, 18] with the method of [11]. Recently, it was proposed to integrate problem-specific constraints into EMschemes [1, 16]. A new type of algorithms was recently proposed: heuristics which minimize the l0 -, the l1 -norm, or the nuclear norm allow for the recovery of low-rank matrices with partially missing entries, or even matrices with some entries corrupted [6, 9]. Minimizing a spectral norm implies that the number of non-zero singular values is not kept fixed. Consequently, such algorithms can converge to a solution where the number of non-zero singular values is different as required by the physical model. According to our experience this frequently happens indeed. Relying on convex optimization, they require certain parameters specified apriori, and they are known to be slow if the problem size becomes large. Lastly, their gradientdescent nature makes them susceptible to local minima. Summarizing, Gauss-Newton approaches [4, 17, 18] are susceptible to local minima hence they require good initial solutions. Conversely, EM-schemes [1, 8, 16, 19, 22, 23] or EM-like schemes [11] seem to converge more robustly, yet can require tremendously much time [4]. Approaches generalizing some estimate obtained from some part of the data to the complete data are highly susceptible to noise, larger amounts of missing data and degenerate camera configurations [13, 20]. In this work we consider the special case of subspaces of rank-1. We will prove that for this case a linear solution exists and is unique. Subspace estimation in the presence of missing matrix entries reduces to solving a single linear equation system, hence there is no necessity to iterate. No parameters need be chosen apriori, and the algorithm is immune to local minima. The proposed approach is significantly faster than iterative approaches even for large problems. Experimental evaluation will show that our algorithm estimates accurate subspace even if less than 10% of the data is known. Its usefulness is further demonstrated with several challenging real image sequences. In Section 2 we will prove existence and uniqueness of a linear solution for the estimation of a rank-1 subspace if not all matrix entries are known. Section 3 continues with an experimental evaluation of accuracy using simulated data. A model application with real-image data is presented in Section 4. The paper ends with conclusions in Section 5.
466
2
H. Ackermann and B. Rosenhahn
A Linear Solution
Capitalized letters M are used for scalar constants, bold lower-case letters x for vectors, bold capital letters X for matrices, and calligraphic capital letters A for spaces. The symbol x denotes the transpose of a vector or matrix, and · the L2 -norm. Definition 1. Let S ∈ RM be a linear space of dimensionality 1, and let s be a vector of length 1 spanning S.
If some point x = [ x1 · · · xM ]
lies in the subspace S spanned by s we have 1 x s = 1. x
(1)
Let X be a M ×N matrix whose column vectors consist of N points x1 , . . . , xN . If X is completely known, the vector s spanning the subspace is given by the left singular vector corresponding to the largest singular value of X [14]. In the absence of noise, all singular values except the largest one are identically zero. If not all coordinates of a particular vector xi , i = 1, . . . , n, are known, the set of possible vectors gives rise to an affine space Ai xi = Ai yi + ti .
(2)
Here, the matrix Ai consists of d basis vectors ej where d is the number of unknown coordinates of xi . Each basis vector ej equals 1 at the coordinate corresponding to the missing coordinate of wi and is zero elsewhere. The vector ti consists of the known coordinates of xi and is zero at the coordinates corresponding to the missing coordinates of xi . Vector yi is chosen so that Eq. (2) holds true. Confer to the left plot of Figure 1. Lemma 1. The affine space Ai and subspace S intersect. Proof. Since xi is a point of Ai , by Eq. (1) it also is a point of S, hence Ai and S intersect. Let Pi denote the projective space spanned by Ai and the origin (shown in the right plot of Figure 1). Given basis vectors p1 , . . . , pN spanning Pi we can express any point x on Ai by x = c1 p 1 + · · · + cN p N .
(3)
Theorem 1. The projective space Pi and subspace S intersect. Proof. Ai ∈ Pi and Ai ∩ S, hence Pi ∩ S must hold true. Theorem 2. The intersections S ∩ Pi and S ∩ Pj , i = j, are identical. Proof. By theorem 1 both Pi and Pj intersect with S. Therefore two points xi ∈ Pi and xj ∈ Pj exist which are in S and in Pi and Pj , respectively.
A Linear Solution to 1D Subspace Fitting under Incomplete Data
467
Assuming that two different lines of intersection si and sj of length 1 exist, we have x i /xi si = 1 and xj /xj sj = 1 by Eq. (1). This implies that S must be spanned at least by si and sj which violates Definition 1. Hence the assumption that si = sj must be incorrect. Theorem 2 proves that all projective spaces Pi and Pj , i = j, intersect in s. It also proves the uniqueness of the solution. The right of figures 1 depicts the idea of the proof for 3 points xi , i=1,2,3, having one unknown coordinate each. Here, projective spaces Pi intersect in a single line s on which the points xi are located.
Fig. 1. Left: Three points (red dots) located on a line (blue). Affine spaces induced by unknown coordinates are indicated by red lines. Right: Projective planes (shown in red) through each affine space (red lines) intersect in a single line (blue) which is identical to the subspace on which the data resides.
The remaining question is how to compute s: Let Ni be the orthogonal complement of Pi . A direct consequence of Theorem 2 is that N1 ⊥ s, . . . , NN ⊥ s, i.e. s is orthogonal to all spaces Ni , i = 1, · · · , N . We therefore have ⎡ ⎤ N1 ⎢ .. ⎥ ⎣ . ⎦ s = 0.
(4)
(5)
NN
where Ni denotes the matrix consisting of all normal vectors to Pi . To determine the M variables of s, at least M equations are necessary. Each point xi ∈ RM with ui unknown coordinates induces a projective space of dimension ui + 1. Hence, M − (ui + 1) normal vectors exist for each point xi . In the absence of noise, if N (M − (ui + 1)) ≥ M (6) i=1
the subspace can be estimated.
468
H. Ackermann and B. Rosenhahn
Algorithm 1. Linear 1-Dimensional Subspace Estimation 1: Input: Vectors xi of which not all coordinates are known. 2: Output: The subspace s on which all complete vectors xi would reside. 3: Separate each vector xi into known coordinates ti and unknown coordinates giving rise to an affine space Ai so Eq. (2) holds true. 4: The basis vectors ej of the affine space Ai and the vector ti span the projective space Pi . Compute the orthogonal complement Ni of Pi by SVD. 5: Create matrix N in Eq. (4). 6: Compute the 1-dimensional nullspace of N by SVD according to Eq. (5).
The only necessary assumption taken is that the points xi are indeed located on a line intersecting the origin, so X has rank 1 if all its entries are known. If s does not intersect the origin, i.e. is affine, X has rank 2. The algorithm for 1D subspace fitting under incomplete data is summarized in Algorithm 1.
3
Synthetic Evaluation
For a synthetic evaluation of accuracy and required computation time, we randomly created 1000 points in a 100-dimensional space satisfying the rank-1 property. Minimal and maximal values were about −150 and 150, respectively. We then estimated the subspace using our proposed method and compare it with PowerFactorization [11], an EM-algorithm where subspace and missing data are estimated alternatingly, and an implementation of a nuclear norm minization algorithm (NNM) [5]2 . Our implementation of the EM-algorithm is similar to [19] yet without considering statistical reliability of trajectories. All algorithms were implemented using MatLab. The angular difference between the ground truth and the estimated vector was adopted as measure for the error. We randomly removed entries from the data matrix consisting of all points. We gradually increased the number of entries to be removed in steps of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 98%. To evaluate the robustness three different levels of independent, normally distributed noise were added to the data. Standard deviations of noise were set to 0, 0.1, and 1.0. The last level is equivalent to a strong pixel noise of about 3%. For each combination of noise and sampling ratio, all four algorithms were executed ten times and the average angular error and computation time computed. Since PowerFactorization and the EM-approach are susceptible to local minima we randomly initialized them 50 and 5 times, respectively, and took the best result. According to our experience, choosing more than 50 or 5 repetitions does not change the optimal result. All computations were performed on an Intel Core2 Quad CPU with 3.0GHz. Figure 2a shows the average angular errors if there was no noise in the data (left), small pixel noise with standard deviation σ = 0.1 (middle), and strong noise with σ = 1.0. The x-axis shows the amount of missing entries in percent, the y-axis the average angular error. 2
The code is generously provided at svt.caltech.edu.
50 40 30 20
50 40 30 20
30% 50% 70% 90% 98% Amount Missing Entries [Percent]
10%
600 500 400 300 200 100 30% 50% 70% 90% 98% Amount Missing Entries [Percent]
600 500 400 300 200 100 10%
469
70 60 50 40 30 20 10 10%
30% 50% 70% 90% 98% Amount Missing Entries [Percent]
Computation Time [Seconds]
Computation Time [Seconds]
60
10
700
10%
70 Accuracy [Degrees]
60
10 0 10%
(b)
Accuracy [Degrees]
70
Computation Time [Seconds]
(a)
Accuracy [Degrees]
A Linear Solution to 1D Subspace Fitting under Incomplete Data
30% 50% 70% 90% 98% Amount Missing Entries [Percent]
30% 50% 70% 90% 98% Amount Missing Entries [Percent]
600 500 400 300 200 100 10%
30% 50% 70% 90% 98% Amount Missing Entries [Percent]
Fig. 2. (a) Average angular difference between ground-truth and estimation for several sampling ratios. The red dash-dotted line indicates PowerFactorization, the green dashed line the EM-algorithm, the cyan dotted line nuclear norm minimization and the blue solid line the proposed method. (b) Average computation time in seconds. The proposed method needs between 0.43sec and 0.84sec. Even for a sampling ratio of 98% where PowerFactorization is fastest, our method is 25 times faster while being much more accurate. Nuclear norm minization is between 12 and 17 times slower for all sampling ratios and noise levels while being significantly less accurate even for relatively complete data.
The red dash-dotted line indicates the error due to PowerFactorization, the green dashed line the EM-method, the cyan dotted line NNM, and the solid blue line our algorithm. As can be seen, for less than 80% of missing entries, PowerFactorization and the EM-method perform approximately same as good as the proposed method, only NNM is between 1.4 and almost 10000 times less accurate. Beyond 80% missing data, the proposed algorithm is much more accurate than all other algorithms. NNM regularly fails to estimate a subspace of dimension 1, but computes a higher-dimensional solution. The average computation times are shown in Fig. 2b. The left plot in Fig. 2b shows the computation time in seconds on the y-axis if the data is noise-free, the middle plot if the noise is small (σ = 0.1), and the right plot shows results for strong noise (σ = 1.0). The proposed procedure requires less between 0.43sec and 0.84sec of time for all three experiments. Our algorithm is between 12 and 17 times faster than NNM, and between 25 and 240 times faster than PowerFactorization. For all combinations of sampling ratios and noise, the proposed method is at least same as accurate while being magnitudes faster and immune to local minima. Nuclear norm minimization usually fails to estimate the correct subspace from simple data like the one shown in Fig. 1.
470
4
H. Ackermann and B. Rosenhahn
Real-Image Application
In this section we present a real image application. It is known that 3D-reconstruction from point correspondences can be achieved by means of an iterative version of the so-called factorization algorithm [12, 15]. It involves fitting a 4-dimensional subspace to data. However, this algorithm cannot be used if some data is missing. If the camera is calibrated and does not rotate, the rank-4 constraint can be reduced to rank-1. Under incomplete data, the method of Sec. 2 can be applied then. In Sec. 4.1 we shortly review the projective factorization algorithm [12, 15]. In Sec. 4.2 we will consider the case when not all feature points are known in all images. We will show that the standard rank-4 constraint reduces to a rank-1 problem if intrinsic and rotation parameters are known. 3D-reconstructions from challenging out-door sequences are shown in Sec. 4.3. 4.1
Projective Factorization
Assuming a perspective camera model, the ith camera is defined by the matrix Pi = K [ Ri | ti ] .
(7)
Here, vector ti contains the translation parameters of the ith camera, matrix Ri the rotation parameters, and matrix K the intrinsic parameters such as focal length f , principal point p = [ px py ] and photo sensor scales ax and ay ⎤ ⎡ f ax 0 p x (8) K = ⎣ 0 f ay p y ⎦ . 0 0 1 While we assume here that intrinsic parameters are known and fixed throughout the sequence, varying intrinsics are possible as long as they are known which is the case for many real-world applications. The jth homogeneous 3D-point Xj = [ Xj Yj Zj 1 ] is projected onto an image point xij = [ uij vij 1 ] by λij xij = Pi Xj .
(9)
The projective depth λij is a scalar to ensure that the third coordinate of xij equals 1. Given N points in M images, we may arrange all image measurements xij into an observation matrix W ⎤ ⎡ ⎤ ⎡ λ11 x11 · · · λ1N x1N P1 ⎥ ⎢ .. ⎥ ⎢ .. .. .. (10) W =⎣ ⎦ = ⎣ . ⎦ X1 · · · XN . . . .
λM1 xM1 · · · λMN xMN PM X P
With some risk of confusion, denote the stack of all camera matrices by P and the matrix of all homogeneous 3D-points by X. Equation (10) implies that the columns of matrix W span a 4-dimensional subspace if all scalars λij are known.
A Linear Solution to 1D Subspace Fitting under Incomplete Data
Equation (10) may be written as I − P P W = 0,
471
(11)
if the column vectors of P have unit length and are mutually orthogonal. The the identity matrix. Equation (11) can be used to compute symbol Idenotes as left nullspace of W . I − PP Conversely, if P is known, the projective depths λij may be computed by solving (12) I − P P Dj λ j = 0 for λj where Dj denotes a 3M × M matrix consisting of the measurements xij of the jth trajectory on its diagonal, and λj the vector consisting of the corresponding variables λij . If both the variables λij and the camera matrices P are unknown, a projective reconstruction can be estimated by initializing the λij , for instance to 1, and iteratively solving Eqs. (11) and (12). To avoid a trivial solution we impose the constraint that the columns of W should have unit length. If all observations xij are normalized to length 1, this constraint is satisfied if all vectors λj have length 1 which is automatically enforced by standard tools for computing the singular value decomposition. 4.2
Missing Observations
The procedure in Sec. 4.1 may only be used if all observations xij are known across all images. Unfortunately, this is a rather unrealistic requirement. For instance, feature points can disappear due to failure of the tracking software, or due to occlusion in the scene. Conversely, the method introduced in Sec. 2 can fit subspaces to incomplete data yet it requires a rank-1 constraint. In the following, we will show that the rank-4 constraint can be reduced to rank-1 so a projective reconstruction can be computed from data where feature points are missing: In many real-world problems, the intrinsic camera parameters are known or can be easily estimated. Furthermore, in applications like car or robot navigation, information about the rotation of the vehicle is available. If both K and all rotation matrices Ri are known, the only remaining unknown is the
. The known measurements xij are weighted with vector t = t 1 · · · tM −1 ˜ ij are initialized to √13 [ 1 1 1 ] . From Eq. (11) K . Unknown observations x we have the requirement that I − tt W = 0.
(13)
However, Eq. (13) does not sufficiently constrain t. It has also be orthogonal to the known subspace spanned by ⎤ ⎡ R1 ⎥ ⎢ R = ⎣ ... ⎦ . (14) RM
472
H. Ackermann and B. Rosenhahn
Since vector t should be orthogonal to R, it must reside in the orthogonal complement of R. Letting ⎡ ⎤ N1 ⎢ ⎥ N = ⎣ ... ⎦ , (15) NN
where the matrices Ni are the normals to the projective spaces induced by trajectory vectors with missing coordinates. If we combine the orthogonality constraint with Eq. (5) we then obtain (16) I − RR N t = 0. If the vector t has been determined, the missing observations can be estimated by minimizing the distance to the subspace P = [ R t ] I − P P Λj d j = 0 (17)
where Λj denotes a 3M ×3M matrix with vectors [ λij λij λij ] on each triple of its diagonal entries, and dj the vector of all observations x1j · · · xMj . Similar to [1], dj can be separated into known and unknown feature points thus the unknown observations can be estimated by solving a linear equation system. Since the variables λij are not known generally, it is necessary to estimate the ˜ ij . We obtain vector t, the projective depths λij , and all missing observations x a projective reconstruction by iterating the following three steps: estimate t by ˜ ij by Eq. (17), and optimize the Eq. (16), estimate all unknown observations x projective depths λij by Eq. (12). Iterations may be terminated if the difference of reprojection errors M
=
N
1 δij K (xij − N [Pi Xj ]) L i=1 j=1
(18)
between two consecutive iterations becomes sufficiently small. Here, δij equals one if xij was observed and is zero otherwise. The symbol N [·] denotes the scale normalization to make the third coordinate equal to 1. We use the normalization L=
M N
δij .
(19)
i=1 j=1
4.3
Real-Image Sequences
Figures 3a, 4a, and 5a each show six images of three sequences which were taken with the video function of a consumer-level digital camera (Canon PowerShot A530). The images have size 640 × 480 pixels and are of low quality. Correspondences were established using a standard software (voodoo camera tracker3 ). We did not check manually for matching errors, so it is possible that outliers arepresent in the data. Particular trajectories, e.g. such of short length, 3
www.digilab.uni-hannover.de/docs/manual.html
A Linear Solution to 1D Subspace Fitting under Incomplete Data
473
(a)
(b)
Fig. 3. (a) Six images of a 50-image sequence with 1700 trajectories. A total of 40.5% of all feature points (indicated by the red points) is known. The iterative projective factorization converged to a reprojection error of 3.24[px]. (b) 3D-reconstruction using the algorithm proposed in Sec. 4.2. The right wall is planar. The angle between floor and the two walls is not exactly perpendicular, but both walls are almost orthogonal. The depth difference between left wall and wood logs is clearly recognizable.
(a)
(b)
Fig. 4. (a) Six images of a 60-image sequence with 1500 trajectories. A total of 36.7% of all feature points (indicated by the red points) is known. The iterative projective factorization converged to a reprojection error of 3.0[px]. (b) 3D-reconstruction using the algorithm proposed in Sec. 4.2. The left image shows a side view, the middle one a front view, and the right image a view from top. While the curvature of the well is slightly underestimated, it is well perceivable. Well and floor are perpendicular, and the depth ratios between plants (green points) and well are reasonable.
474
H. Ackermann and B. Rosenhahn
(a)
(b)
Fig. 5. (a) Six images of a 60-image sequence with 2000 trajectories. A total of 31.38% of all feature points (indicated by the red points) is known. The iterative projective factorization converged to a reprojection error of 2.3[px]. (b) 3D-reconstruction using the algorithm proposed in Sec. 4.2. The walls of the building, the tree to the left and the stairs in the lower left part of the images can be well perceived. The angle between the two main walls is not exactly rectangular but still looks good.
were not excluded either. Rotational motion is small but present since we did not carefully arrange the camera. This induces some model-noise which is not independently distributed. Intrinsic calibration parameters were determined using [21]. The 3D-reconstructions are not further optimized by a bundle adjustment. The sequence corresponding to Fig. 3a consists of 1700 trajectories which were tracked over 50 images. A total of 40.5% of all feature points are known. The left plot of Fig. 6 indicates where feature points are known and where unknown. The horizontal axis indicates the number of the trajectory, the vertical axis the image number. Feature points missing in some image are marked by black, present features by white. We stopped the projective factorization algorithm after 30 iterations resulting in a reprojection error of 3.24 (pixels). Figure 3b shows three images of the 3D-reconstruction. The outline of the wooden logs in the left part is clearly visible. The angle between ground plane and the two planes is not exactly rectangular, but the angle between both planes is quite orthogonal (shown in the third, top-view image), and the surface of the right wall is almost perfectly planar what can be seen in the first image. The second sequence, corresponding to the six images shown in Fig. 4a, consists of 1500 trajectories over 60 images with 36.7% of the feature points being known. Absence and presence of feature points is shown by the middle plot in Fig. 6. The algorithm converged to a reprojection error of 3.0[px] after 50 iterations. Three images of the 3D-reconstruction of the sequence are shown in Fig. 4b. The well can be seen in the center of the images. While the curvature
A Linear Solution to 1D Subspace Fitting under Incomplete Data
20 30 40 50
10
10
20
20
Image Number
Image Number
Image Number
10
30 40
1000 Feature Number
1500
60
30 40 50
50 500
475
200
400
600 800 1000 1200 1400 Feature Number
60
500
1000 1500 Feature Number
2000
Fig. 6. Presence of the trajectories of the sequences shown in Figs. 3a (left), 4a (middle) and 5a (right). The number of each trajectory is indicated by the horizontal axis, the image number by the vertical axis. White represents known feature points, black absent points.
of the well is underestimated (shown in the third, top-view image), the angle between floor and well is almost rectangular. The depth ratios between plants (green points) and well look reasonable. The last sequence, corresponding to Fig. 5a, consists of 2000 trajectories over 60 images with 31.38% of the feature points tracked (cf. right plot of Fig. 6). The reprojection error was 2.3[px] after 20 iterations. The 3D-reconstruction is shown in Fig. 5b. The structure of the building is clearly visible. The two main walls, the tree in the left part and the stairs in the lower right part can be seen. The top-view image shows that the angle between the two walls is slightly distorted.
5
Conclusions
In this work, we presented an algorithm for fitting a 1-dimensional subspace to incomplete data. We proved that for this special case, a linear solution exists and is unique. Using simulated data we showed that for various levels of noise, the proposed algorithm is more accurate than other methods if more than 80% of the data is missing. It is magnitudes faster while being same as accurate or even more precise. We demonstrated that our method is able to compute 3D-reconstructions of low-quality real-image sequences with large amounts of unobserved correspondences.
References 1. Ackermann, H., Rosenhahn, B.: Trajectory Reconstruction for Affine Structurefrom-Motion by Global and Local Constraints. In: IEEE Computer Vision and Pattern Recognition (CVPR), Miami, Florida, USA (June 2009) 2. Aguiar, P., Xavier, J., Stosic, M.: Spectrally Optimal Factorization of Incomplete Matrices. In: IEEE Computer Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008) 3. Brand, M.: Incremental singular value decomposition of uncertain data with missing values. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 707–720. Springer, Heidelberg (2002)
476
H. Ackermann and B. Rosenhahn
4. Buchanan, A., Fitzgibbon, A.: Damped Newton Algorithms for Matrix Factorization with Missing Data. In: IEEE Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, pp. 316–322 (2005) 5. Cai, J.-F., Cand`es, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956–1982 (2010) 6. Cand`es, E.J., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717–772 (2009) 7. Chen, P., Suter, D.: Recovering the Missing Components in a Large Noisy LowRank Matrix: Application to SFM. IEEE Transactions on Pattern Analyis and Machine Intelligence 26(8), 1051–1063 (2004) 8. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39(1), 1–38 (1977) 9. Fazel, M.: Matrix Rank Minimization with Applications. PhD thesis, Dept. Electrical Engineering, Stanford University (March 2002) 10. Gabriel, K., Zamir, S.: Lower rank approximation of matrices by least squares with any choice of weights. Techonometrics 21(4), 489–498 (1979) 11. Hartley, R., Schaffalizky, F.: PowerFactorization: 3D Reconstruction with Missing or Uncertain Data. In: Australia-Japan Advanced Workshop on Computer Vision (June 2002) 12. Heyden, A., Berthilsson, R., Sparr, G.: An Iterative Factorization Method for Projective Structure and Motion from Image Sequences. Image and Vision Computing 17(13), 981–991 (1999) 13. Jacobs, D.W.: Linear fitting with missing data for structure-from-motion. Computer Vision and Image Understanding 82(1), 57–81 (2001) 14. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science Inc., New York (1996) 15. Mahamud, S., Hebert, M.: Iterative Projective Reconstruction from Multiple Views. In: IEEE Computer Vision and Pattern Recognition (CVPR), Hilton Head, SC, USA, pp. 430–437 (June 2000) 16. Marquez, M., Costeira, J.: Optimal Shape from Motion Estimation with Missing and Degenerate Data. In: IEEE Workshop on Application of Computer Vision (WACV), Copper Mountain, CO, USA (January 2008) 17. Ruhe, A.: Numerical computation of principal components when several observations are missing. Technical report, Dept. Information Processing, University of Umeda, Umeda, Sweden (April 1974) 18. Ruhe, A., Wedin, P.: Algorithms for separable nonlinear least squares problems. Society for Industrial and Applied Mathematics Review 22(3), 318–337 (1980) 19. Sugaya, Y., Kanatani, K.: Extending interrupted feature point tracking for 3D affine reconstruction. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 310–321. Springer, Heidelberg (2004) 20. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision 9(2), 137–154 (1992) 21. Tsai, R.: A versatile camera calibration technique for high-accuracy 3-d machine vision metrology using off-the-shelf cameras and lenses. IEEE Transaction on Robotics and Automation 3(4), 323–344 (1987) 22. Wiberg, T.: Computation of principal components when data are missing. In: Second Symp. on Computational Statistics, Berlin, Germany, pp. 229–236 (1976) 23. Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah (ed.) Multivariate Analysis, pp. 391–420 (1966)
Efficient Clustering Earth Mover’s Distance Jenny Wagner and Bj¨ orn Ommer Interdisciplinary Center for Scientific Computing, Heidelberg University, Germany
[email protected],
[email protected]
Abstract. The two-class clustering problem is formulated as an integer convex optimisation problem which determines the maximum of the Earth Movers Distance (EMD) between two classes, constructing a bipartite graph with minimum flow and maximum inter-class EMD between two sets. Subsequently including the nearest neighbours of the start point in feature space and calculating the EMD for this labellings quickly converges to a robust optimum. A histogram of grey values with the number of bins b as the only parameter is used as feature, which makes run time complexity independent of the number of pixels. After convergence in O(b) steps, spatial correlations can be taken into account by total variational smoothing. Testing the algorithm on real world images from commonly used databases reveals that it is competitive to state-of-theart methods, while it deterministically yields hard assignments without requiring any a priori knowledge of the input data or similarity matrices to be calculated.
1
Introduction
The mathematical concept of a distance between two probability distributions, which later became known as the Wasserstein metric, was formulated by L.N. Vaserstein in 1969 [1]. Twenty years later, the special case of the first Wasserstein metric was applied in image processing for the first time, when [2] used it for their multiple resolution analysis of images. Today, this distance measure, also called Earth Mover’s Distance (EMD), mainly serves as a similarity measure for point sets in feature space in a vast number of different applications. In [3], for instance, the EMD of two colour or texture histograms of different images is defined as a measure of similarity between those images for content-based image retrieval. [4] focus on an efficient large scale EMD implementation for shape recognition and interest point matching, which is also based on a robust histogram comparison. Contrary to all known applications of EMD so far, Efficient Clustering Earth Mover’s Distance (ECEMD) uses EMD to directly separate the feature space into two sets/ classes of points by finding the class assignment configuration for which the EMD between those classes is at maximum. There is no need to calculate large scale similarity matrices and no further (complex) algorithm for clustering is required. Using histograms as feature spaces, all pixels of the image are taken into account, i.e. subsampling of the image or restrictions of the distance to take R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 477–488, 2011. Springer-Verlag Berlin Heidelberg 2011
478
J. Wagner and B. Ommer
into account only a limited number of neighbours, as for large scale similarity matrices, are not necessary. Similar to [5], the algorithm can also be extended to subsequently segment an image into k regions of interest, extracting one class after the other out of the background set. Section 2 introduces the mathematical problem formulation as a convex optimisation problem, the related choice of feature space and discusses the theoretical advantages this definition implicates. Section 3 then covers all implementational aspects, starting from the algorithm that solves the clustering problem to the best initialisation that guarantees robust and fast convergence. In Section 4, the algorithm is applied to selected example images from online databases and the segmentation results are compared to the ground truth, to a standard segmenting approach [6] and a clustering approach that uses similarity measures [5]. Finally, Section 5 summarises the strengths and limits of the method presented here and gives an outlook of future research to improve the algorithm.
2
Problem Formulation and Related Work
Let $n$ be the number of points $x_i$ ($i \in \{1, \ldots, n\}$) in feature space and denote the class labels $c_i = -1$ or $c_i = +1$ for back- and foreground assignment, respectively. Given a weight $w_i$ for each $x_i$, assumed to be a pile of earth of height $w_i$ at point $x_i$, it is denoted by $w_i^F$ if $x_i$ is in the foreground $F$ and $w_i^B$ if $x_i$ is in the background $B$. The optimal class assignments to $F$ and $B$ are chosen such that the EMD between these classes is maximal. The EMD itself can be understood as finding the minimum work required to transport all piles from one class to the other, respecting that the entire amount of earth has to be moved and that each $x_i$ can only acquire or transport a pile up to its own $w_i$. This leads to the optimisation problem of (1) and (2). The first constraint assures that the work, also called flow, $f_{ij}$ between each point $x_i \in F$ and $x_j \in B$ is unidirectional; the second and third that the flow from/to one point $x_i \in F, B$ does not exceed the weight $w_i^{F,B}$ of this point. The fourth forces all weights of one class to be moved.

$$\max_{c_1, \ldots, c_n} \{\mathrm{EMD}(F, B)\} = \max_{c_1, \ldots, c_n} \left\{ \min_{f} \sum_{i=1}^{m} \sum_{j=1}^{n-m} d_{ij} \cdot f_{ij} \right\} \quad \text{s.t.} \tag{1}$$

$$f_{ij} \geq 0, \qquad \sum_{j} f_{ij} \leq w_i^F, \qquad \sum_{i} f_{ij} \leq w_j^B, \qquad \sum_{i} \sum_{j} f_{ij} = \min\Bigl(\sum_{i} w_i^F, \sum_{j} w_j^B\Bigr) \tag{2}$$

The distance $d_{ij}$ between the pairs $x_i, x_j$ of opposite classes in the objective function can be calculated as the Euclidean distance

$$d_{ij} = \|x_i - x_j\|_2^2 \qquad \forall x_i : c_i = 1, \quad \forall x_j : c_j = -1. \tag{3}$$
As can be read off the indices of the sums, m points have been assumed to belong to the foreground class and n − m points to be background for 1 ≤ m < n, m to be determined by the class assignments.
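For concreteness, the inner minimisation of (1)-(2) for a fixed class assignment is a small linear program. The following Python sketch (not part of the original paper; it assumes NumPy/SciPy and illustrative variable names) computes the EMD between two weighted point sets in this way.

```python
import numpy as np
from scipy.optimize import linprog

def emd(xF, wF, xB, wB):
    """EMD between foreground (xF, wF) and background (xB, wB), cf. eqs. (1)-(2)."""
    m, k = len(xF), len(xB)
    # squared Euclidean distances d_ij, cf. eq. (3)
    d = ((xF[:, None, :] - xB[None, :, :]) ** 2).sum(-1)
    # flow variables f_ij are flattened row-wise into a vector of length m*k
    c = d.ravel()
    # sum_j f_ij <= w_i^F  and  sum_i f_ij <= w_j^B
    A_ub = np.vstack([np.kron(np.eye(m), np.ones((1, k))),
                      np.kron(np.ones((1, m)), np.eye(k))])
    b_ub = np.concatenate([wF, wB])
    # sum_ij f_ij = min(sum_i w_i^F, sum_j w_j^B)
    A_eq = np.ones((1, m * k))
    b_eq = [min(wF.sum(), wB.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```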
The problem defined in (1) and (2) aims at constructing a bipartite graph with minimal flow between the nodes of the fore- and background set, such that the fore- and background set have maximum distance. It belongs to the category of EXPTIME problems, as the global optimum can be determined by solving the linear minimisation problem $2^n$ times, once for every possible assignment of the weighted points in feature space to the fore- and background. The idea is similar to maximum margin clustering, which can be formulated as the convex integer optimisation problem introduced in [7]. Yet, while the latter is often relaxed to a semi-definite program for applications or uses random class assignments as in [8], the algorithm presented here does not require a relaxation to soft cluster assignments and deterministically converges to a robust optimum. Contrary to k-means or spectral clustering approaches, all linked by a general optimisation problem in [9], which minimise the intra-class variance, the approach presented here focuses on maximising the inter-class variance.

Since histograms of intensity values have often been used successfully in combination with the EMD, these are also adopted here to show how the algorithm works in principle. To do so, the grey values of the image (without loss of generality assumed to be in the interval [0, 1]) are divided into $n$ (equally distributed) bins, whose centre coordinates are given by $x_i$, $i = 1, \ldots, n$. Then, the weights $w_i^F$ and $w_j^B$ are determined as the normalised number of entries in the bins of the foreground set and the background set. It is important to note that the weights are normalised with respect to the total number of pixels in the histogram and not with respect to the number of pixels in the fore- or background class, as is usually done when calculating the EMD as a similarity feature. The overall normalisation used here accounts for the weight of each $x_i$ in the context of the image and not in the context of the cluster it is currently assigned to. Specifying the objects of interest further, data-adapted features of a different kind are also applicable.

Having determined an optimal cluster configuration that mainly relies on proximity of vectors in feature space, continuous graph cuts as described in [10] can be applied to the resulting segmented image in order to account for spatial correlations in image space as well. This yields a denoised final segmentation in which the boundaries between fore- and background are minimised by assembling contiguous regions, depending on the weight parameter $\lambda$ that controls the degree of smoothing. Then,

$$\inf_{\tilde{c} \in [0,1]^n} \left\langle (c_1, \ldots, c_n)^T, \tilde{c} \right\rangle + \lambda \cdot \mathrm{TV}(\tilde{c}) \tag{4}$$

yields the final integer class labelling vector $\tilde{c}$ ($\tilde{c}_i = 0$ for background and $\tilde{c}_i = 1$ for foreground) of all feature vectors $x_i$.
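As a rough illustration of this post-processing step (not the continuous graph cut of [10] itself), off-the-shelf total-variation denoising can be applied to the hard label image and thresholded; the weight and threshold below are illustrative assumptions.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def tv_smooth_labels(label_image, lam=0.5):
    """Approximate the effect of eq. (4): TV-smooth a binary label image, then re-binarise."""
    smoothed = denoise_tv_chambolle(label_image.astype(float), weight=lam)
    return (smoothed >= 0.5).astype(np.uint8)
```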
3 Implementation
The implementational difficulty lies in solving the maximisation problem in (1), i.e. finding the class assignments, which is an NP-hard combinatorial problem,
while calculating the EMD for a given class labelling amounts to solving a linear program. Without a priori information about the object of interest, the algorithm can start by assigning a single arbitrary feature vector to the foreground, dividing the set of feature vectors into F, consisting of only one point, and B, containing n − 1 points. For these class assignments the EMD is calculated. After that, the nearest neighbours of the feature vector in F, which are stored in a previously calculated array, are subsequently added to F until the EMD of the new cluster configuration becomes smaller than the previous one or all vectors are assigned to F. This procedure is summarised in Algorithm 1.

Algorithm 1. Algorithmic implementation of ECEMD
 1  program (out: F, B) = ECEMD (in: x1, ..., xn)
 2    F ← {x1};                                        {initialise one vector in F}
 3    B ← {x2, ..., xn};                               {rest of vectors in B}
 4    kNN(x1) ← [1NN(x1), 2NN(x1), ..., (n−1)NN(x1)];  {calculate neighbours}
 5    EMD_last ← 0;                                    {initialise EMD variable}
 6    EMD_next ← EMD(F, B);                            {calculate first EMD}
 7    k ← 1;                                           {initialise loop to include nearest neighbours in F}
 8    while (B ≠ ∅) AND (EMD_next ≥ EMD_last) do       {stop when B is empty or the EMD decreases}
 9      F ← F ∪ {kNN(x1)};                             {append next nearest neighbour to F}
10      B ← B \ {kNN(x1)};                             {remove this nearest neighbour from B}
11      EMD_last ← EMD_next;                           {store last EMD solution}
12      EMD_next ← EMD(F, B);                          {calculate next EMD solution}
13      k ← k + 1;
14    end while
15    return F, B;
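A compact Python transcription of Algorithm 1 for the one-dimensional histogram feature space might look as follows. It is a sketch, not the authors' MATLAB implementation; it reuses an EMD solver such as the linear-program sketch from Sect. 2 and assumes the first histogram bin as starting point.

```python
import numpy as np

def ecemd_histogram(gray_image, n_bins=16, emd=None):
    """Greedy ECEMD on an intensity histogram (cf. Algorithm 1).
    `emd` is a callable computing the EMD between two weighted point sets."""
    hist, edges = np.histogram(gray_image, bins=n_bins, range=(0.0, 1.0))
    w = hist / hist.sum()                          # weights normalised by the total pixel count
    x = 0.5 * (edges[:-1] + edges[1:])[:, None]    # bin centre coordinates as 1-D feature vectors
    order = np.argsort(np.abs(x[:, 0] - x[0, 0]))  # nearest neighbours of the starting bin
    fg, bg = [order[0]], list(order[1:])
    emd_last, emd_next = 0.0, emd(x[fg], w[fg], x[bg], w[bg])
    k = 1
    while bg and emd_next >= emd_last:             # stop once the EMD decreases or B is empty
        fg.append(order[k])
        bg.remove(order[k])
        k += 1
        if not bg:
            break
        emd_last = emd_next
        emd_next = emd(x[fg], w[fg], x[bg], w[bg])
    return fg, bg                                  # foreground / background bin indices
```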
To ensure fast and robust convergence in the histogram feature space considered here, the first foreground vector is best defined to be the bin with the smallest or the largest bin coordinate. For the one-dimensional case, using the first or last bin is the best choice, since this results in the desired intensity value separation, with convergence after including those bins in the foreground that lead to an almost equally weighted cluster distribution. In the (multi-dimensional) general case, however, the initialisation can either be done by the user or by a more problem-adapted initialisation, making the approach semi-supervised. Using this k-nearest-neighbour approach takes the similarities in feature space into account but only requires O(n) distance calculations and memory, contrary to algorithms applying pairwise similarity matrices. Extending the idea of EMD clustering to the multi-class case, iteratively extracting classes out of the feature space again yields hard cluster assignments. Since the total variational smoothing is also able to handle several classes, the clustering result can still be plugged into (4) to account for spatial correlation.
4 Experimental Results
For the sake of brevity, the advantages and limits of the algorithm can only be highlighted by a few examples and comparisons. Tuning the method to be optimally adapted to a specific application is left to future work.

Bin number. First, the effect of varying the number of histogram bins is investigated. To do so, the algorithm is applied to the 64 × 64 image shown in Fig. 1 (left) to divide its contents into two classes while the number of bins is varied from 4 to 32.
Fig. 1. Left: input image. Right: dependence of the segmentation results for this image on the number of bins in feature space. Red markers: percentage of pixels correctly assigned to the foreground; blue markers: percentage of pixels correctly assigned to the background.
Comparing the segmentation results with the ground truth, it can be observed that the number of correctly assigned fore- and background pixels is over 93% and varies only in the range of 1% with an increasing number of bins. Counting the number of bins that are assigned to fore- and background, it is interesting to note that the algorithm converges to almost equal numbers of fore- and background bins.

Total variational smoothing. Using the same image with a fixed number of bins (16 in this case), total variational smoothing is applied after convergence, and the dependence of the number of correctly assigned pixels on the smoothing parameter λ is investigated, as summarised in Fig. 2, where λ is varied from 0 to 10. The graphs for the percentage of correctly assigned pixels show an increase in coincidence with the ground truth up to λ = 5; then the percentage decreases. This tendency can be explained by oversmoothing, as can be observed in the segmentations shown in Fig. 3.

Combinatorial solution. Taking into account that the algorithm follows a greedy strategy, the local optimum of the result should be compared to the global optimum of the model defined by (1) and (2). By means of this, it is possible to evaluate the segmentation strength of the model in order to determine whether the ansatz is appropriate in the first place. Furthermore, comparing the result obtained by the greedy algorithm to the global optimum, the approximation gap to the NP-hard problem can be computed. Hence, for a small number of bins, a comparison of the globally optimal solution to the ground truth is performed. The number of bins is limited due to the exponential run time complexity of finding the combinatorial optimum of (1) under the constraints of (2).

Fig. 2. Left: input image. Right: dependence of the segmentation results for this image on the weight parameter λ of the subsequent total variational smoothing. Red markers: percentage of pixels correctly assigned to the foreground; blue markers: percentage of pixels correctly assigned to the background.

Fig. 3. From left to right: segmentation without total variational smoothing, with λ = 0.5 (visually closest to the ground truth), with λ = 5 and with λ = 10.

Fig. 4. Left: input image. Right: dependence of the globally optimal segmentation results for this image on the number of bins in feature space. Red markers: percentage of pixels correctly assigned to the foreground; blue markers: percentage of pixels correctly assigned to the background.
As can be observed from the right-hand side of Fig. 4, the results lie in the same range as the results obtained by the greedy algorithm. Quantifying the gap between the global optimum and the local one found by the algorithm, the deviations are less than 1.5%, where the correct foreground assignments are higher in the case of the greedy algorithm. On average, the greedy and the combinatorial algorithm assign 94.4% of all pixels correctly in the range of 4 to 16 bins. From this it can be concluded that both the model and the algorithm are reasonably chosen in the sense that they yield segmentations close to the ground truth.

Comparison to competitive algorithms. For the comparison of the Efficient Clustering EMD to the ground truth and to competitive approaches by means of the example images from online databases shown below, the number of bins is set to 16 and λ = 0.5. The comparisons are made for the images shown in Tab. 1, calculating the relative confusion matrices
$$C = \begin{pmatrix} \text{true FG} & \text{false FG} \\ \text{false BG} & \text{true BG} \end{pmatrix}. \tag{5}$$

In the implementation of [5] the images were subsampled in order to handle the large amount of data in the similarity matrices, so that only up to 5% of the image pixels determined the segmentation, and a simple Gaussian distance of the RGB colour vectors was chosen as similarity measure between those points. As the results show, the Efficient Clustering EMD yields better recognition rates than the standard approaches in those cases where the simple intensity value histogram features are appropriate to separate fore- from background. Furthermore, total variational smoothing only leads to improvements in recognition of about 1%, as already observed in Fig. 2 (right). In the special case of the tiger (fourth image of Tab. 1), the gain in recognition from applying total variational smoothing is much higher, as ECEMD assigns the orange pixels of the tiger to the foreground and the dark stripes to the background, so that spatial correlation completes the striped coat to one contiguous region. Visually comparing the outcome of ECEMD with total variation to the competitive segmentations, as shown in Fig. 5, the former comes closest to the ground truth (left); detailed confusion matrices for each algorithm can be found in Tab. 1.
Fig. 5. From left to right: ground truth, ECEMD with total variational smoothing, segmentation by [5], segmentation by [6]
Table 1. Example images and comparison results to ground truth in the form of relative confusion matrices for our approaches and the competitive methods; the images can be found in the Appendix. Each cell lists the confusion matrix row-wise as "true FG, false FG | false BG, true BG".

Image details                           own                      own + TV                 [5]                      [6]
holes, 100×100, own creation            0.99 0.16 | 0.01 0.84    1.00 0.05 | 0.00 0.95    1.00 0.31 | 0.00 0.69    0.94 0.03 | 0.06 0.97
horse part, 256×256, Weizmann DB        0.96 0.07 | 0.04 0.93    0.96 0.07 | 0.04 0.93    0.97 0.33 | 0.03 0.67    0.85 0.20 | 0.15 0.80
horse, 717×525, Weizmann DB             0.99 0.24 | 0.01 0.76    0.99 0.24 | 0.01 0.76    0.99 0.53 | 0.01 0.47    0.64 0.53 | 0.36 0.47
tiger, 481×321, Berkeley Segm. DB       0.63 0.03 | 0.37 0.97    0.85 0.05 | 0.15 0.95    0.82 0.33 | 0.18 0.67    0.74 0.43 | 0.26 0.57
llama, 513×371, GrabCut DB              0.14 0.02 | 0.86 0.98    0.27 0.03 | 0.73 0.97    0.77 0.52 | 0.23 0.48    0.93 0.45 | 0.07 0.55
man, 321×481, Berkeley Segm. DB         0.77 0.00 | 0.23 1.00    0.77 0.00 | 0.23 1.00    0.56 0.00 | 0.44 1.00    0.60 0.08 | 0.40 0.92
horses, 481×321, Berkeley Segm. DB      0.88 0.24 | 0.12 0.76    0.89 0.24 | 0.11 0.76    0.99 0.37 | 0.01 0.63    0.72 0.68 | 0.27 0.32
swimmer, 481×321, Berkeley Segm. DB     0.82 0.30 | 0.18 0.70    0.82 0.30 | 0.28 0.70    0.82 0.29 | 0.18 0.71    0.64 0.59 | 0.35 0.41
birds, 481×321, Berkeley Segm. DB       0.95 0.43 | 0.05 0.57    0.96 0.43 | 0.04 0.57    0.92 0.35 | 0.08 0.65    0.18 0.76 | 0.82 0.24
astronauts, 481×321, Berkeley Segm. DB  0.85 0.07 | 0.15 0.93    0.86 0.07 | 0.14 0.93    0.90 0.48 | 0.10 0.52    0.88 0.01 | 0.12 0.99
Multi-class segmentation. Multi-class segmentation can be implemented similarly to [5] by iterative application of the algorithm. At first, one foreground class is extracted; then, the next class is segmented out of the remaining background points until the predefined number of classes is reached. This algorithm leads to the results shown in Fig. 6. Comparing the original images with the segmentation results, good coincidence can be observed, noting that the binning only depends on intensity values. As especially the third and fourth image show, however, total variational denoising or texture-based features could further improve the results.

Fig. 6. Top: original images. Bottom: multi-class segmentation with three classes.

Run time measurements. As the algorithm is implemented in MATLAB, the MATLAB profiler can be used to determine the run time of the algorithm in a last experiment. Scaling the horse head image from its original size of 256 × 256 pixels over 128 × 128, 64 × 64 and 32 × 32 down to 16 × 16 pixels and measuring the overall run time of the algorithm (without variational smoothing but with image loading and histogram creation), it is observed that the total amount of actual CPU time used for the calculations is always less than 2 seconds and independent of the number of pixels on a MacBook Pro Model 3.1 (2.4 GHz Intel Core 2 Duo, 4 GB DDR2 RAM). This result could have been expected beforehand, since the algorithm operates only on the histogram with a fixed number of bins and already sorted bin coordinates. The run time measurements in Fig. 7 (left) show the dependence on the number of pixels for five different images processed by [5], compared to ECEMD on the same images. It can be observed that the run time of ECEMD (without total variational smoothing) depends only weakly on the number of pixels, which originates from the dependence of the histogram creation on the number of pixels. ECEMD, processing all pixels, is faster than [5], taking into account that [5] only uses 5% of all pixels for clustering and furthermore requires the calculation of a similarity matrix of complexity O(n²), which takes 1200 seconds for 500 pixels and is not included in the time measurements. But even without the calculation of the similarity matrix, [5] is slower for more than 3000 pixels. Compared to [6], which has a time complexity linear in the number of pixels, as shown in the right graph of Fig. 7, ECEMD including histogram creation is faster. For small images the run times are comparable, while for an increasing number of pixels ECEMD is at least six times faster, yielding comparable segmentation results.
Fig. 7. Left: run time comparison of ECEMD and [5] without creation of similarity matrices. Right: run time comparison of ECEMD and [6].
5 Summary and Outlook
In summary, the Efficient Clustering Earth Mover's Distance (ECEMD) is formulated as an unsupervised convex integer program that applies the Earth Mover's Distance to directly separate fore- and background, inspired by the principles of maximum margin clustering. The implementation starts at one foreground point in feature space and calculates a short series of linear programs that subsequently include the nearest neighbours of the starting point into the foreground set until it deterministically converges to an integer optimum that lies close to the ground truth. If the chosen feature space is not distinctive enough to separate the classes correctly, spatial correlation among the pixels can be taken into account afterwards by applying total variational smoothing to enhance the result. As was shown in Section 4, the latter can only improve the result by a significant amount in those cases where the feature space is inappropriate to segment the data. But with recognition rates that surpass those of [5] and [6] and approximate the ground truth with more than 90% correctly assigned pixels in cases where the feature space is suitably chosen, there is no room for more than fine tuning. Advantages of ECEMD certainly lie in the facts that the bin number of the histogram is the only input parameter and that neither subsampling of the image nor a previous training step is required to process the data. Compared to other approaches that use the EMD, the hard cluster assignments of ECEMD are furthermore advantageous, as well as its short run time, which is independent of the number of input pixels in the current implementation that uses an intensity value histogram as feature space (see Section 3). Yet, exact benchmark tests, for example on the Berkeley benchmark, still remain to be carried out for the algorithm in order to prove its qualitatively high segmentation power on a large set of various images. From the results of such a test it could additionally be concluded whether the simple intensity feature suffices for all kinds of object categories or to what extent category-adapted features (e.g. texture, edges, colour) can improve the results. A benchmark test also offers the opportunity to find a common platform to compare ECEMD to other approaches
like k-means or maximum margin clustering, in order to investigate in which cases the maximisation of the inter-class variance by ECEMD excels over these approaches. In the future, improvements to the algorithm itself can be made in the form of user interaction, defining the starting point of the algorithm as one image pixel, patch or region that is contained in the object of interest. To achieve this, the algorithm could be included in an interactive painting framework as developed for GrabCut [11] or Ilastik [12]. As far as parallelisation is concerned, it is possible to exploit the experimental observation that ECEMD usually converges to clusters with equal numbers of bins, so that the EMD calculations for the different cluster partitions could be run in parallel and, after termination, the results compared to find the optimal assignments.

Acknowledgement. JW gratefully acknowledges the financial support of the Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences.
References

1. Vaserstein, L.N.: On the stabilization of the general linear group over a ring. Math. USSR-Sbornik 8, 383–400 (1969)
2. Peleg, S., Werman, M., Rom, H.: A unified approach to the change of resolution: Space and gray-level. IEEE Trans. Pattern Anal. Mach. Intell. 11, 739–742 (1989)
3. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vision 40, 99–121 (2000)
4. Ling, H., Okada, K.: An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Trans. Pattern Anal. Mach. Intell. 29, 840–853 (2007)
5. Pavan, M., Pelillo, M.: Dominant sets and pairwise clustering. IEEE Trans. Pattern Anal. Mach. Intell. 29, 167–172 (2007)
6. Cour, T., Benezit, F., Shi, J.: Spectral segmentation with multiscale graph decomposition. In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, vol. 2, pp. 1124–1131. IEEE Computer Society, Los Alamitos (2005)
7. Xu, L., Neufeld, J., Larson, B., Schuurmans, D.: Maximum margin clustering. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 17, pp. 1537–1544. MIT Press, Cambridge (2005)
8. Gieseke, F., Pahikkala, T., Kramer, O.: Fast evolutionary maximum margin clustering. In: ICML 2009: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 361–368. ACM, New York (2009)
9. Zass, R., Shashua, A.: A unifying approach to hard and probabilistic clustering. In: Proc. ICCV, vol. 1, pp. 294–301 (2005)
10. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)
11. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23, 309–314 (2004)
12. Sommer, C., Straehle, C., Köthe, U., Hamprecht, F.: Interactive learning and segmentation tool kit 0.8, rev 475 (2010)
Appendix: Images

Table 2. Images used for experimental evaluation
holes
horse part
horse
tiger
llama
man
birds
horses
swimmer
astronauts
One-Class Classification with Gaussian Processes

Michael Kemmler, Erik Rodner, and Joachim Denzler

Chair for Computer Vision, Friedrich Schiller University of Jena, Germany
{Michael.Kemmler,Erik.Rodner,Joachim.Denzler}@uni-jena.de
http://www.inf-cv.uni-jena.de
Abstract. Detecting instances of unknown categories is an important task for a multitude of problems such as object recognition, event detection, and defect localization. This paper investigates the use of Gaussian process (GP) priors for this area of research. Focusing on the task of one-class classification for visual object recognition, we analyze different measures derived from GP regression and approximate GP classification. Experiments are performed using a large set of categories and different image kernel functions. Our findings show that the well-known Support Vector Data Description is significantly outperformed by at least two GP measures which indicates high potential of Gaussian processes for one-class classification.
1 Introduction
Many applications have to deal with a large set of images from a single class (positive examples) and only few or zero learning examples from a counter class (negative examples). Learning a classifier in such situations is known as one-class classification (OCC), novelty detection, outlier detection, or density estimation. The latter concentrates on estimating a well-defined probability distribution from the data, whereas OCC is more related to binary classification problems without the need for normalization. Application scenarios often arise due to the difficulty of obtaining training examples for rare cases, such as images of defects in defect localization tasks [1] or image data from non-healthy patients in medical applications [2]. In these cases, one-class classification (OCC) offers to describe the distribution of positive examples and to treat negative examples as outliers which can be detected without explicitly learning their appearance. Another motivation to use OCC is the difficulty of describing a background or counter class. This problem of finding an appropriate unbiased set of representatives exists in the area of object detection or content based image retrieval [3, 4]. Earlier work concentrates on density estimation with parametric generative models such as single normal distributions or Gaussian mixture models [5]. These methods often make assumptions about the nature of the underlying distribution. Nowadays, kernel methods like one-class Support Vector Machines [6]
The first two authors contributed equally to this paper.
(also called 1-SVM) or the highly related Support Vector Data Description (SVDD [7]), make it possible to circumvent such assumptions in the original space of feature vectors using the kernel trick. Those methods inherit provable generalization properties from learning theory [6] and can handle even infinite-dimensional feature spaces. In this paper we propose several approaches to OCC with Gaussian process (GP) priors. Machine learning with GP priors allows kernel-based learning to be formulated in a Bayesian framework [8] and has proven to be competitive with SVM-based classifiers for binary and multi-class categorization of images [9]. Nevertheless, their use for OCC scenarios has mostly been studied in the case of proper density estimation [10], which requires sophisticated MCMC techniques to obtain a properly normalized density. Alternatively, Kim et al. [11] present a clustering approach which indirectly uses a GP prior and the predictive variance. Our new variants for OCC with GP priors include the idea of [11] as a special case. We further investigate the suitability of approximate inference methods for GP classification, such as Laplace approximation (LA) or Expectation Propagation (EP) [8]. We additionally show how to apply our GP-based methods and a common SVM-based method to the task of object recognition by utilizing kernel functions based on pyramid matching [12]. The remainder of the paper is structured as follows. First, we briefly review classification with GP priors. Our approach to one-class classification with GP is presented in Sect. 3. Section 4 describes Support Vector Data Description, a baseline method used for comparison. Kernel-based OCC for visual object recognition, as explained in Sect. 5, is utilized to evaluate and compare all methods and their variants in Sect. 6. A summary of our findings and a discussion of future research directions conclude the paper.
2 Classification via Gaussian Processes
This section gives a brief introduction to GP classification. Since classification is motivated from non-parametric Bayesian regression, we first briefly introduce the regression case with real-valued outputs y ∈ R, before we discuss approximate methods for GP classification with binary labels y ∈ {−1, 1}.

2.1 The Regression Case
The regression problem aims at finding a mapping from input space $\mathcal{X}$ to output space $\mathbb{R}$ using labeled training data $X = [x_1, \ldots, x_n] \in \mathcal{X}^n$, $\mathbf{y} = [y_1, \ldots, y_n]^T \in \mathbb{R}^n$. In the following it is assumed that an output $y$ is generated by a latent function $f: \mathcal{X} \to \mathbb{R}$ and additive noise $\varepsilon$, i.e. $y = f(x) + \varepsilon$. Rather than restricting $f$ to a certain parametric family of functions, we only assume that the function is drawn from a specific probability distribution $p(f|X)$. This allows for a Bayesian treatment of our problem, i.e. we infer the probability of output $y_*$ given a new input $x_*$ and old observations $X, \mathbf{y}$ by integrating out the corresponding non-observed function values $f_* = f(x_*)$ and $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^T$:
$$p(y_* \mid X, \mathbf{y}, x_*) = \int p(f_* \mid X, \mathbf{y}, x_*)\, p(y_* \mid f_*)\, df_* \tag{1}$$

$$p(f_* \mid X, \mathbf{y}, x_*) = \int p(f_* \mid X, \mathbf{f}, x_*)\, p(\mathbf{f} \mid X, \mathbf{y})\, d\mathbf{f} \tag{2}$$

The central assumption in GP regression is a GP prior over latent functions, i.e. all function values are jointly normally distributed:

$$\mathbf{f} \mid X \sim \mathcal{N}\bigl(m(X), \kappa(X, X)\bigr) \tag{3}$$

This distribution is solely specified by the mean function $m(\cdot)$ and covariance function $\kappa(\cdot, \cdot)$. If we further assume that the additive noise $\varepsilon$ is modelled by a zero-mean Gaussian distribution, i.e. $p(y \mid f) = \mathcal{N}(f, \sigma_n^2)$, we are able to solve the integrals in closed form. Using a zero-mean GP, the predictive distribution (2) is again Gaussian [8] with moments

$$\mu_* = \mathbf{k}_*^T \bigl(\mathbf{K} + \sigma_n^2 \mathbf{I}\bigr)^{-1} \mathbf{y} \tag{4}$$

$$\sigma_*^2 = k_{**} - \mathbf{k}_*^T \bigl(\mathbf{K} + \sigma_n^2 \mathbf{I}\bigr)^{-1} \mathbf{k}_* \tag{5}$$

using the abbreviations $\mathbf{K} = \kappa(X, X)$, $\mathbf{k}_* = \kappa(X, x_*)$, and $k_{**} = \kappa(x_*, x_*)$. Since we assume i.i.d. Gaussian noise, this also implies that (1) is normally distributed with mean $\mu_*$ and variance $\sigma_*^2 + \sigma_n^2$.
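As a minimal illustration of eqs. (4) and (5) (a sketch, not the authors' implementation, assuming an RBF covariance function and illustrative hyperparameters), the predictive moments can be computed as follows.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # kappa(a, b) = exp(-gamma * ||a - b||^2) for all row pairs of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gp_regression(X, y, X_star, sigma_n=0.1, gamma=1.0):
    """Predictive mean and variance of zero-mean GP regression, eqs. (4) and (5)."""
    K = rbf_kernel(X, X, gamma)
    k_star = rbf_kernel(X, X_star, gamma)          # one column k_* per test point
    A = K + sigma_n ** 2 * np.eye(len(X))
    mu = k_star.T @ np.linalg.solve(A, y)          # eq. (4)
    var = np.diag(rbf_kernel(X_star, X_star, gamma)) \
          - (k_star * np.linalg.solve(A, k_star)).sum(0)  # eq. (5)
    return mu, var
```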
2.2 From Regression to Classification
The goal in binary GP classification is to model a function which predicts a confidence for each class $y \in \{-1, 1\}$, given a feature vector $x$. In order to make probabilistic inference about the output given a training set, we can directly apply the Bayesian formalism from eq. (1) and (2). However, the key problem is that the assumption of Gaussian noise no longer holds, since the output space is discrete. We could either ignore this issue and perform regression on our labels, or we could use a more appropriate likelihood such as the cumulative Gaussian

$$p(y \mid f) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{yf} \exp\left(-\frac{1}{2} x^2\right) dx \tag{6}$$

The disadvantage of the latter procedure is that our predictive distribution (2) is no longer a normal distribution. To overcome this issue, we follow the standard approach of approximating the posterior $p(\mathbf{f} \mid X, \mathbf{y})$ with a normal distribution $\hat{p}(\mathbf{f} \mid X, \mathbf{y})$. Two well-known approaches, which are also used in this work, are Laplace approximation (LA) and Expectation Propagation (EP). The interested reader is referred to [8]. For the final prediction step, the approximations $\hat{p}(\mathbf{f} \mid X, \mathbf{y})$ are used to solve (1). Using both Gaussian approximations to the posterior (2) and cumulative Gaussian likelihoods $p(y \mid f)$, it can be shown that the predictive distribution (1) is again a cumulative Gaussian and can thus be evaluated in closed form [8].
Fig. 1. GP regression using a zero-mean GP prior in a one-dimensional OCC setting. The predictive distribution is visualized via its mean and the corresponding confidence interval (scaled variances); training points are marked as crosses.
3 One-Class Classification with Gaussian Process Priors
GP approaches from Sect. 2 are discriminative techniques to tackle the problem of classification [8]. This follows from equation (1), where the conditional density $p(y_* \mid X, \mathbf{y}, x_*)$ is modeled. Using discriminative classification techniques directly for one-class classification is a non-trivial task, due to the fact that the density of the input data is not taken into account. Nevertheless, utilizing a properly chosen Gaussian process prior enables us to derive useful membership scores for OCC. The main idea is to use a prior mean with a smaller value than our positive class labels (e.g. y = 1), such as a zero mean. This restricts the space of probable latent functions to functions whose values gradually decrease far away from observed points. In combination with choosing a smooth covariance function, an important subset of latent functions is obtained which can be employed for OCC (cf. Fig. 1). This highlights that the predictive probability $p(y_* = 1 \mid X, \mathbf{y}, x_*)$ can be utilized, in spite of being a discriminative model. Due to the fact that the predictive distribution is solely described by its first and second order moments, it is natural to also investigate the power of the predictive mean and variance as alternative membership scores. Their suitability is illustrated in Fig. 1: the mean decreases for inputs distant from the training data and can be directly utilized as an OCC measure. In contrast, the predictive variance $\sigma_*^2$ increases, which suggests that the negative variance value can serve as an alternative criterion for OCC. The latter concept is used in the context of clustering by Kim et al. [11]. Additionally, Kapoor et al. [9] propose the predictive mean divided by the standard deviation as a combined measure for describing the uncertainty of the estimation and applied this heuristic successfully in the field of active learning. All variants, which are summarized in Table 1, are available for GP regression and for approximate GP classification with Laplace approximation or Expectation Propagation. The different membership scores produced by the proposed measures are visualized in Fig. 2 using an artificial two-dimensional example. In the following sections we additionally motivate the use of the mean and variance of GP regression by highlighting their strong relationships to Parzen estimation and normal density distributions, respectively.

Table 1. Different measures derived from the predictive distribution which are suitable as OCC membership scores

Mean (M):           $\mu_* = \mathbb{E}(y_* \mid X, \mathbf{y}, x_*)$
neg. Variance (V):  $-\sigma_*^2 = -\mathbb{V}(y_* \mid X, \mathbf{y}, x_*)$
Probability (P):    $p(y_* = 1 \mid X, \mathbf{y}, x_*)$
Heuristic (H):      $\mu_* \cdot \sigma_*^{-1}$

Fig. 2. One-class classification using GP regression (Reg) and the measures listed in Table 1 (panels from left to right: Reg-P, Reg-M, Reg-V, Reg-H). All measures capture the distribution quite well.
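Given the predictive moments of GP regression from Sect. 2.1 (e.g. as sketched there), the regression-based membership scores of Table 1 reduce to a few lines. The probability score below is instantiated, as an assumption for illustration, by evaluating the Gaussian predictive density at y = 1; the paper's LA/EP variants instead require approximate GP classification.

```python
import numpy as np

def occ_scores(mu, var, sigma_n=0.1):
    """Membership scores from Table 1 for GP-regression-based OCC (higher = more 'inlier')."""
    return {
        "M": mu,                       # predictive mean
        "V": -var,                     # negative predictive variance
        "H": mu / np.sqrt(var),        # heuristic mu / sigma
        # illustrative choice: Gaussian predictive density evaluated at the positive label y = 1
        "P": np.exp(-0.5 * (1.0 - mu) ** 2 / (var + sigma_n ** 2))
             / np.sqrt(2 * np.pi * (var + sigma_n ** 2)),
    }
```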
3.1 Predictive Mean Generalizes Parzen Estimators
There exist various techniques which aim at constructing a proper probability density of observed data. Kernel density estimation, also known as Parzen estimation, is one strategy to achieve this goal. This non-parametric method constructs a probability density by superimposing similarity functions (kernels) on observed data points. Our one-class classification approach using the mean of GP regression has a tight relationship to Parzen estimators. If we assume noise-free observations ($\sigma_n = 0$) and no correlations between training examples ($\mathbf{K} = \mathbf{I}$), the regression mean (4) simplifies to

$$\mu_* = \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{y} \propto \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl[\mathbf{K}^{-1}\bigr]_{ij} \cdot \kappa(x_*, x_i) = \frac{1}{n} \sum_{i=1}^{n} \kappa(x_*, x_i) \tag{7}$$

Please note that by construction $\mathbf{y} = (1, \ldots, 1)^T$. The proposed GP regression mean can hence be seen as an unnormalized Parzen density estimation technique using a scaling matrix $\mathbf{K}$, which is implicitly given by the covariance function $\kappa(\cdot, \cdot)$.
3.2 Predictive Variance Models a Gaussian in Feature Space
One approach to describe the data is to estimate a normal distribution in a feature space $\mathcal{H}$ induced via a mapping $\Phi: \mathcal{X} \to \mathcal{H}$. It has been shown by Pękalska et al. [13] that computing the variance term in GP regression is equal to the Mahalanobis distance (to the data mean in feature space) if the regularized ($\sigma_n > 0$) kernel-induced scaling matrix $\Sigma = \tilde{\Phi}(X)^T \tilde{\Phi}(X) + \sigma_n^2 \mathbf{I}$ is used:

$$\tilde{\Phi}(x)^T \Sigma^{-1} \tilde{\Phi}(x) \propto \tilde{\kappa}(x, x) - \tilde{\kappa}(x, X)\bigl(\tilde{\kappa}(X, X) + \sigma_n^2 \mathbf{I}\bigr)^{-1} \tilde{\kappa}(X, x) \tag{8}$$

where the tilde indicates operations on zero-mean normalized data, i.e. $\tilde{\Phi}(x) = \Phi(x) - n^{-1} \sum_{i=1}^{n} \Phi(x_i)$ and $\tilde{\kappa}(x, x') = \tilde{\Phi}(x)^T \tilde{\Phi}(x')$. Since the GP variance argument from (5) does not utilize centered kernel matrices, we effectively use the logarithm of the unnormalized zero-mean Gaussian which best describes the data. For the case of constant $\kappa(x, x)$, we are softly "slicing" a hyperellipsoid from the data points, which are distributed onto a sphere in feature space.
4 Support Vector Data Description as Baseline Method
In the following section, we present one-class classification with Support Vector Data Description (SVDD) [14], which will be used for comparison due to its common use in OCC applications. The main idea of this approach is to estimate the smallest hypersphere, given by a center c and radius R, which encloses our training data X. This problem can be tackled in various ways [15], but it is most conveniently expressed in the quadratic programming framework [14]. In addition to assuming that all training points reside within the hypersphere, one can also account for outliers using a soft variant which allows some points $x_i$ to be a (squared) distance $\xi_i$ away from the hypersphere. Operating in a feature space induced via a mapping $\Phi: \mathcal{X} \to \mathcal{H}$ to some Hilbert space $\mathcal{H}$, this objective can be formulated as the following optimization problem:

$$\min_{c, R, \xi}\; R^2 + \frac{1}{\nu n} \sum_{i \in I} \xi_i \quad \text{subject to} \quad \|\Phi(x_i) - c\|_2^2 \leq R^2 + \xi_i,\; \xi_i \geq 0 \tag{9}$$

for all $i \in I := \{1, \ldots, n\}$. The parameter $\nu \in (0, 1]$ controls the impact that hypersphere outliers have on the optimization objective; e.g. using small values of $\nu$ leads to a strongly regularized objective and thus promotes solutions where few data points lie outside the hypersphere. It should be noted that the Euclidean distance in equation (9) can be generalized to a Bregmanian distance [16]. The primal problem can be transferred to the dual optimization problem, which is solely expressed in terms of inner products $\kappa(x, x') = \Phi(x)^T \Phi(x')$ and thus enables the use of the kernel trick:

$$\max_{\alpha}\; \mathrm{diag}(\mathbf{K})^T \alpha - \alpha^T \mathbf{K} \alpha \tag{10}$$

$$\text{subject to} \quad 0 \leq \alpha_i \leq \frac{1}{\nu n} \text{ for } i \in I \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i = 1 \tag{11}$$

where the center of the ball can be expressed as $c = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$ and the squared radius is given by $R^2 = \max_i \|\Phi(x_i) - c\|_2^2$, taking into account all indices $i \in I$ satisfying $0 < \alpha_i < \frac{1}{\nu n}$. A newly observed point $\Phi(x_*)$ thus lies within the hypersphere if $\|\Phi(x_*) - c\|_2^2 = \alpha^T \mathbf{K} \alpha - 2\mathbf{k}_*^T \alpha + k_{**} \leq R^2$ holds, using the abbreviations from Sect. 2.1. It should be noted that there exists a tight relationship to Support Vector Machines. It can be shown that for kernels satisfying $\kappa(x, x') = \text{const}$ for all
$x, x' \in \mathcal{X}$, the SVDD problem is equal to finding the hyperplane which separates the data from the origin with the largest margin (1-SVM [6]). Other loss functions can also be integrated, e.g. quadratic loss instead of hinge loss, which leads to Least-Squares Support Vector Machines (LS-SVM, [17]). It is interesting to note that LS-SVM with a zero bias term is equivalent to using the predictive mean estimated by GP regression. Our extension to one-class classification problems, therefore, directly corresponds to the work of Choi et al. [18] in this special case.
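Because of this equivalence for kernels with constant $\kappa(x, x)$, an SVDD-style baseline can be sketched with an off-the-shelf one-class SVM. The snippet below uses scikit-learn's OneClassSVM as a stand-in (not the authors' implementation), with an RBF kernel and ν chosen purely for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# X_train: positive examples only; X_test: mixture of positives and outliers (random placeholders)
X_train = np.random.rand(100, 16)
X_test = np.random.rand(20, 16)

occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.5).fit(X_train)
membership = occ.decision_function(X_test)   # larger values = more likely to belong to the class
```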
5 One-Class Classification for Visual Object Recognition
Our evaluation of one-class classification with GP priors is based on binary image categorization problems. In the following section we describe how to calculate an image-based kernel function (image kernel) with color features. In addition, we also utilize the pyramid of oriented gradients (PHoG) kernel, which is based on grayscale images only [19].

Bag-of-Features and Efficient Clustering. The Bag-of-Features (BoF) approach [12] has developed over the last years into one of the state-of-the-art feature extraction methods for image categorization problems [20]. Each image is represented as a set of local features which are clustered during learning. This clustering is utilized to compute histograms which can be used as single global image descriptors. We extract Opponent-SIFT local features [20] for each image by dense sampling of feature points with a horizontal/vertical pixel spacing of 10 pixels. Clustering is often done by applying k-means to a small subset of local features. In contrast, it was shown by Moosmann et al. [21] that a Random Forest can be utilized as a very fast clustering technique. The set of local features is iteratively split using mutual information and class labels. Finally, the leaves of the forest represent clusters which can be utilized for BoF. We extend this technique to cluster local features of a single class by selecting a completely random feature and its median value in each inner node to perform randomized clustering.

Spatial Pyramid Matching. Given global image descriptors, one could easily apply kernel functions such as radial basis function (RBF) or polynomial kernels. The disadvantage of this method is that these kernels do not use the position of local features as an additional cue for the presence of an object. The work of Lazebnik et al. [12] shows that incorporating coarse, absolute spatial information with the spatial pyramid matching kernel (S-PMK) is advantageous for image categorization. To measure the similarity between two images with S-PMK, the image is divided recursively into different cells (e.g. by applying a 2 × 2 grid). As in [12], BoF histograms are calculated using the local features in each cell and combined with weighted histogram intersection kernels.
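The histogram intersection kernel used as the building block of such image kernels is simple to state; the following lines are a generic sketch and do not reproduce the exact pyramid weighting of [12].

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two (normalised) BoF histograms: sum of bin-wise minima."""
    return np.minimum(h1, h2).sum()

def intersection_kernel_matrix(H_a, H_b):
    """Gram matrix of histogram intersection kernels between two sets of histograms."""
    return np.minimum(H_a[:, None, :], H_b[None, :, :]).sum(-1)
```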
Table 2. Mean AUC performance of OCC methods, averaged over all 101 classes. Bold font is used when all remaining measures are significantly outperformed. GP measures significantly superior to SVDDν (with optimal ν) are denoted in italic font.

        Reg-P   Reg-M   Reg-V   Reg-H   LA-P    EP-P
PHoG    0.696   0.693   0.692   0.696   0.684   0.683
Color   0.761   0.736   0.766   0.755   0.748   0.747

        LA-M    EP-M    LA-V    EP-V    SVDD0.5  SVDD0.9
PHoG    0.684   0.683   0.686   0.685   0.690    0.685
Color   0.745   0.744   0.758   0.757   0.739    0.746
6 Experiments
In this section we analyze the proposed approach and its variants empirically, which results in the following main outcomes:

1. OCC with the variance criterion estimated by GP regression (Reg-V) is significantly better than all other methods using the color image kernels, and it outperforms SVDD for various values of the outlier ratio ν (Sect. 6.1).
2. Approximate GP classification with LA and EP does not lead to a better OCC performance (Sect. 6.1).
3. The performance of the mean of GP regression (Reg-M) varies dramatically for different categories and can even decrease with an increasing amount of training data (Sect. 6.2).
4. Parameterized image kernels offer additional performance boosts with the disadvantage of additional parameter tuning (Sect. 6.3).
5. OCC with image kernels is able to learn the appearance of specific subcategories (Sect. 6.4).

We perform experiments with all 101 object categories of the Caltech 101 database [22]. As performance measure we use the area under the ROC curve (AUC), which is estimated by 50 random splits into training and testing data. In each case a specific number of images from a selected object category is used for training. Testing data consists of the remaining images from the category and all images of the Caltech background category.
6.1 Evaluation of One-Class Classification Methods
To assess the OCC performance, we use 15 randomly chosen examples for training. Instead of performing a detailed analysis of each single category, we average the AUC over all classes and random repetitions to yield a final performance summary for each OCC method. Based on this performance assessment scheme, we compare the predictive probability (-P), mean (-M) and variance (-V) of GP regression (Reg) and of GP classification using Laplace approximation (LA) or Expectation Propagation (EP), respectively. We additionally analyze the heuristic $\mu_* \cdot \sigma_*^{-1}$ for GP regression (Reg-H) and compare with SVDD using outlier fraction $\nu \in \{0.1, 0.2, \ldots, 0.9\}$ (SVDDν). The results for PHoG and color features are displayed in Table 2, which, for the sake of readability, lists only the best performing SVDD measures. It can be immediately seen that PHoG features are significantly inferior to color features. Therefore, the experiments in subsequent sections only deal with color-based image kernels. Although the average performances of all measures are quite similar, SVDD is significantly outperformed for all tested ν (t-test, p ≤ 0.025) by at least two GP measures. The method of choice for our task is the GP regression variance (Reg-V), which significantly outperforms all other methods using color features. Employing PHoG-based image kernels, Reg-V also achieves at least comparable performance to SVDD for any tested parameter ν. Our results also highlight that making inference with cumulative Gaussian likelihoods does not generally improve OCC, since LA and EP are consistently outperformed by the GP regression measures Reg-V and Reg-P. Hence, the proposed OCC measures do not benefit from the noise model of (6) (and the corresponding approximations), which is more suitable for classification in a theoretical sense.

Fig. 3. Results for color-feature-based image kernels for the classes Faces (left) and Leopards (right) with a varying number of training examples (both plots use the same legend: Reg-M, Reg-V, LA-V, SVDD0.5, SVDD0.9)
6.2 Performance with an Increasing Number of Training Examples
To obtain an asymptotic performance behavior of all outlier detection methods, we repeat the experiments of Sect. 6.1 with a densely varying number of training examples. As can be seen in Fig. 3, the performance behavior highly depends on the class. Classifying Faces, the performance increases with a higher number of training examples in almost all cases. A totally different behavior, however, is observed for Leopards, where the averaged AUCs of Reg-M (and related Reg-P and Reg-H) substantially decrease when more than 8 training examples are used. For small ν, SVDD also exhibits this behavior in our experiments.
6.3 Influence of an Additional Smoothness Parameter
Estimating the correct smoothness of the predicted distribution is one of the major problems in one-class classification and density estimation. This smoothness is often controlled by a parameterized kernel, such as an RBF kernel. In contrast, the kernel functions used here are not parameterized, and the decreasing performance of the Reg-M method in the last experiment might be due to this inflexibility. For further investigation, we parameterize our image kernel function by transforming it into a metric, which is then plugged into a distance substitution kernel [23]:

$$\kappa_\beta(x, x') = \exp\bigl(-\beta \left(\kappa(x, x) - 2\kappa(x, x') + \kappa(x', x')\right)\bigr)$$

We perform experiments with $\kappa_\beta$ utilizing 100 training examples and a varying value of β. The results for the categories Faces and Leopards are plotted in Fig. 4. Let us first have a look at the right plot and the results for Leopards. With a small value of β, the performance is comparable to the unparameterized version (cf. Fig. 3, right side). However, by increasing the parameter we achieve a performance above 0.9, superior to other methods such as Reg-V. This behavior differs significantly from the influence of β on the performance for the task Faces, which decreases after a small maximum. Right after the displayed points, we ran into severe numerical problems in both settings due to small kernel values below double precision. We expect a similar gain in performance from tuning the scale parameter of the cumulative Gaussian noise model, but we leave this investigation to future research. This analysis shows that introducing an additional smoothness hyperparameter offers great potential, though efficient optimization using the training set is as yet unsolved.

Fig. 4. Influence of an additional smoothness parameter β of a re-parameterized image kernel on the OCC performance (Reg-M, AUC plotted over log β) for the categories Faces (left) and Leopards (right)
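The re-parameterization itself is a one-liner on top of a precomputed kernel matrix; this sketch assumes the base kernel values are already available and β is supplied by the user.

```python
import numpy as np

def distance_substitution_kernel(K, beta=1.0):
    """Turn a base kernel matrix K into kappa_beta via the induced squared distance."""
    diag = np.diag(K)
    sq_dist = diag[:, None] - 2.0 * K + diag[None, :]   # kappa(x,x) - 2*kappa(x,x') + kappa(x',x')
    return np.exp(-beta * sq_dist)
```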
6.4 Qualitative Evaluation
Instead of separating a specific category from the provided background images of the Caltech database, we also apply our OCC methods to the difficult task of estimating a membership score of a specific sub-category. We therefore trained our GP methods and SVDD using 30 images of a type of chair called windsor chair, which has a characteristic wooden backrest. The performance is tested on all remaining windsor chairs and images of the category chairs. Results are illustrated in Fig. 5 with the best and last ranked images. The qualitative results are similar, but the AUC values clearly show that Reg-V is superior.
Fig. 5. Results obtained by training different OCC methods on windsor chair and separating against chair: for Reg-V (AUC = 0.876) and SVDD0.9 (AUC = 0.766), the three best ranked images (all of them characteristic windsor chairs) and the three last ranked images are shown with the corresponding output values.
7 Conclusions and Further Work
We presented an approach for one-class classification (OCC) with Gaussian process (GP) priors and studied the suitability of different measures, such as the mean and variance of the predictive distribution. The GP framework allows the use of different approximation methods to handle the underlying classification problem, such as regression, Laplace approximation and Expectation Propagation. All aspects are compared against the popular Support Vector Data Description (SVDD) approach of Tax and Duin [14] in a visual object recognition application using state-of-the-art image kernels [12, 19]. It turns out that using the predictive variance of GP regression is the method of choice and, compared to SVDD, achieves significantly better results for object recognition tasks with color features. Using the estimated mean frequently leads to good results, but the outcome highly depends on the given classification task. We also showed that the introduction of an additional smoothness parameter can lead to a large performance gain. Thus, estimating correct hyperparameters of the re-parameterized kernel function would be an interesting topic for future research. Another idea worth investigating is the use of specifically tailored noise models for OCC. In general, our new approach to OCC is not restricted to object recognition, and its application to tasks such as defect localization would be interesting.
References

1. Hodge, V., Austin, J.: A survey of outlier detection methodologies. Artificial Intelligence Review 22, 85–126 (2004)
2. Tarassenko, L., Hayton, P., Cerneaz, N., Brady, M.: Novelty detection for the identification of masses in mammograms. In: Fourth International Conference on Artificial Neural Networks, pp. 442–447 (1995)
3. Chen, Y., Zhou, X., Huang, T.S.: One-class SVM for learning in image retrieval. In: Proceedings of the IEEE Conference on Image Processing (2001)
4. Lai, C., Tax, D.M.J., Duin, R.P.W., Pękalska, E., Paclík, P.: On combining one-class classifiers for image database retrieval. In: Proceedings of the Third International Workshop on Multiple Classifier Systems, pp. 212–221 (2002)
5. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. Springer, Heidelberg (2007)
6. Schölkopf, B., Platt, J.C., Shawe-Taylor, J.C., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Comput. 13, 1443–1471 (2001)
7. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54, 45–66 (2004)
8. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, Cambridge (2005)
9. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Gaussian processes for object categorization. International Journal of Computer Vision 88, 169–188 (2010)
10. Adams, R.P., Murray, I., MacKay, D.: The Gaussian process density sampler. In: NIPS (2009)
11. Kim, H.C., Lee, J.: Pseudo-density estimation for clustering with Gaussian processes. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3971, pp. 1238–1243. Springer, Heidelberg (2006)
12. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2169–2178 (2006)
13. Pękalska, E., Haasdonk, B.: Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1017–1032 (2009)
14. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54, 45–66 (2004)
15. Yildirim, E.A.: Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization 19, 1368–1391 (2008)
16. Crammer, K., Singer, Y.: Learning algorithms for enclosing points in bregmanian spheres. In: Learning Theory and Kernel Machines, pp. 388–402. Springer, Heidelberg (2003)
17. Suykens, J.A.K., Gestel, T.V., Brabanter, J.D., Moor, B.D., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific Pub. Co., Singapore (2002)
18. Choi, Y.S.: Least squares one-class support vector machine. Pattern Recogn. Lett. 30, 1236–1240 (2009)
19. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 401–408 (2007)
20. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2010)
21. Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: NIPS, pp. 985–992 (2006)
22. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 594–611 (2006)
23. Vedaldi, A., Soatto, S.: Relaxed matching kernels for object recognition. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (2008)
A Fast Semi-inverse Approach to Detect and Remove the Haze from a Single Image

Codruta O. Ancuti 1, Cosmin Ancuti 1,2, Chris Hermans 1, and Philippe Bekaert 1

1 Hasselt University - tUL - IBBT, Expertise Center for Digital Media, Wetenschapspark 2, Diepenbeek, 3590, Belgium
2 University Politehnica Timisoara, Piata Victoriei 2, 300006, Romania
Abstract. In this paper we introduce a novel approach to restore a single image degraded by atmospheric phenomena such as fog or haze. The presented algorithm allows for fast identification of hazy regions of an image, without making use of expensive optimization and refinement procedures. By applying a single per-pixel operation on the original image, we produce a 'semi-inverse' of the image. Based on the hue disparity between the original image and its semi-inverse, we are then able to identify hazy regions on a per-pixel basis. This enables a simple estimation of the airlight constant and the transmission map. Our approach is based on an extensive study on a large data set of images, and validated with a metric that measures not only the contrast but also the structural changes. The algorithm is straightforward and performs faster than existing strategies while yielding comparable and even better results. We also provide a comparative evaluation against other recent single image dehazing methods, demonstrating the efficiency and utility of our approach.
1 Introduction
In outdoor environments, light reflected from object surfaces is commonly scattered due to the impurities of the aerosol, or the presence of atmospheric phenomena such as fog and haze. Aside from scattering, the absorption coefficient presents another important factor that attenuates the reflected light of distant objects reaching the camera lens. As a result, images taken in bad weather conditions (or similarly, underwater and aerial photographs) are characterized by poor contrast, lower saturation and additional noise. Image processing applications commonly assume a relatively transparent transmission medium, unaltered by the atmospheric conditions. Outdoor vision applications such as surveillance systems, intelligent vehicles, satellite imaging, or outdoor object recognition systems need optimal visibility conditions in order to detect and process extracted features in a reliable fashion. Since haze degradation effects depend on the distance, as disclosed by previous studies [1, 2] and observed as well in our experiments (see Fig. 1), standard contrast enhancement filters such as histogram stretching and equalization, linear mapping, or gamma correction are too limited to perform the required task, introducing halo artifacts and distorting the color.

Fig. 1. Limitations of standard techniques. From left to right: initial foggy images, histogram equalization, local contrast stretching and our restored result.

The contrast degradation of a hazy image is both multiplicative and additive. Practically, the haze effect is described by two unknown components: the airlight contribution and the direct attenuation related to the surface radiance. The color ambiguity of the radiance is due to the additive airlight, which increases exponentially with the distance. Enhancing the visibility of such images is not a trivial task, as it poses an inherently underconstrained problem. A reliable restoration requires an accurate estimation of both the true colors of the scene and the transmission map, which is closely related to the depth map.

Recently, there has been an increased interest in the vision and graphics communities in dehazing single images [1–5]. In this paper we introduce an alternative approach to solve this challenging problem. Our technique is based on the observation that the distance from the observer to the scene objects is highly correlated with the contrast degradation and the fading of the object colors. More specifically, an extensive study disclosed an important difference between hazy and non-hazy image regions when performing a per-pixel comparison of the hue values in the original image with their values in a 'semi-inversed' image. This 'semi-inversed' image is obtained by replacing the RGB values of each pixel on a per-channel basis by the maximum of the initial channel value (r, g or b) and its inverse (1 − r, 1 − g or 1 − b), followed by an image-wide renormalization. This observation has been validated on a large set of images, and allows for the detection of the hazy image regions by applying only a single simple operator. This facilitates the estimation of the constant airlight color, and enables us to compute a good approximation of the haze-free image using a layer-based approach.

Contributions. This paper introduces the following three main contributions:
- First, we introduce a novel single-image algorithm for the automatic detection of hazy regions.
- Second, our approach works on a per-pixel basis. This makes it suitable for parallelization, and allows us to retain sharp detail near edges.
- Finally, our layer-based fusion dehazing strategy yields comparable and even better restored results than the existing approaches while performing faster, making it suitable for real-time applications.
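A minimal sketch of the per-pixel semi-inverse operation and the hue comparison it enables is given below. It is not the authors' code: the hue-disparity threshold is an illustrative assumption, the renormalization is reduced to a simple rescaling, and the assumption that hazy pixels are those whose hue changes little under the semi-inverse is an interpretation of the description above.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def semi_inverse(image):
    """Per-channel semi-inverse: max(channel, 1 - channel), rescaled to [0, 1]."""
    s = np.maximum(image, 1.0 - image)           # values end up in [0.5, 1]
    return (s - 0.5) * 2.0                       # simple image-wide renormalization

def hazy_mask(image, hue_threshold=10.0 / 360.0):
    """Flag pixels whose hue barely changes between the image and its semi-inverse."""
    h_orig = rgb_to_hsv(image)[..., 0]
    h_semi = rgb_to_hsv(semi_inverse(image))[..., 0]
    disparity = np.abs(h_orig - h_semi)
    disparity = np.minimum(disparity, 1.0 - disparity)   # hue is circular
    return disparity < hue_threshold
```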
2 Related Work
An important area of applications for dehazing algorithms can be found in multi-spectral remote sensing applications, where specialized sensors installed on satellites capture a specific band of the reflected light spectrum. Due to aerosol impurities and cloud obstruction, the recorded images require specific processing techniques [6, 7] to recover the original information. Many haze-removal techniques have used additional information in order to facilitate the search for a solution to this underconstrained problem. Some methods [8, 9] employ multiple images of the same scene, taken under various atmospheric conditions or combined with near-infrared version [10]. Polarization methods [11–13] exploit the fact that airlight is partially polarized. By taking the difference of two images of the same scene under different polarization angles, it becomes possible to estimate the magnitude of the polarized haze light component. Another category of techniques assume a known model of the scene. Narasimhan and Nayar [14] employ a rough approximated depth-map obtained by user assistance. Deep Photo [15] uses the existing georeferenced digital terrain and urban models to restore foggy images. By using an iterative registration method, they align the 3D model with the outdoor images, producing the depth map required by the restoration process. Recently, several single image based methods [1–5] have been introduced. The method of Fattal [1] uses a graphical model that solves the ambiguity of airlight color based upon the assumption that image shading and scene transmission are locally uncorrelated. The approach of [5] proposed a related method with [1] solution. This method models the image with a factorial MRF and computes the albedo and depth independently like two statistically independent latent layers. In Tan’s approach [2], the restoration aims to maximize the local contrast. He et al. [3] employ the dark channel image prior, based on statistical observation of haze-free outdoor images, in order to generate a rough estimation of the transmission map. Subsequently, due to the fact that they approximate the scene using patches of a fixed size, a matting strategy is required in order to extrapolate the value into unknown regions, and refine the depth-map. Tarel and Hautiere [4] introduced a contrast-based enhancing approach to remove the haze effects, aimed at being faster than the previous approaches.
3 The Optical Model
The optical model used in this paper is similar to the one employed in previous single image dehazing methods [1–4], and was initially described by Koschmieder [16]. For the sake of completeness, a brief description of this model is presented below. When examining an outdoor scene from an elevated position, features gradually appear lighter and more faded the closer they are to the horizon. Only a fraction of the reflected light reaches the observer, as a result of the absorption in the atmosphere. Furthermore, this light gets mixed with the airlight [16] color vector, and due to the scattering effects the scene color is shifted (illustrated in Fig. 2).
Fig. 2. Optical model
Based on this observation, the captured image of a hazy scene Ih is represented by a linear combination of the direct attenuation D and airlight A contributions:

Ih = D + A = I · t(x) + A∞ · (1 − t(x))    (1)
where Ih is the image degraded by haze, I is the scene radiance or haze-free image, A∞ is the constant airlight color vector and t is the transmission along the cone of vision. This problem is clearly ill-posed, and requires us to recover the unknowns I, A∞ and t(x) from only a single input image Ih. In a homogeneous atmosphere, the transmission t is modulated as

t(x) = exp(−β · d(x))    (2)
where β is the attenuation coefficient of the atmosphere due to the scattering and d represents the distance to the observer. From equation (1), it becomes apparent that the chrominance attenuation becomes increasingly influenced by the airlight as the optical depth increases:

\frac{A}{D} = \frac{A_\infty \cdot (1 - t(x))}{I \cdot t(x)}    (3)
Theoretically, if the transmission and the airlight are known, the haze-free image can be easily computed:

I = A∞ − (A∞ − Ih) / t(x)    (4)
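As a concrete illustration of equation (4), the short NumPy sketch below (our own, not part of the original method) recovers a haze-free estimate once a transmission map and the airlight colour are available; the lower bound on t is a common safeguard, assumed here, against dividing by very small transmission values.

```python
import numpy as np

def recover_scene_radiance(I_h, t, A_inf, t_min=0.1):
    """Invert the optical model of Eq. (4): I = A_inf - (A_inf - I_h) / t.

    I_h   : hazy image, float array of shape (H, W, 3) with values in [0, 1]
    t     : transmission map, shape (H, W), values in (0, 1]
    A_inf : airlight colour vector, shape (3,)
    t_min : lower clamp on t (an assumed safeguard, not from the paper)
    """
    t = np.clip(t, t_min, 1.0)[..., None]            # broadcast over colour channels
    I = A_inf[None, None, :] - (A_inf[None, None, :] - I_h) / t
    return np.clip(I, 0.0, 1.0)
```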
4 Haze Detection
The dark object method [6] is a well-known technique within the remote sensing community, where it is employed to remove haze from homogeneous scenes. More recently, He et al. [3] have presented a new derivation of this approach, called the dark channel strategy. A disadvantage of this new method is its inability to properly preserve edges, which is caused mainly by the employed erosion filter during the stage of computing the dark channel. In order to recover the refined
transmission map and the latent image, this patch-based approach requires a complex postprocessing stage. By employing the dark channel prior [3], it has been shown that, for non-sky and haze-free regions, each patch of a natural image contains at least one pixel that is dark. The validity of this observation is mainly motivated by the fact that natural images are colorful and full of shadows [3]. In this work we introduce a novel per-pixel method that aims at generalizing the previous dark-channel approach. During our experiments, in which we analyzed a large set of natural images degraded by haze, we observed that in haze-free and non-sky images, pixels in the neighborhood of dark pixels have a low intensity value in at least one color channel (r, g or b). On the dark channel, patches representing sky and hazy regions contain high values, as the local minimal intensity of such patches is high. Similarly, pixels in sky or hazy areas have high values in all color channels. These observations confirm the assumption that values in hazy image patches vary smoothly, except at depth discontinuities. Based on these observations, we introduce a direct haze detection algorithm that operates in a pixel-wise manner. We create a semi-inversed image Isi(x) = [I^r_si(x), I^g_si(x), I^b_si(x)], obtained by replacing the RGB values of each pixel x on a per-channel basis by the maximum of the initial channel value and its inverse:

I^r_si(x) = \max\left[I^r(x),\, 1 - I^r(x)\right]
I^g_si(x) = \max\left[I^g(x),\, 1 - I^g(x)\right]    (5)
I^b_si(x) = \max\left[I^b(x),\, 1 - I^b(x)\right]

for every pixel x ∈ I, where I^r(x), I^g(x) and I^b(x) represent the RGB channels of the considered image pixel x. Because the operations performed in equation (5) map the range of all pixels of the semi-inversed image Isi onto [0.5, 1], a renormalization is required. The hue disparity between the two images follows from the image characteristics described previously: in haze-free areas, at least one channel has a small value, so the operation replaces that value with its inverse; in regions of sky or haze, all channels have high values, so the max operation returns the original values. Therefore, by a direct comparison of the hue in the semi-inverse and in the original image, we are able to find the pixels that need to be restored, while preserving a color appearance similar to the original one. As illustrated in Fig. 3, this simple operation produces a semi-inversed image Isi in which hazy areas are rendered with enhanced contrast, while the unaltered areas appear as the inverse of the initial image. To identify the regions affected by haze, we compute the difference between the hue channels of the original image I and Isi, and threshold it using a predefined value τ. The value of τ facilitates the selection of those pixels that present a similar aspect in both the initial and the semi-inverse version. The results in this paper have been generated with the default value τ = 10°. Only pixels that have a hue disparity below this threshold τ are labeled as hazy pixels.
Fig. 3. Haze detection. The first row shows the original hazy images I. In the second row, we show the yielded semi-inversed image Isi . Finally, in the third row, we label the pixels identified as not under the influence of haze with a blue mask. In these regions, the intensity of the blue color is proportional with the hue disparity.
Fig. 4. Results from applying our haze detection procedure on a large data set of images. Overall, haze-free images contain 96% pixels labeled as haze-free (masked in blue), while hazy and sky images are characterised by a significant decrease in haze-free pixels (less than 13%).
In our approach, the hue information is represented by the h∗ channel after the image is transformed into the perceptual CIE L∗c∗h∗ color space. By applying this simple strategy, we are able to estimate the hazy regions with acceptable precision. In order to check the validity of our observation, we collected a large database of natural images from several publicly accessible photo sites (e.g. Flickr.com, Picasaweb.com, Photosig.com). Note that all the selected images have been taken in daylight conditions. We defined three main categories of outdoor images: haze-free images without sky, sky images, and hazy images. After manually selecting 800+ images for each of these classes, we evaluated the variation of the hue using the strategy previously described. The main conclusion is that the haze-free images are characterized by a vast majority of pixels affected by significant hue variations, while in the other two categories this variation is considerably smaller. We illustrate this in Fig. 4. In order to differentiate between the latter two categories, it is possible to detect sky regions using existing techniques [17]. For a comparison of our haze detection component with the dark channel method of He et al. [3], we refer to the discussion section.
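The detection step described above can be summarised in a few lines of code. The sketch below is only an illustration of the idea under stated assumptions: it uses scikit-image's rgb2lab conversion, takes the hue angle h∗ = atan2(b∗, a∗), and the simple rescaling of the semi-inverse from [0.5, 1] back to [0, 1] stands in for the renormalisation mentioned above; haze_mask is our own hypothetical name.

```python
import numpy as np
from skimage.color import rgb2lab

def haze_mask(I, tau_deg=10.0):
    """Per-pixel haze detection by hue disparity between an image and its
    semi-inverse (a sketch of the procedure, not the authors' code).

    I : RGB image as float array in [0, 1], shape (H, W, 3).
    Returns a boolean mask that is True for pixels labelled as hazy.
    """
    # Semi-inverse: per channel keep max(value, 1 - value), Eq. (5),
    # then rescale the [0.5, 1] range back to [0, 1] (assumed renormalisation).
    I_si = np.maximum(I, 1.0 - I)
    I_si = (I_si - 0.5) * 2.0

    def hue_deg(img):
        lab = rgb2lab(img)                                        # L*, a*, b*
        return np.degrees(np.arctan2(lab[..., 2], lab[..., 1]))   # h* of L*c*h*

    # Circular hue difference in degrees.
    dh = np.abs(hue_deg(I) - hue_deg(I_si))
    dh = np.minimum(dh, 360.0 - dh)

    # Hazy pixels keep (almost) the same hue after semi-inversion.
    return dh < tau_deg
```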
5 Our Approach
5.1 Airlight Color (A∞) Estimation
One important correlation for dehazing algorithms constitutes the relation between optical depth and airlight [11, 18]. The airlight A becomes more dominant as the optical depth increases. The optical model (equation 1) reveals the fact that two objects with different reflectance properties, located at the same distance from the observer, have identical airlight gains. Consequently, when observing the values of A in a small area around a scene point, they usually show only minor variations, except when depth discontinuities occur. Moreover, the A∞ constant can be acquired from the areas with the highest additive contribution, which are commonly the areas of the image characterized by high intensity. These properties of the hazy images have been exploited as well in the previous approaches to estimate the airlight constant A∞ . As observed by Narasimhan and Nayar [9], this constant is best estimated in the most haze-opaque areas. He et al. [3] choose the 0.1% brightest pixels of the dark channel as their preferred region. Another approach [2] is to search for this component in regions with the highest intensity, assuming that the sky is present and that there are no saturated pixels in the image. The key advantage of our approach is that we are able to clearly identify hazy regions. As explained in section 4, these regions are identified in a straightforward manner by observing the hue disparity between the image and its semi-inverse. In order to mask the most haze-opaque areas, we perform the same procedure, but with the intensity of the semi-inverse increased by a factor ξ (with a default value of ξ = 0.3). During our experiments, we found that for images where the sky is present, the resulted mask contains mostly the sky region, which decreases
the searching space. The extraction of the airlight color vector A∞ is performed by determining the brightest pixel only in the positive (unmasked) region (see Fig. 3). The winning value of A∞ is extracted from the original foggy image at the same location as this brightest pixel. This approach has proven to be more robust than simply searching for the brightest pixels in the entire image.
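One possible reading of this airlight estimation procedure is sketched below; estimate_airlight() and the reuse of the hypothetical haze_mask() helper from the previous listing are our own constructions, and brightening the input before semi-inversion (rather than brightening the semi-inverse itself) is an implementation assumption.

```python
import numpy as np

def estimate_airlight(I, xi=0.3, tau_deg=10.0):
    """Estimate the airlight colour A_inf (a sketch under stated assumptions)."""
    I_bright = np.clip(I + xi, 0.0, 1.0)        # intensify by xi before the hue test
    opaque = haze_mask(I_bright, tau_deg)       # most haze-opaque pixels

    gray = I.mean(axis=2)
    gray = np.where(opaque, gray, -1.0)         # restrict the search to that region
    y, x = np.unravel_index(np.argmax(gray), gray.shape)
    return I[y, x, :].copy()                    # A_inf read from the original image
```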
5.2 Layer-Based Dehazing
The contrast within a dehazed image is directly correlated with the estimated airlight A and the inferred transmission t(x) at each point. Due to the physical additive property of the airlight (see equation (1)), it is possible to estimate the direct attenuation D once A∞ is known, by varying A = A∞ · (1 − t(x)) over all possible values of its range [2]. Previous approaches have introduced many constraints and cost functions that favor certain image characteristics based on local image patches, thus limiting this range and making it possible to compute an approximate transmission map. In previous strategies, the transmission map is commonly refined further using an energy minimization approach, based on the assumption that for local neighborhoods the airlight shows only very minor deviations (an assumption that breaks down at depth discontinuities). The main drawback of such approaches is that the employed search methods, even though they are commonly very expensive, are unable to ensure an accurate transmission map. In contrast, we present a fast method which segments the image into regions that do not contain abrupt depth discontinuities. Our strategy was inspired by the approach of Narasimhan and Nayar [9], which aims to normalize the pixels in so-called iso-depth neighborhoods. However, that method requires two pictures in order to identify such regions for its normalization operation. When using multiple images, it becomes possible to identify such iso-depth neighborhoods, as they are invariant to the weather conditions and do not contain sudden depth discontinuities [9]. There are many possible strategies to create a dehazed image after a per-pixel identification of hazy regions. In this work, we propose a layer-based method, which aims at preserving a maximum amount of detail, while still retaining sufficient speed. We initiate our algorithm by creating several new images Ii, with i ∈ [1, k] for k layers, in which we remove a progressively larger portion of the airlight constant color A∞ from the initial hazy image I:

Ii = I − ci · A∞    (6)
with an iteratively increasing airlight contribution factor ci. The parameter ci is a scalar value in the range [0, 1], with its value depending on the number of layers. After applying our haze detection operation on Ii, only the pixels with a sufficiently low hue disparity are labeled as being part of layer Li. In the absence of the scene geometry, the discretization of the image into k distinct layers enables us to estimate the values of ci that correspond to the most dominant depth layers of the scene. For instance, when the scene contains two objects located at different depths, the transmission map will be characterized by two dominant values (as the airlight is correlated with the distance). Finally, these layers are blended into a single composite haze-free image.
Fig. 5. Layer-based dehazing. Top line: the initial foggy image; the rough transmission map that corresponds to I0 ; the result of a naive method which simply pastes all layers Li upon each other, introducing artifacts; the result of our method, which applies soft blending of the layers. Middle line: the mask regions for each layer. Bottom line: the computed layers Li .
In order to smooth the transitions between the different layers, the number of extracted layers k needs to be at least 5. As can be observed in Fig. 5, every layer (except the first one) includes the pixels of the previous layer, but with a different level of attenuation. The results in this paper have been generated using a default value of 5 layers and the values ci = [0.2; 0.4; 0.6; 0.8; 1]. To obtain the haze-free image I0, the layers are blended in descending order of the airlight contribution. A naive approach would consist of simply copying the pixels from each layer onto the next, but this might generate unpleasing artifacts due to small discontinuities (see Fig. 5). In order to remove such undesirable transition artifacts, each layer contributes a small percentage to the next layer, according to the following equation:

I_0 = \sum_{i=1}^{k} \chi_i L_i.    (7)
where the scalar parameter χi weights the contribution of each layer's pixels, increasing exponentially with the layer number. Splitting the input image into non-uniform neighborhoods that contain approximately uniform airlight generates good results, even when using only a single image. In comparison, algorithms that are based upon fixed-size patches may introduce artifacts because these uniform patches do not consider the intensity distribution of the small region. Moreover, our straightforward pixel-based strategy is computationally more effective than the existing single image dehazing approaches (see the next section for comparative processing times).
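The layer construction and soft blending of equations (6)–(7) could be organised as in the sketch below. The exponential weights χi and the fallback used for pixels outside a layer's mask are assumptions on our part — the paper only states that χi grows exponentially with the layer index — and haze_mask() again refers to the hypothetical per-pixel detector sketched earlier.

```python
import numpy as np

def dehaze_layers(I, A_inf, c=(0.2, 0.4, 0.6, 0.8, 1.0), tau_deg=10.0):
    """Layer-based dehazing sketch for Eqs. (6)-(7), under stated assumptions."""
    k = len(c)
    chi = np.exp(np.arange(1, k + 1, dtype=float))
    chi /= chi.sum()                                   # assumed normalisation of chi_i

    base = np.clip(I - c[0] * A_inf, 0.0, 1.0)         # least attenuated layer
    I0 = np.zeros_like(I)
    for i, ci in enumerate(c):
        Li = np.clip(I - ci * A_inf, 0.0, 1.0)         # Eq. (6)
        member = haze_mask(Li, tau_deg)                # low hue disparity => part of L_i
        # Pixels not selected for this layer fall back to the base layer (assumption).
        Li = np.where(member[..., None], Li, base)
        I0 += chi[i] * Li                              # Eq. (7), soft blending
    return np.clip(I0, 0.0, 1.0)
```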
Fig. 6. Our technique is able to identify hazy regions on a per pixel basis. Comparing the haze mask produced by our technique to the dark channel mask of He et al. [3] clearly shows that we are able to preserve significantly more detail.
6 Discussion
6.1 Haze Detection
As we have stated in the introduction, we believe we are the first to present an algorithm for per-pixel haze detection. It could be argued that the algorithm of He et al. [3], an extension of the dark object technique common in the remote sensing community [6], could also be regarded as a method for haze area detection. However, in order to create the dark channel, their technique employs a patch-based approach which is unable to properly preserve detail. When taking a closer look at Fig. 6, the patch-like structure of the dark channel images immediately becomes apparent. The black regions associated with haze-free areas do not reflect the true haze-free area borders. It is important to note that these transitions need to be recovered properly, as edges between two regions characterized by large illumination differences can generate prominent halo artifacts. This is the result of using patches with a constant dimension, which do not consider the intensity distribution of the small region. In contrast, our method employs a pixel-wise multi-layer strategy, which decomposes the image into regions that are characterized by small illumination transitions. This approach does not suffer from halo artifacts, because the regions are non-uniform and respect the image intensity distribution.
6.2 Haze Removal
We have tested our approach on a large data set of natural hazy images. Figure 7 illustrates results (our dehazed image and the computed transmission map) obtained for three foggy images by our technique, compared to the methods of Tan [2],He et al. [3] and Kratz and Nishino [5]. All the figures presented in this paper and supplementary material contain the original restored images provided by the authors. As can be observed, we are able to enhance the images while retaining even very fine details. Furthermore, our method accurately preserves the color of the objects in the scene. Another set of images is provided in Fig. 9, where the top lines show comparative results obtained by the techniques of Fattal [1], He et al. [3], Tarel and Hautiere [4] and our method. It should be noted
Fig. 7. Comparative haze removal results. From left to right: initial foggy images, our restored results, our estimated transmissions (depth map) and the corresponding results obtained by Tan [2], He et al. [3] and Kratz and Nishino [5] respectively.
Fig. 8. From left to right: initial foggy image, the result obtained by Deep Photo [15] that employs additionally an approximated 3D model of the scene, the result of Tarel and Hautiere [4] and our result.
that in the discussion above, we have limited ourselves to images in which the scene is sufficiently illuminated. Even though the method generally performs well, for poorly lit scenes (an extreme case of this problem) our approach, like previous single image dehazing techniques, may have trouble accurately detecting the hazy regions. This limitation is due to the hypothesis of the considered optical model, which assumes that regions characterized by small intensity variations contain no depth discontinuities. As previously mentioned, our approach has the advantage of being faster than related algorithms. Our method, implemented on a CPU (Intel 2 Duo 2.00 GHz), processes a 600 × 800 image in approximately 0.013 seconds, making it suitable for real-time outdoor applications. In comparison, the technique of Tarel and Hautiere [4] processes a 480 × 600 image in 0.2 seconds.
Fig. 9. Evaluation of the results using the IQA metric. The top two lines present the comparative results obtained by Fattal [1], He et al. [3], Tarel and Hautiere [4] and our method. The bottom two lines show the results after applying the IQA between initial image and the restored version. The left bottom table presents the ratio of the color pixels (with a probability scale higher than 70%) counted for each method.
However, as can be observed in Fig. 8, the patch-based method of Tarel and Hautiere [4] risks introducing artifacts close to the patch transitions and distorting the global contrast. The method of Tan [2] requires more than 5 minutes per image, while the technique of Fattal [1] processes a 512 × 512 image in 35 seconds. The algorithm of He et al. [3] takes approximately 20 seconds per image, while the computation times of the technique of Kratz and Nishino [5] were not reported.
6.3 Evaluating Dehazing Methods
Since there is no specialized evaluation procedure of the dehazing techniques we searched the recent literature for an appropriate method for this task. Tarel and Hautiere [4] evaluate the quality of dehazing techniques based on a visibility
resaturation procedure [19]. Because this procedure only applies to grayscale images and is mainly focused on finding the most visible edges, we searched for a more general method that is able to perform a pixelwise evaluation of the dehazing process. In this work, we have employed the Image Quality Assessment (IQA) quality measure introduced recently by Aydin et al. [20], which compares images with radically different dynamic ranges. This metric, carefully calibrated and validated through perceptual experiments, evaluates both the contrast and the structural changes yielded by tone mapping operators. The IQA metric is sensitive to three types of structural changes: loss of visible contrast (green) - contrast that was visible in the reference image is lost in the transformed version, amplification of invisible contrast (blue) - contrast that was invisible in the reference image becomes visible in the transformed image, and reversal of visible contrast (red) - contrast visible in both images, but with different polarity. As a general interpretation, contrast loss (green) has been related with image blurring, while contrast amplification (blue) and reversal (red ) have been connected to image sharpening. Since these modifications are closely related to our problem, we have found this measure to be more appropriate as a means to evaluate the resaturation of hazy regions after applying different dehazing techniques. Figure 9 shows the comparative results of applying the IQA metric on two foggy images and their dehazed versions, using the method of Fatal [1], He et al. [3], Tarel and Hautiere [4] and ours. The bottom-left table of Fig. 9 displays the comparative ratios of the (colored) pixels yielded by the IQA measure when applied to the results of the considered dehazing methods. Following the recommendation of Aydin et al. [20], in order to reduce the possibility of misclassification, only the pixels with a probability scale higher than 70% have been considered. Based on the results from the table, it becomes clear that compared with the other techniques, the structural changes yielded by our algorithm are more closely related to sharpening operations (blue and red pixels) and less related with blurring (green pixels).
7 Conclusions
In this work, we have presented a single-image dehazing strategy which does not make use of any additional information (e.g. images, hardware, or available depth information). Our approach is conceptually straightforward. Based on a per pixel hue disparity between the observed image and its semi-inverse, we are able to identify the hazy regions of the image. After we have identified these regions, we are able to produce a haze-free image using a layer-based approach. The processing time of our technique is very low, even when compared to previous methods who were designed and optimized for speed. In the future, we will be investigating a more comprehensive optical model, as well as extending our work to the problem of video dehazing.
References 1. Fattal, R.: Single image dehazing. ACM Transactions on Graphics, SIGGRAPH (2008) 2. Tan, R.T.: Visibility in bad weather from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (2008) 3. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. In: IEEE Conference on Computer Vision and Pattern Recognition (2009) 4. Tarel, J.P., Hautiere, N.: Fast visibility restoration from a single color or gray level image. In: IEEE International Conference on Computer Vision (2009) 5. Kratz, L., Nishino, K.: Factorizing scene albedo and depth from a single foggy image. In: IEEE International Conference on Computer Vision (2009) 6. Chavez, P.: An improved dark-object subtraction technique for atmospheric scattering correction of multispectral data. Remote Sensing of Environment (1988) 7. Moro, G.D., Halounova, L.: Haze removal and data calibration for high-resolution satellite data. Int. Journal of Remote Sensing (2006) 8. Narasimhan, S., Nayar, S.: Chromatic Framework for Vision in Bad Weather. In: IEEE Conference on Computer Vision and Pattern Recognition (2000) 9. Narasimhan, S., Nayar, S.: Contrast Restoration of Weather Degraded Images. IEEE Trans. on Pattern Analysis and Machine Intell. (2003) 10. Schaul, L., Fredembach, C., Ssstrunk, S.: Color image dehazing using the nearinfrared. In: IEEE Int. Conf. on Image Processing (2009) 11. Treibitz, T., Schechner, Y.Y.: Polarization: Beneficial for visibility enhancement? In: IEEE Conference on Computer Vision and Pattern Recognition (2009) 12. Shwartz, S., Namer, E., Schechner, Y.: Blind haze separation. In: IEEE Conference on Computer Vision and Pattern Recognition (2006) 13. Namer, E., Shwartz, S., Schechner, Y.: Skyless polarimetric calibration and visibility enhancement. Optic Express, 472–493 (2009) 14. Narasimhan, S., Nayar, S.: Interactive de-wheathering of an image using physical models. In: ICCV Workshop CPMVC (2003) 15. Kopf, J., Neubert, B., Chen, B., Cohen, M., Cohen-Or, D., Deussen, O., Uyttendaele, M., Lischinski, D.: Deep photo- Model-based photograph enhancement and viewing. ACM Transactions on Graphics (2008) 16. Koschmieder, H.: Theorie der horizontalen sichtweite. In: Beitrage zur Physik der freien Atmosphare (1924) 17. Tao, L., Yuan, L., Sun, J.: SkyFinder: Attribute-based Sky Image Search. ACM Transactions on Graphics, SIGGRAPH (2009) 18. Henry, R.C., Mahadev, S., Urquijo, S., Chitwood, D.: Color perception through atmospheric haze. Opt. Soc. Amer. A 17, 831–835 (2000) 19. Hautiere, N., Tarel, J.P., Aubert, D., Dumont, E.: Blind contrast enhancement assessment by gradient ratioing at visible edges. Img. Anal. and Stereology (2008) 20. Aydin, T.O., Mantiuk, R., Myszkowski, K., Seidel, H.S.: Dynamic range independent image quality assessment. In: ACM Trans. Graph., SIGGRAPH (2008)
Salient Region Detection by Jointly Modeling Distinctness and Redundancy of Image Content Yiqun Hu, Zhixiang Ren, Deepu Rajan, and Liang-Tien Chia School of Computer Engineering, Nanyang Technological University, Singapore {yqhu,renz0002,asdrajan,asltchia}@ntu.edu.sg
Abstract. Salient region detection in images is a challenging task, despite its usefulness in many applications. By modeling an image as a collection of clusters, we design a unified clustering framework for salient region detection in this paper. In contrast to existing methods, this framework not only models content distinctness from the intrinsic properties of clusters, but also models content redundancy from the removed content during the retargeting process. The cluster saliency is initialized from both distinctness and redundancy and then propagated among different clusters by applying a clustering assumption between clusters and their saliency. The novel saliency propagation improves the robustness to clustering parameters as well as retargeting errors. The power of the proposed method is carefully verified on a standard dataset of 5000 real images with rectangle annotations as well as a subset with accurate contour annotations.
1 Introduction
Detecting visually salient regions from an image has been proved to be useful in various multimedia and vision applications like content based image retrieval [1, 2], image adaptation [3, 4], and image retargeting [5]. The success of these applications that use image saliency can be attributed to the fact that saliency allows the application algorithm to focus on processing the 'important' regions only. However, salient region detection itself is still a challenging problem, since the visual attention mechanism of the human visual system has not been fully understood. The ambiguity of a precise notion of what visual saliency is compounds the problem. Since it is not possible to exhaustively discuss all the saliency detection methods available in the literature, we highlight some of the recently reported methods that have been shown to be effective. Feature contrast-based methods, e.g. [6, 7], model saliency as different forms of centre-surround contrasts. Information-based methods apply local entropy [8], information gain [9] and incremental coding length [10] to model saliency. Recently, saliency has also been modeled in the frequency domain. For example, [11] calculate the spectral residual of an image to indicate saliency. Similarly, [12, 13] define saliency as the difference between a frequency-tuned image and its mean value. Instead of modeling saliency in a local context, some saliency detection techniques model saliency in a global sense; these include the robust subspace method
[14], graph-based saliency [15] and the salient region detection method of [16] which models color and orientation distributions. Moreover, [17] combine local features (saliency) and the global (scene-centered) features in a Bayesian framework to predict the regions fixated by observers. In this paper, we wish to investigate if the knowledge of redundant content can help to improve the accuracy of salient region detection. This knowledge can be obtained from image retargeting technique e.g. [18] which generates a retargeted image of specific size from the input image by maximizing the similarity of any pair of neighboring pixels between input and output. In this process, the pixels that are removed are redundant in the local context. Many image retargeting methods [5, 19] utilize the saliency map output from salient region detection methods to identify important regions that should not be removed. If the saliency map correctly identifies salient regions, retargeting results can be improved. However, generating a good saliency map itself is a challenge. Instead, the retargeting method [18] that does not use any saliency map focuses on removing the redundant content to generate a retargeted image with less distortion. Inspired by this process, we believe that redundant information can also assist in saliency detection. The difference as well as the similarity between retargeting and salient region detection motivate the work in this paper. We propose a clustering framework to detect salient regions by jointly modeling the distinctness and redundancy of image content. An image is first modeled as a collection of color clusters by fitting a Gaussian Mixture Model (GMM). Saliency for each cluster is initialized by combining the distinctness and the redundancy of cluster data. The distinctness is estimated from intrinsic color properties and the redundancy is estimated from the removed content of retargeting process. This initial saliency model is further improved by propagating cluster saliency according to the clustering assumption that forces similar clusters to have similar visual saliency. An analytical solution is given to obtain the final cluster saliency. The novel saliency propagation improves the robustness to over-clustering as well as retargeting errors. The saliency of image pixels can be computed from the cluster saliency in a probabilistic manner. Compared with existing methods for salient region detection, two major contributions of this paper are as follows: – To the best of our knowledge, the proposed method is the first one to utilize retargeting result to incorporate redundancy for salient region detection; – The proposed saliency propagation enforces the clustering assumption on clusters and saliency, which improves the robustness to clustering and retargeting error; The performance of the proposed method is tested on a standard dataset of 5000 images in [20] using rough rectangle-based annotations. We also conduct a careful comparison of different methods on a subset of 1000 images [12] with accurate contour-based annotations. The rest of this paper is organized as follows. After introducing the GMM clustering of image data in Sec 2, the initialization of cluster saliency is described
in Sec 3, which includes both distinctness and redundancy estimations. We then propose the propagation of cluster saliency in Sec 4 which can be solved analytically. In Sec 5, a comprehensive evaluation and comparison on the standard dataset is conducted. Finally, the proposed method is concluded in Sec 6.
2 Probabilistic Clustering by GMM
Given an image, we directly model its color content by clustering the RGB color values of all pixels using a Gaussian Mixture Model (GMM). By using k-means clustering, we initialize cluster centers as well as the corresponding covariance matrices. The data fitting for GMM is done using the standard Expectation Maximization (EM) algorithm [21]. Starting from initial values, wk , μk and Σk are iteratively updated until convergence as follows:
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)(x_i - \mu_k)(x_i - \mu_k)^T, \qquad w_k = \frac{N_k}{N},    (1)

where

\gamma_k(x_i) = \frac{w_k \cdot g(x_i \mid \mu_k, \Sigma_k)}{\sum_{k=1}^{K} w_k \cdot g(x_i \mid \mu_k, \Sigma_k)}    (2)

represents the probability of x_i belonging to the k-th cluster and N_k = \sum_{i=1}^{N} \gamma_k(x_i). Note that each x_i is soft-assigned to the k-th cluster with probability γ_k(x_i).
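In practice the EM updates of equations (1)–(2) do not need to be re-implemented; a library sketch such as the one below, based on scikit-learn's GaussianMixture (which performs this kind of EM fitting with k-means initialisation), returns both the fitted parameters and the soft assignments γk(xi). The function name and the choice of a full covariance matrix are our own assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_gmm(image, K=10):
    """Fit a K-component GMM to the RGB values of all pixels (illustrative sketch).

    image : float RGB image of shape (H, W, 3).
    Returns the fitted model and the N x K responsibility matrix gamma.
    """
    X = image.reshape(-1, 3).astype(np.float64)        # N x 3 colour samples
    gmm = GaussianMixture(n_components=K,
                          covariance_type='full',
                          init_params='kmeans',        # k-means initialisation, as in the paper
                          max_iter=200)
    gmm.fit(X)
    gamma = gmm.predict_proba(X)                       # soft assignments gamma_k(x_i)
    return gmm, gamma
```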
3 Cluster Saliency Initialization
After modeling multiple clusters from the image content, we initialize the saliency of each of them. As mentioned before, whether a cluster is salient can be estimated from two aspects - its distinctness as well as the redundancy of its content. A salient cluster is supposed to be distinct from other clusters and to contain little redundancy. Otherwise, it is not salient. We describe how to model these two aspects for every cluster in the rest of this section.
3.1 Distinctness Estimation
We follow [16] to model the distinctness of a cluster from its color properties. Specifically, the cluster isolation in the feature space and the compactness in the
spatial domain are used. Feature isolation measures the distinction of a cluster with respect to others. The inter-cluster distances in feature space are computed as

d^f_{ik} = \frac{\sum_{x \in X} |x - \mu_k|^2 \, \gamma_i(x)}{\sum_{x \in X} \gamma_i(x)},    (3)

where X is the set of all RGB data points. The isolation of the i-th cluster is the sum of its inter-cluster distances to all clusters, i.e.,

ISO_i = \sum_{k=1}^{K} d^f_{ik}.    (4)
The spatial distribution of each cluster is another cue for measuring its distinctness. The less a cluster is spread, the higher its distinctness will be. For example, the clusters corresponding to the background are often spread out. Similar to isolation, the inter-cluster distances in the spatial domain are computed as

d^s_{ik} = \frac{\sum_{x^s \in X^s} |x^s - \mu^s_k|^2 \, \gamma_i(x)}{\sum_{x^s \in X^s} \gamma_i(x)},    (5)

where x^s indicates the spatial coordinates of data x and \mu^s_k indicates the spatial center of the k-th cluster. The compactness of a cluster is then defined as

COMP_i = \Big( \sum_{k=1}^{K} d^s_{ik} \Big)^{-1}.    (6)
The distinctness of the i-th cluster is finally defined as

Dis_i = ISO_i \cdot COMP_i.    (7)
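Equations (3)–(7) reduce to a few weighted averages; the vectorised sketch below is our own illustration of how the isolation, compactness and distinctness of every cluster could be computed from the soft assignments.

```python
import numpy as np

def cluster_distinctness(X, Xs, gamma):
    """Distinctness of every colour cluster (Eqs. (3)-(7)); an illustrative sketch.

    X     : N x 3 array of RGB values
    Xs    : N x 2 array of the corresponding pixel coordinates
    gamma : N x K soft assignments gamma_k(x_i) from the GMM
    """
    Nk = gamma.sum(axis=0)                                    # soft cluster sizes
    mu = (gamma.T @ X) / Nk[:, None]                          # colour means mu_k
    mu_s = (gamma.T @ Xs) / Nk[:, None]                       # spatial means mu^s_k

    # d^f_{ik}: weighted colour distance to every centre mu_k, Eq. (3)
    d2_f = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # N x K, |x - mu_k|^2
    ISO = (gamma.T @ d2_f / Nk[:, None]).sum(axis=1)          # Eq. (4)

    # d^s_{ik}: the same construction in the spatial domain, Eqs. (5)-(6)
    d2_s = ((Xs[:, None, :] - mu_s[None, :, :]) ** 2).sum(-1)
    COMP = 1.0 / (gamma.T @ d2_s / Nk[:, None]).sum(axis=1)

    return ISO * COMP                                         # Dis_i, Eq. (7)
```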
3.2 Redundancy Estimation
As mentioned earlier, image retargeting provides some knowledge about the redundancy of image content. This process maximizes the similarity between retargeted image and source image by detecting and removing locally redundant content. In this section, we analyze the distribution of removed content among all the clusters to measure the redundancy of image content. We obtain the set of M removed pixels X r = {xr1 , xr2 , · · · , xrM } by applying the retargeting method of [18] without using any saliency map. Note that different from [5] which defines an implicit ‘saliency’ from image gradients, [18] retargets image by optimizing the coordinate shifts of pixels in retargeted image instead of using saliency. When the shift of some pixel is non-zero, some redundant content in the local context are removed from source image, which may have either large or small gradient. The retargeting process is carried out to retarget image in both dimensions of width and height. Suppose Nr pixels are removed from each dimension, the total number of removed pixels is calculated as M = Nr ∗ W + Nr ∗ (H − Nr ), where W and H are the width and height of the source
Fig. 1. Illustration of retargeting process and the mask of removed pixels where the bright pixels are the pixels removed during retargeting
image respectively. The M removed pixels are used to estimate the redundancy of every cluster. These pixels correspond to the redundant content in the image. Figure 1 illustrates this retargeting process, where removed pixels are denoted as bright pixels in the mask. Because the retargeting method is not perfect, we only slightly reduce the image size to obtain the most redundant M pixels in the image. This improves the accuracy of the redundant information output from the retargeting process. Although it may occur that some of the removed pixels actually belong to a salient region, the redundancy of the clusters can still be correctly estimated by saliency propagation, which will be introduced in the next section. Every removed pixel x^r_m is distributed to the K color clusters according to the probabilities γ_i(x^r_m) from (2). By assigning all x^r_m into clusters, we can initialize the redundancy of the i-th cluster as

Red_i = \frac{\sum_{m=1}^{M} \gamma_i(x^r_m)}{\sum_{k=1}^{K} \sum_{m=1}^{M} \gamma_k(x^r_m)}.    (8)

The more removed pixels belong to a cluster, the more redundant the cluster is.
3.3 Initial Cluster Saliency
The initialized saliency of a cluster is designed as a combination of its distinctness and redundancy. Note that the contribution to saliency of distinctness (Dis) and redundancy (Red) is different. According to the inverse relationship between Dis and Red for visual saliency, we linearly combine them to unify cluster saliency:

Si = Disi − α · Redi.    (9)
We choose the combination factor α = min(Dis)/max(Red). This ensures that all Si are non-negative and normalizes Red into the range [0, min(Dis)]. It is noteworthy that this combination preserves the confidences of both measures. For example, if there is a very distinct salient cluster with a very high Dis value, then equation (9) will suppress the effect of Red in the combination because the highest Dis is
much higher than min (Dis). If there is no obviously distinct cluster in terms of Dis, the saliency value Dis for all clusters will be similar and therefore, equation (9) will increase the contribution of Red.
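Putting equations (8)–(9) together, the initial cluster saliency can be computed as in the following sketch; gamma_removed is assumed to hold the soft assignments of the M pixels discarded by the retargeting step, and the function name is our own.

```python
import numpy as np

def initial_cluster_saliency(Dis, gamma_removed):
    """Combine distinctness and redundancy into the initial saliency (Eqs. (8)-(9)).

    Dis           : K-vector of cluster distinctness values
    gamma_removed : M x K soft assignments gamma_k(x^r_m) of the removed pixels
    """
    Red = gamma_removed.sum(axis=0)
    Red = Red / Red.sum()                     # Eq. (8): each cluster's share of removed content
    alpha = Dis.min() / Red.max()             # keeps every S_i non-negative
    return Dis - alpha * Red                  # Eq. (9)
```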
4 Cluster Saliency Propagation
The initial cluster saliency is not accurate enough to be directly used to compute the visual saliency of pixels, for two reasons. First, how to decide the number of clusters in the GMM is an open question, so over-clustering often occurs. If a color cluster is separated into multiple sub-clusters, distinctness cannot be accurately estimated. Second, since only a small number of pixels are removed by the retargeting process, the estimation of redundancy may be incomplete. For some non-salient clusters, only a few of the M removed pixels belong to them, and their redundancy is underestimated. In this section, we introduce a graph-based propagation to address this incomplete modeling.
4.1 Graph-Based Propagation
The idea behind graph-based propagation is to enforce the clustering assumption, i.e. similar clusters should have similar visual saliency. We build a complete graph G, where the i-th vertex indicates the i-th color cluster and each edge represents the pairwise similarity between the two corresponding color clusters. The pairwise cluster similarity can be obtained from the affinity matrix A of this graph, which is defined as follows:

A_{ij} = \exp\!\Big( -\frac{|\mu_i - \mu_j|^2}{2\sigma^2} \Big),    (10)
where σ is fixed as 255 and μ denotes the color mean of every cluster. We formulate the propagation of cluster saliency as an optimization problem. Denoting the final cluster saliency as S̃ = [S̃_1, S̃_2, · · · , S̃_K]^T, the following objective function is minimized to optimize S̃:

\min_{\tilde{S}} \; \sum_{i=1}^{K} \sum_{j=1}^{K} (\tilde{S}_i - \tilde{S}_j)^2 A_{ij} \;+\; \lambda \cdot \sum_{k=1}^{K} (\tilde{S}_k - S_k)^2.    (11)
The first part of (11) indicates that the optimal S̃ needs to satisfy the clustering assumption according to the affinity matrix A, while the second part of (11) indicates that the optimal S̃ needs to be as close as possible to the initial saliency S. The parameter λ controls the trade-off between these two parts. For all the experiments in this paper, we set λ = 1. This optimization can be directly solved by setting the derivative with respect to S̃ equal to 0. The corresponding solution is

S̃ = λ · (L + λI)^{-1} S,    (12)
where L = D − A is the unnormalized Laplacian matrix of the graph G, calculated from the affinity matrix A and the diagonal degree matrix D (D_{ii} = \sum_j A_{ij}). The principle behind the optimal S̃ can be explained as propagating the initial cluster saliency S according to the similarity between clusters.
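The closed-form propagation of equation (12) amounts to a single K × K linear solve, as in the sketch below; following the paper we fix σ = 255, which assumes that the cluster colour means are expressed on an 8-bit scale, and the function name is our own.

```python
import numpy as np

def propagate_cluster_saliency(S, mu, sigma=255.0, lam=1.0):
    """Closed-form saliency propagation of Eq. (12): S_tilde = lam*(L + lam*I)^{-1} S.

    S  : K-vector of initial cluster saliencies
    mu : K x 3 cluster colour means (assumed in the 0-255 range, matching sigma=255)
    """
    K = len(S)
    diff2 = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # |mu_i - mu_j|^2
    A = np.exp(-diff2 / (2.0 * sigma ** 2))                    # affinity, Eq. (10)
    L = np.diag(A.sum(axis=1)) - A                             # unnormalised Laplacian L = D - A
    return lam * np.linalg.solve(L + lam * np.eye(K), S)       # Eq. (12)
```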
4.2 Pixel Saliency
Once the optimal cluster saliency S̃ is estimated, it is straightforward to compute the visual saliency of every pixel. Given a pixel, we can accumulate the saliency contributions from all clusters according to its cluster distribution γ(x_i) = [γ_1(x_i), . . . , γ_K(x_i)]^T:

PS(x_i) = S̃^T · γ(x_i).    (13)

The more a pixel belongs to salient clusters, the higher its visual saliency PS is. By normalizing all PS into the range of [0, 1], the saliency map of the image can be generated.
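Mapping the propagated cluster saliency back to pixels (equation (13)) is a single matrix–vector product; the min–max normalisation in the sketch below is one straightforward way, assumed here, of bringing PS into [0, 1].

```python
import numpy as np

def pixel_saliency(S_tilde, gamma, image_shape):
    """Project cluster saliency back to pixels (Eq. (13)) and normalise to [0, 1]."""
    ps = gamma @ S_tilde                                    # N-vector of PS(x_i)
    ps = (ps - ps.min()) / (ps.max() - ps.min() + 1e-12)    # assumed min-max normalisation
    return ps.reshape(image_shape[:2])                      # saliency map, H x W
```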
5 Experiment Evaluation
To evaluate the performance of the proposed method, we compare it with several methods including Itti's method (ITTI) [6], the Spectral Residual method (SR) [11], the Graph-based Saliency method (GB) [15], the Incremental Coding Length method (ICL) [10], the Frequency-tuned methods (FT-Matlab, FT-C) [12, 13] and the salient region detection method based on color and orientation distributions (COD) [16] for salient object detection. Note that the same color feature is used for all methods except for Itti's method. This allows a fair comparison across different methods without inheriting the effects of different features. Furthermore, we binarize the saliency maps using Otsu's thresholding technique. For efficiency, our method first downsizes the input image to 10%, then performs saliency detection, and finally scales the saliency map back to the original size for evaluation. In all the experiments, we fix the number of clusters K = 10 and the number of removed pixels in each dimension Nr = 5. Note that the performance of our method is not sensitive to the parameter settings (e.g. when changing K from 7 to 13, Nr = 5 or Nr = 8, or the resizing factor to 20%, the performance only varies slightly). Such robustness is enhanced by the proposed saliency propagation. The first experiment is conducted on the MSRA Salient Object Database (Set B) proposed in [20]. It contains 5,000 images as well as the corresponding rectangular annotations of the salient region. For every image, the rectangle covering the salient object is labeled by nine users. The median of these nine rectangles is used as the ground truth of the salient region1. Similar to [12, 16], Precision, Recall and F-Measure (Eq. 14) are used to measure the goodness of the binarized saliency maps according to the ground truth:

F\text{-}Measure_{\alpha} = \frac{(1+\alpha) \cdot Precision \cdot Recall}{\alpha \cdot Precision + Recall}    (14)
where α is a positive real number controlling the importance of recall over precision. We set α = 0.3 similar to [12, 16] to weigh precision more than recall. The results are summarized in Table 1. We can see that the proposed algorithm is comparable to the state-of-the-art performance. It is interesting to note 1
The left/right/top/bottom edge of ground truth rectangle is the median of the corresponding edges of nine rectangles, respectively.
Table 1. Performance Evaluation on 5000 images based on rectangle ground truth

            ITTI    SR      GB      ICL     FT-Matlab  FT-C    COD     Ours
Precision   0.584   0.643   0.679   0.763   0.543      0.595   0.698   0.723
Recall      0.551   0.324   0.638   0.204   0.289      0.290   0.482   0.509
F-Measure   0.576   0.524   0.669   0.467   0.452      0.478   0.632   0.660
that ICL [10] achieves the highest precision but with lowest recall among all methods. This indicates that it is only suitable to detect small salient objects. For relatively large objects in the dataset, only some of the object pixels can be assigned high saliency values. This problem of ICL method is illustrated in Figure 2 (c) (second row last column). The recalls of SR, FT-Matlab and FT-C are also relatively low. Without combining with image segmentation methods like [12, 13], only a small part of the salient object can be detected from the saliency maps generated from these methods. Taking the image in Figure 2 (e) as an example, SR [11] method only assigns high saliency values to the edges of two persons while the saliency map output from FT-Matlab and FT-C [12, 13] methods have high saliency values only on some parts of persons. Although presegmentation can help to solve this issue, the errors in pre-segmentation will affect final result despite how accurate the saliency map is. COD method [16] and the proposed method do not have this issue because the saliency is estimated for every color cluster. The saliency maps output from these two methods not only assign high saliency values to edges but also internal parts of salient objects. The proposed method outperforms COD by about 3% because of the former’s ability to model redundant content. In other words, our method not only inherit the advantages of COD (e.g. Figure 2(b),(d),(f)) but also better suppresses the background area (e.g. Figure 2(a),(c),(e)) due to the incorporation of redundant information. Some readers may notice that the F-measure of our method is slightly lower than GB[15]. There are two reasons. (1) As mentioned by [12, 22], the ground truth rectangle of this dataset is not as accurate as a saliency map that follows the object contour faithfully. The performance on this dataset measures how well an algorithm satisfies the rectangle constraint rather than the accurate object shape. For example, smoothing the detected salient regions [15] could often achieve higher F-measures. (2) The saliency map is required to be binarized before calculating precision, recall and F-measure. A specific binarization method, even if performed for all algorithms, may be biased to some of them, making the comparison unfair. These two limitations make the evaluation biased towards some method e.g. [15], although we believe the proposed method generates best saliency maps to indicate salient regions. To verify the true performances of different methods, the second experiment is conducted based on the accurate object-contour based ground truth database of 1, 000 images proposed in [12]2 . The exact shapes of salient objects are used as ground truth instead of inaccurate rectangles. To avoid the bias of binarization, 2
http://ivrg.epfl.ch/supplementary_material/RK_CVPR09/index.html
Fig. 2. Examples of saliency maps generated from different methods. In every example, the original image, the ground truth contour mask and rectangle mask of salient object are shown in the first row. The second row shows the saliency maps generated from Itti’s method [6], SR method [11], GB method [15] as well as ICL method [10]. The third row shows the saliency maps generated from FT-Matlab, FT-C [12, 13], COD method [16] and our proposed method.
similar to [9, 15], the area under average Receiver Operating Characteristic (ROC) curve is used to estimate the performances of different methods. This metric thresholds the saliency map using different thresholding values. The average
Fig. 3. Average ROC curves of all methods on 1000 images with contour ground truths

Table 2. Corresponding areas under average ROC curves of all methods

Method  ITTI    SR      GB      ICL     FT-Matlab  FT-C    COD     Ours
Area    0.787   0.768   0.872   0.852   0.838      0.884   0.875   0.916
ROC curve is obtained by finding the specific thresholding values to achieve a set of fixed false positive ratios and get the corresponding true positive ratios. Figure 3 shows the average ROC curves of all methods and the corresponding areas under average ROC curve are listed in Table 2. This experiment provides a reliable verification that the proposed method outperforms all other methods. In terms of Area Under average ROC Curve (AUC), the proposed method is the only one whose AUC is larger than 0.9. In terms of average ROC curve, the curve of the proposed method is consistently above those of other methods. These facts indicate that the saliency map output from the proposed method has high consistency with the accurate ground truth of images. It is also interesting to notice that both COD method and the proposed method achieve the highest true positive ratio when the the false positive ratio is 0. This indicates that more pixels belonging to a salient object have higher saliency values than those not belonging to salient objects. This is because the clustering framework allows to assign high saliency values to more salient object pixels as they belong to salient clusters with higher probabilities. When false positive ratio is increased, the average ROC curve of COD method is below that of the proposed method. This fact empirically proves the usefulness of redundant ( non-salient ) content modeling in the proposed method. When false positive ratio is fixed, more pixels belonging to salient objects can be obtained from the proposed method due to the suppression of non-salient parts in the images. To investigate the contribution of each component in our model, we conduct another experiment(denoted as “Dis+Red”) on the database of 1000 images. This model only combines the distinctness and the redundancy according to equation (9), and the AUC of it is 0.906. Compared this result with COD which only considers distinctness(“Dis”), the improvement of redundancy(“Red”) is 3 percentage
points, while the comparison between the "Dis+Red" model and the complete model ("Dis+Red+Propagation") illustrates the contribution of the propagation process; the corresponding improvement is 1 percentage point. The improvement due to saliency propagation appears less significant because the error of the redundancy is relatively small and the clustering results in our experiments are almost satisfactory. When heavy over-clustering occurs and there is a large error in the redundancy estimation, the improvement due to saliency propagation becomes significant. In summary, both the propagation process and the redundancy improve the saliency detection performance. Although the proposed method clearly outperforms the latest state-of-the-art methods, it can be further improved in two ways. First, we only use RGB features in this paper, which may miss salient objects similar in color to the background. However, the proposed method is a general framework that can incorporate different features; more advanced features, e.g. texture or geometric features, can be used to detect complex salient objects. Second, the redundant information provided by the retargeting process is not always accurate. These errors may degrade the performance of salient object detection in some cases. However, with the saliency map generated from the proposed method, the retargeting process can be further improved, which in turn enhances saliency detection with the help of more accurate redundant content.
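For reference, the average-ROC evaluation protocol used in this section can be sketched as follows; the function names and the grid of 101 fixed false-positive ratios are our own assumptions, since the paper does not specify how densely the curve is sampled.

```python
import numpy as np

def roc_points(sal, gt, fpr_grid=np.linspace(0.0, 1.0, 101)):
    """True-positive ratios of one saliency map at fixed false-positive ratios
    (an illustrative sketch of the protocol described above)."""
    pos, neg = sal[gt > 0], sal[gt == 0]
    # A threshold admitting a given fraction of background pixels yields that FPR.
    thresholds = np.quantile(neg, 1.0 - fpr_grid)
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    return fpr_grid, tpr

def average_auc(sal_maps, gts):
    """Average ROC curve over a dataset and its area (trapezoidal rule)."""
    fpr = np.linspace(0.0, 1.0, 101)
    curves = [roc_points(s, g, fpr)[1] for s, g in zip(sal_maps, gts)]
    mean_tpr = np.mean(curves, axis=0)
    return np.sum((mean_tpr[1:] + mean_tpr[:-1]) / 2.0 * np.diff(fpr))
```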
6 Conclusion
In this paper, we propose a salient region detection method from only color information by joint modeling of both distinctness and redundancy. The distinctness identifies salient objects by modeling intrinsic properties (isolation and compactness) of clusters. In contrast, the redundancy suppresses the background by utilizing the removed content of the retargeting process. To the best of our knowledge, this is the first method in which redundancy information is used to model visual saliency. The combination of distinctness and redundancy is then propagated among color clusters to enforce the clustering assumption on cluster saliency. The graph-based propagation can not only handle over-clustering but also tolerate incomplete estimation of redundancy for each cluster. A careful evaluation and comparison on the standard dataset empirically proves that the proposed method outperforms the state-of-the-arts for salient object detection, which illustrates the power of joint modeling of distinctness and redundancy. Acknowledgement. This research was supported by the Media Development Authority under grant NRF2008IDM-IDM004-032.
References 1. Wang, X.J., Ma, W.Y., Li, X.: Data-Driven Approach for Bridging the Cognitive Gap in Image Retrieval. In: Proceedings of the 2004 IEEE International Conference on Multimedia and Expo., Taipei, Taiwan, vol. 3, pp. 2231–2234 (2004) 2. Fu, H., Chi, Z., Feng, D.: Attention-driven Image Interpretation with Application to Image Retrieval. Pattern Recognition 39, 1604–1621 (2006)
3. Chen, L.Q., Xie, X., Fan, X., Ma, W.Y., Zhang, H.J., Zhou, H.Q.: A Visual Attention Model for Adapting Images on Small Displays. ACM Multimedia Systems Journal (9) 353–364 4. Xie, X., Liu, H., Ma, W.Y., Zhang, H.J.: Browsing Large Pictures under Limited Display Sizes. IEEE Transaction on Multimedia 8, 707–715 (2006) 5. Avidan, S., Shamir, A.: Seam Carving for Content-Aware Image Resizing. ACM Transactions on Graphics (Proceedings SIGGRAPH 2007) 26 (2007) 6. Itti, L., Koch, C., Niebur, E.: A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254–1259 (1998) 7. Ma, Y.F., Zhang, H.J.: Contrast-based Image Attention Analysis by Using Fuzzy Growing. In: Proceedings of the Eleventh ACM International Conference on Multimedia, New York, NY, USA, pp. 228–241 8. Kadir, T., Brady, M.: Scale, Saliency and Image Description. International Journal of Computer Vision 45, 83–105 (2001) 9. Bruce, N., Tsotsos, J.: Saliency Based on Information Maximization. In: Advances in Neural Information Processing Systems 18 10. Hou, X., Zhang, L.: Dynamic Visual Attention: Searching for Coding Length Increments. In: Advances in Neural Information Processing Systems 21 11. Hou, X., Zhang, L.: Saliency Detection: A Spectral Residual Approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, USA (2007) 12. Achanta, R., Estrada, S.H.F., Susstrunk, S.: Frequency-tuned Salient Region Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Florida, USA (2009) 13. Achanta, R., Susstrunk, S.: Saliency Detection for Content-aware Image Resizing. In: Proceedings of the IEEE Conference on Image Processing, Cairo, Egypt (2009) 14. Hu, Y., Rajan, D., Chia, L.T.: Robust Subspace Analysis for Detecting Visual Attention Regions in Images. In: Proceedings of the Thirteenth ACM International Conference on Multimedia, Singapore, pp. 716–724 15. Harel, J., Koch, C., Perona, P.: Graph-Based Visual Saliency. In: Advances in Neural Information Processing Systems 19 16. Gopalakrishnan, V., Hu, Y., Rajan, D.: Salient Region Detection by Modeling Distributions of Color and Orientation. IEEE Transaction on Multimedia 11, 892– 905 (2009) 17. Torralba, A., Oliva, A., Castelhano, M.S., Henderson, J.M.: Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review 113, 766–786 (2006) 18. Pritch, Y., Venaki, E.K., Peleg, S.: Shift-Map Image Editing. In: Proceedings of the IEEE International Conference on Computer Vision (2009) 19. Wang, Y.S., Tai, C.L., Sorkine, O., Lee, T.Y.: Optimized Scale-and-Stretch for Image Resizing. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH ASIA 2008) 27 (December 2008) 20. Liu, T., Sun, J., Zheng, N.N., Tang, X., Shum, H.Y.: Learning to Detect A Salient Object. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, USA (2007) 21. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via The EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977) 22. Wang, Z., Li, B.: A Two-stage Approach to Saliency detection. In: ICASSP, Las Vegas, Nevada, U.S.A (2008)
Unsupervised Selective Transfer Learning for Object Recognition Wei-Shi Zheng, Shaogang Gong, and Tao Xiang School of Electronic Engineering and Computer Science Queen Mary University of London, London E1 4NS, UK {jason,sgg,txiang}@dcs.qmul.ac.uk
Abstract. We propose a novel unsupervised transfer learning framework that utilises unlabelled auxiliary data to quantify and select the most relevant transferrable knowledge for recognising a target object class from the background, given very limited training target samples. Unlike existing transfer learning techniques, our method assumes neither that the auxiliary data are labelled nor that the relationships between target and auxiliary classes are known a priori. Our unsupervised transfer learning is formulated in a novel kernel adaptation transfer (KAT) learning framework, which aims to (a) extract general knowledge about how more structured objects are visually distinctive from cluttered background regardless of object class, and (b) more importantly, perform selective transfer of knowledge extracted from the auxiliary data to minimise the negative knowledge transfer suffered by existing methods. The effectiveness and efficiency of the proposed approach are demonstrated on a one-class object recognition (object vs. background) task using the Caltech256 dataset.
1
Introduction
Object recognition in unconstrained environments is a hard problem largely due to the vast intra-class diversity in object forms and appearance. Consequently, to learn a classifier for recognition, one typically needs hundreds or even thousands of training samples in order to account for the visual variability of objects within each class [1]. However, labelling a large number of training samples is not only expensive but also not always viable because of difficulties in obtaining data samples for rare object classes. Inspired by humans' ability to learn new object categories from very limited samples [2], recent studies have started to focus on transfer learning [3–8], in which a model is designed to transfer object class knowledge from previously learned models to newly observed images of either the same object class or different classes. The overall aim of transfer learning is to maximise the available information about an object class given very sparse training samples by utilising shared and relevant knowledge from other object classes and/or from the same class with significant intra-class appearance variation. Two nontrivial challenges surface: how to quantify and compute 'shared relevant' inter- and intra-class object appearance knowledge, and under what assumptions.
In this paper, we consider the object transfer learning for the extensively studied one-class recognition [3, 7, 9], which learns a model that is able to detect different and unknown object categories (target categories) against cluttered background. We assume that only very limited target training samples but a number of unlabelled images of other non-background object categories (auxiliary categories) are available, where the “unlabelled” means the exact labels of auxiliary data and the relationships such as the hierarchical category structure between the auxiliary and target categories are unknown. The proposed model addresses two significant limitations of existing transfer learning methods. First, most existing methods require that auxiliary classes are labelled and the relationships between target and auxiliary classes are known a priori, e.g. cross-domain but from the same categories [10–13] or cross-category but relevant using hierarchy category structure [9] (e.g. detecting giraffes using other four-leg animals as auxiliary data). However, identifying the relationships between a target category and auxiliary categories is non-trivial and not always possible – relevant object categories with shared knowledge to the target class may not be available in the auxiliary data. Although a few recent studies demonstrate that this assumption can be relaxed [3, 7], these methods are fully supervised so the problem is somewhat averted rather than solved as they all require a costly exhaustive labelling of all the auxiliary data to mitigate unknown relationships. However, exhaustive labelling is not always available nor viable. Second, no measures are taken by existing transfer learning techniques to avoid negative knowledge transfer [13, 14], which reduces rather than enhances the performance of an object recogniser learned from target class samples alone. Clearly there is a need for quantifying the usefulness of auxiliary knowledge and therefore explicitly selecting the relevant one in order to significantly alleviate negative transfer. To overcome these limitations of existing models, we propose an unsupervised transfer learning model capable of 1) identifying automatically transferrable knowledge to be extracted if both class labels of the auxiliary data and any relationships between the auxiliary and target object classes are unknown, and 2) more important, quantifying and selecting the most effectively relevant auxiliary knowledge (not all are relevant) and combining them with target specific knowledge learned from very limited training samples in constructing optimal target object recognisers. To that end, our unsupervised transfer learning is formulated in a novel Kernel Adaptation Transfer (KAT) learning framework. For the first objective, we exploit an observation that regardless how dissimilar the auxiliary object categories may be from a target category, there is shared general knowledge (structural characteristics) of objects, which are visually distinctive from less structured and more random background clutters. To extract such general knowledge from unlabelled auxiliary data, we perform clustering on the auxiliary data and the general knowledge is computed as an ensemble of local auxiliary knowledge that distinguishes objects from each cluster from cluttered background. For the second objective, we represent the extracted auxiliary knowledge using multiple kernel functions, and combine them with
any target specific knowledge represented by a kernel function learned directly from a handful of target class samples. Critically, since the auxiliary data classes may be ineffective to the target class which we have no knowledge of a priori, the usefulness of all auxiliary kernel functions is evaluated automatically and only relevant kernels are selected by our transfer kernel adaptation framework, in which a hypothesis constraint between the weight of target kernel and the weights of auxiliary kernels and a partial sparsity constraint are introduced in order to minimise negative knowledge transfer. In summary, the proposed unsupervised kernel adaptation transfer learning model makes two main contributions: (1) it provides a novel way of extracting general auxiliary knowledge from unlabelled auxiliary categories; (2) it significantly alleviates negative transfer due to its ability of quantifying and selecting transferrable knowledge. Our results on Caltech256 dataset [15] demonstrate that: (a) Our method is able to extract selectively useful transferrable knowledge from unlabelled auxiliary data to improve the recognition performance of each individual target object model trained by very limited samples. The selection is different for different target classes. (b) Our unsupervised transfer learning model outperforms existing unsupervised transfer learning model in terms of recognition performance, robustness to negative transfer and computational cost; and is comparable to or better than supervised transfer learning with fully labelled auxiliary data.
2
Related Work
Most existing transfer learning methods for object recognition are supervised (i.e. they require labelled auxiliary data), and make the assumption that the auxiliary classes must be related to each target object class. In [9], hierarchical category structure is needed for selecting auxiliary classes. Similarly, Stark et al. [6] transfer the variance of object shape from closely related object classes using a generative model inspired by Constellation model [16]. Alternatively, in [4, 5] a model-free approach is taken for which common attribute information has to be specified manually. In comparison, our KAT model does not assume any relationship between auxiliary object classes and a target class, and the knowledge learned and transferred by our method can be extracted from auxiliary object classes unrelated (i.e. not similar) to the target class. There have been a number of attempts to transfer general knowledge about object visual appearance using unrelated object categories. Fei-Fei et al. [7] formulated an one-shot Bayesian learning framework based on Constellation models. Models are learned from labelled auxiliary categories and the model parameters are averaged and used in the form of a prior probability density function in the parameter space for learning a target object class. As pointed out by Bart and Ullman [3], since only one prior distribution is used, this approach can be biased by the dominant or common auxiliary categories, and the method is unable to detect and prevent this from happening. The feature adaptation method proposed by Bart and Ullman [3] overcomes this problem by learning a
set of discriminative features from each labelled auxiliary category and selecting the features extracted from target category samples towards those discriminative auxiliary features. The adaptation is meaningful even if only one of 100 auxiliary categories is related to the target class. Similar to [3], our method can avoid the problem of bias towards dominant auxiliary categories. The difference is that we do not need to assume the availability of any related auxiliary categories since the knowledge transferred by our method is concerned with how objects in general can be distinguished from background. Moreover, no labelling is required for auxiliary data using our model. The most closely related work is the self-taught learning (STL) method proposed by Raina et al. [17] which also does not require labelling of auxiliary data, even though another unsupervised transfer learning method was also proposed recently [18]. The method in [18] is transductive and requires testing data to be involved in the training stage which may be considered less desirable. As compared to STL, our KAT model is superior in three ways. (1) The knowledge extracted by STL is a set of feature bases that best describe how objects look alike. In contrast our KAT model extracts the general knowledge about how objects are distinguishable from cluttered background. It is thus more suitable for the one-class object recognition task. This is supported by our experimental results shown in Section 5. (2) Most importantly, the feature bases learned by STL from auxiliary data are trusted blindly and used for target category sample representation without discrimination. In contrast, our kernel adaptation method provides a principled way of quantifying and selecting the most useful auxiliary information relevant to different target categories for discrimination. As a result, our KAT model is more robust against negative knowledge transfer. (3) The STL model is computationally much more expensive than our KAT model, especially on high-dimensional features, as l1 -norm based optimisation is required. Our kernel adaptation transfer (KAT) is a multiple kernel learning (MKL) method. KAT differs from the non-transfer MKL [19, 20] in that it is specially designed with two proposed constraints for unsupervised cross-category transfer learning to fuse a target kernel with a number of auxiliary kernels built based on lots of auxiliary data that may be irrelevant to target category. Though several work on kernel function learning has been exploited before for transfer learning [10, 11, 13, 21], KAT is formulated for unsupervised cross-category transfer learning, whilst [10, 11, 13, 21] assume that the target class data and auxiliary data are from different domains of the same category (e.g. news video footage from different countries) and they can neither be applied directly to nor easily extended for our cross-category unsupervised transfer learning task.
3
Kernel Adaptation for Knowledge Transfer
Let us first describe in this section our general kernel adaptation transfer learning framework before formulating a specific model for an object recognition problem in the next section. Given a training dataset {x_i^t, y_i^t}_{i=1}^N, where y_i^t ∈ {1, −1} is the label of x_i^t, we call y_i^t = −1 the negative class (e.g. background) and
y_i^t = 1 the target class (e.g. object). We wish to transfer knowledge extracted from an unlabelled (and possibly also irrelevant) auxiliary dataset {x_j^a} to the target class data, where {x_j^a} does not contain any samples of the target class. We consider that the auxiliary knowledge is represented by a set of auxiliary mappings f_s, s = 1, ..., m, which are learned from the auxiliary data such that f_s(x_i^t), s = 1, ..., m, are the candidate transferrable knowledge for each training target sample x_i^t. The form of the auxiliary mapping f_s(x_i^t) depends on the specific transfer learning problem; it could, for example, be a feature vector. Given any auxiliary mappings extracted from unlabelled auxiliary data, our objective is to quantify and select the transferrable ones and combine them with the data {x_i^t, y_i^t}_{i=1}^N in order to obtain better recognition performance. More specifically, the problem becomes learning a mapping g by combining f_s(x_i^t) and x_i^t in order to generate a new d-dimensional vector as follows:

g : (x_i^t, {f_s(x_i^t)}_{s=1}^m) −→ z_i^t ∈ R^d.    (1)

In this paper, we learn such a g via kernel methods. This is motivated because (a) rather than being explicitly defined, by formulating a kernel framework g can be implicitly induced by a Mercer kernel function, which we call the transfer kernel function in this paper; and (b) by optimising the transfer kernel, we obtain a principled way of quantifying and selecting transferrable knowledge, which is critical for minimising negative knowledge transfer. We now show why and how an unsupervised transfer learning problem can be formulated as a multi-kernel adaptation problem. From the statistical point of view, by computing a Mercer kernel matrix K = (κ(x_i^t, x_j^t))_{ij}, each entry κ(x_i^t, x_j^t) can be considered to be proportional to the underlying pairwise distribution of any two data points x_i^t, x_j^t in the training dataset, that is,

p(x_i^t, x_j^t) = C · κ(x_i^t, x_j^t),    (2)

where C is some distribution normalizer. Assuming that for generating p(x_i^t, x_j^t) there is a latent function variable f with probability density p(f), the distribution p(x_i^t, x_j^t) can then be expressed as:

p(x_i^t, x_j^t) = ∫_f p(x_i^t, x_j^t, f) = ∫_f p(x_i^t, x_j^t | f) p(f) = ∫_f p̃_f(f(x_i^t), f(x_j^t)) p(f) ≈ Σ_{f_s, s=0,...,m} p̃_{f_s}(f_s(x_i^t), f_s(x_j^t)) P(f_s),    (3)

where we define p̃_f(f(x_i^t), f(x_j^t)) = p(x_i^t, x_j^t | f) as a conditional density function with the function f imposed on the data, f_0 is the identity function (i.e. f_0(x) = x), and f_s, s = 1, ..., m, are the auxiliary mappings. The discrete approximation in Eqn. (3) is based on the assumption that the knowledge f_s(x_i^t) extracted from target and auxiliary data can be used to infer the underlying density function p(x_i^t, x_j^t). Note that since f_0 is the identity function, we allow p̃_{f_0}(x_i^t, x_j^t) to be different from p(x_i^t, x_j^t). This is because p̃_{f_0}(x_i^t, x_j^t) is an approximation of p(x_i^t, x_j^t), which is the unknown intrinsic density function of the target object class.
The above model provides a way to estimate the intrinsic density function by transferring auxiliary knowledge to the target class in the form of multi-kernel adaptation. Specifically, we let

p̃_{f_s}(f_s(x_i^t), f_s(x_j^t)) = C_s · κ_s(f_s(x_i^t), f_s(x_j^t)),  s = 0, ..., m,    (4)

for some distribution normalizer C_s with respect to kernel κ_s. Let b_s = C^{−1} · C_s · P(f_s), s = 0, ..., m; then we can replace p(x_i^t, x_j^t) in Eqn. (3) using Eqn. (2) and p̃_{f_s}(f_s(x_i^t), f_s(x_j^t)) using Eqn. (4). Eqn. (3) can thus be rewritten as:

κ(A_i^t, A_j^t) = b_0 · κ_0(x_i^t, x_j^t) + Σ_{s=1}^m b_s · κ_s(f_s(x_i^t), f_s(x_j^t)),    (5)

where we denote κ(A_i^t, A_j^t) = κ(x_i^t, x_j^t) with A_i^t = (x_i^t, {f_s(x_i^t)}_{s=1}^m), κ_0 is the kernel function on the data x_i^t itself, and κ_s is constructed using the transferrable knowledge f_s(x_i^t) extracted from the auxiliary data. We call κ the transfer kernel function. Learning optimal non-negative weights {b_s}_{s=0}^m from data is equivalent to learning an optimal transfer kernel matrix K = (κ(A_i^t, A_j^t))_{i,j} via the combination of the kernel matrices K_s = (κ_s(f_s(x_i^t), f_s(x_j^t)))_{i,j}, s = 0, ..., m. By exploiting the close relationship between the kernel value and the probability of a pair of data points, we learn an optimal transfer kernel matrix such that the entries of intra-class pairs in the kernel matrix are as large as possible while the other entries are as small as possible. Let b = (b_0, b_1, ..., b_m)^T; we thus have:

b = arg max_b  [ K(:)^T K_opt^+(:) ] / [ K(:)^T K_opt^−(:) + α · b^T b ],    (6)
s.t. b_s ≥ 0, s = 0, ..., m,  α > 0,

where K(:) represents a vectorised kernel matrix with K(:) = Ψb, Ψ = [K_0(:), K_1(:), ..., K_m(:)], and K_opt^+ and K_opt^− are defined as follows:

K_opt^+(i, j) = 1 if y_i^t = y_j^t, and 0 otherwise;   K_opt^−(i, j) = 0 if y_i^t = y_j^t, and 1 otherwise.    (7)

Note that the regularisation term α · b^T b is necessary in order to avoid learning a trivial b in which one entry of b is 1 and the others are zero. In this work, α is set to 0.01. Rather than directly learning b based on the above multiple kernel learning criterion, we further introduce two additional constraints to address an imbalanced kernel fusion problem, as follows. Different from most non-transfer MKL work [19, 20], our transfer learning combines a single target kernel (κ_0) with a large number of auxiliary kernels (κ_s, s ≥ 1). Hence ineffective or harmful auxiliary kernels could have a large effect on this kind of fusion, which would result in negative transfer. In order to balance the effect of the auxiliary kernels on the combined kernel, we introduce the two constraints described below.

Hypothesis Constraint. It constrains the order between the weight of the target kernel and the weights of the auxiliary kernels as follows:

b_0 ≥ b_s,  s ≥ 1,    (8)
which enforces that more weight is given to the kernel function built directly from the target class than to any auxiliary kernel. This is intuitive, as one wants to place more trust in the limited target class data than in any auxiliary mapping when the auxiliary data are unlabelled.

Partial Sparsity Constraint. As a second constraint, which is the most important, we impose the following partial l1-norm based sparsity penalty on the auxiliary kernel weights of b for minimisation:

b^T 1_0,  1_0 = [0, 1, 1, ..., 1]^T.    (9)

Note that this sparsity penalty is partial because it does not apply to the target class kernel. The objective of this partial sparsity constraint is to allow only a small portion of the relevant auxiliary kernels to be combined with the target class kernel. In our experiments, we demonstrate that these two constraints play a critical role in minimising negative knowledge transfer. With these two constraints, our proposed kernel adaptation transfer (KAT) model is formulated as:

b = arg max_b  [ K(:)^T K_opt^+(:) ] / [ K(:)^T K_opt^−(:) + α · b^T b + λ · b^T 1_0 ],    (10)
s.t. b_0 ≥ b_s ≥ 0, s = 1, ..., m,  α > 0,  λ ≥ 0.

Solving the above optimisation problem is nontrivial. However, by reformulating Criterion (10) as follows, a quadratic programming solver [22] can be exploited to find the solution:

b = arg min_b  b^T Ψ^T K_opt^−(:) + α · b^T b + λ · b^T 1_0,    (11)
s.t. b^T Ψ^T K_opt^+(:) = 1;  b_0 ≥ b_s ≥ 0, s = 1, ..., m,  α > 0,  λ ≥ 0,

where λ is a free parameter that controls the strength of the partial sparsity penalty (and thus how much auxiliary knowledge can be transferred). Specifically, the smaller λ is, the more auxiliary knowledge can be transferred to the target data; no auxiliary knowledge can be transferred if λ = +∞, which would set all weights b_s, s = 1, ..., m, on the auxiliary kernels κ_s to zero. Finally, we note that, aside from the two proposed constraints for transfer learning, KAT is closely related to the widely adopted kernel alignment method [23], which assumes that the optimal entry values of the kernel matrix K should be 1 for intra-class pairs and 0 otherwise. In comparison, Eqn. (10) relaxes this assumption and maximises the ratio between those two types of values. Due to the nature of the data distribution (e.g. Caltech 256), the strict assumption made by kernel alignment does not hold and could result in a combined kernel dominated by the large number of ineffective auxiliary kernels.
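To make Eqn. (11) concrete, the sketch below solves the reformulated problem with an off-the-shelf constrained optimiser. It is a minimal illustration under stated assumptions, not the authors' implementation: the kernel matrices `K_list` (the target kernel K_0 followed by the auxiliary kernels), the labels `y`, and the parameter values are assumed inputs, and SciPy's generic SLSQP routine stands in for the quadratic programming solver cited as [22].

```python
import numpy as np
from scipy.optimize import minimize

def learn_kat_weights(K_list, y, alpha=0.01, lam=0.1):
    """Sketch of Eqn. (11): learn the kernel weights b = (b_0, ..., b_m).

    K_list : list of (N, N) Gram matrices; K_list[0] is the target kernel K_0.
    y      : array of labels in {+1, -1} for the N training samples.
    """
    m1 = len(K_list)                                   # m + 1 kernels in total
    # Vectorise each kernel matrix: Psi has one column per kernel, so K(:) = Psi @ b.
    Psi = np.stack([K.ravel() for K in K_list], axis=1)
    same = (y[:, None] == y[None, :]).astype(float)
    Kp = same.ravel()                                  # K_opt^+ : 1 for intra-class pairs
    Km = (1.0 - same).ravel()                          # K_opt^- : 1 for the remaining pairs
    ones0 = np.ones(m1); ones0[0] = 0.0                # 1_0 = [0, 1, ..., 1]^T

    def objective(b):                                  # b^T Psi^T K_opt^-(:) + a b^T b + l b^T 1_0
        return b @ (Psi.T @ Km) + alpha * b @ b + lam * b @ ones0

    cons = [
        {'type': 'eq',   'fun': lambda b: b @ (Psi.T @ Kp) - 1.0},  # b^T Psi^T K_opt^+(:) = 1
        {'type': 'ineq', 'fun': lambda b: b[0] - b[1:]},            # hypothesis constraint b_0 >= b_s
        {'type': 'ineq', 'fun': lambda b: b},                       # b_s >= 0
    ]
    b0 = np.full(m1, 1.0 / (Psi.T @ Kp).sum())         # feasible uniform starting point
    return minimize(objective, b0, method='SLSQP', constraints=cons).x

# The learned weights define the transfer kernel of Eqn. (5):
# K_transfer = sum(w * K for w, K in zip(b, K_list)).
```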
4
Knowledge Transfer for Object Recognition
We now re-formulate the unsupervised transfer learning framework described in Section 3 into a practical model for the one-class object recognition problem. For such a problem, it is assumed that an image contains either one object from a target class or only background clutter [7, 9, 24]. In applying our transfer learning framework, we assume that for each target object category of interest there are only very limited training samples (from a single sample to no more than a dozen at best) and an auxiliary dataset containing a large number of different object categories, where the auxiliary images carry no class labels and could be irrelevant to the target class. In addition, we have a random set of cluttered background images, which can be obtained by simply searching for "Things" on Google Image [15]. For object representation, we first detect salient feature points using the Kadir and Brady detector [25]. A fixed number of feature points are then retained by thresholding the saliency value. The spatial distribution of those points is represented by a polar geometry structure with o orientational bins and r radial bins, as illustrated in Figure 1. This polar structure is automatically expanded from the center of the detected feature points, i.e. (B^{−1} Σ_{i=1}^B Z_x^i, B^{−1} Σ_{i=1}^B Z_y^i), where B is the number of feature points and (Z_x^i, Z_y^i) is the coordinate of the i-th feature point. The maximum value of the radius is set to d_max = max_i (||Z_x^i − B^{−1} Σ_{i=1}^B Z_x^i||_2 + ||Z_y^i − B^{−1} Σ_{i=1}^B Z_y^i||_2) so that all feature points are covered by the polar structure. For representing the appearance of the feature points, SIFT descriptors [26] are computed at each feature point, and for each bin in the polar structure a normalized histogram vector is constructed using a codebook with 200 codewords obtained using k-means. This gives us a 3400-dimensional feature vector (x_i^t in Eqn. (5)). Note that this polar object representation scheme is not only much simpler and easier to compute than Constellation-model based representations [16, 24, 27], but, more importantly, it is more suitable for our discriminative kernel adaptation transfer model. Given the training target and background data {x_i^t, y_i^t}_{i=1}^N, we now describe how the auxiliary mappings f_s(x_i^t) (see Eqn. (1)) are obtained. We formulate the auxiliary mappings to capture knowledge about how objects look different and distinguishable from any background. We thus formulate the mappings as decision boundaries between groups of objects and cluttered background, learned using a Support Vector Machine (SVM). More specifically, the unlabelled auxiliary data are clustered using an ensemble of k-means, i.e. multiple k-means runs with different k values, resulting in m clusters in total. In this way, the same auxiliary image can belong to different clusters simultaneously. This is intended to reflect the fact that an object can be distinguished from background in many different ways (e.g. colour, texture, shape). For each cluster, denoted G_s, a decision function h_s(x) is learned by a linear SVM. These local (cluster-specific) decision functions are then used as the mapping functions, i.e. f_s(x_i^t) = h_s(x_i^t). Note that the outputs of SVM classifiers are also used as features in [9, 20, 28], but with different objectives.
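As a concrete illustration of how such auxiliary mappings could be built from unlabelled auxiliary images, the following hedged sketch clusters auxiliary descriptors with an ensemble of k-means and trains one linear SVM per cluster against background descriptors. It uses scikit-learn; the feature matrices `X_aux` and `X_bg` (holding the 3400-dimensional descriptors described above) and the choice of k values are assumed inputs, not part of the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_auxiliary_mappings(X_aux, X_bg, k_values=(1, 5, 11, 23, 28, 45, 75)):
    """Return a list of decision functions f_s, one per cluster of the k-means ensemble.

    X_aux : (n_aux, d) descriptors of unlabelled auxiliary object images.
    X_bg  : (n_bg, d) descriptors of cluttered background images.
    """
    mappings = []
    for k in k_values:                               # ensemble of k-means with different k
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_aux)
        for c in range(k):                           # one cluster G_s per (k, c) pair
            cluster = X_aux[labels == c]
            if len(cluster) < 2:
                continue
            X = np.vstack([cluster, X_bg])
            y = np.hstack([np.ones(len(cluster)), -np.ones(len(X_bg))])
            svm = LinearSVC(C=1.0).fit(X, y)         # h_s: objects in G_s vs. background
            mappings.append(svm.decision_function)   # f_s(x) = h_s(x)
    return mappings
```

With the k values shown (which match the paper's experimental setting), the ensemble yields 188 clusters and therefore 188 candidate auxiliary mappings.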
Fig. 1. The polar structure used for object representation (o = 8, r = 3 and 17 bins)
Now with the m auxiliary mappings fs (xti ), we can quantify them and compute the kernel function κ(Ati , Atj ) in Eqn. (5) for the target category, which is a combination of multiple kernels, where we use RBF as the basis kernels. Finally, this kernel function is used to train a SVM for classifying the target object category against cluttered background.
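To show how the pieces could come together at recognition time, here is a hedged sketch that evaluates the combined kernel of Eqn. (5) with RBF basis kernels and trains the final classifier on the precomputed Gram matrix. The function names, the `gamma` value, and the reuse of `mappings` and the weight vector `b` from the earlier sketches are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def transfer_gram(X1, X2, mappings, b, gamma=1e-3):
    """K(A_i, A_j) = b_0 k0(x_i, x_j) + sum_s b_s ks(f_s(x_i), f_s(x_j)), RBF basis kernels."""
    K = b[0] * rbf_kernel(X1, X2, gamma=gamma)
    for w, f in zip(b[1:], mappings):
        F1 = f(X1).reshape(len(X1), -1)              # auxiliary mapping responses
        F2 = f(X2).reshape(len(X2), -1)
        K += w * rbf_kernel(F1, F2, gamma=gamma)
    return K

# Training and testing the one-class (object vs. background) recogniser:
# K_train = transfer_gram(X_train, X_train, mappings, b)
# clf = SVC(kernel='precomputed').fit(K_train, y_train)
# scores = clf.decision_function(transfer_gram(X_test, X_train, mappings, b))
```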
5
Experiments
Experiments were conducted on Caltech256 dataset [15]. It contains a broad ranges of objects from 256 object categories and a cluttered background set. In each experiment, we randomly selected 30 categories as target classes, and selected 10 images from each of the remaining 226 categories to form the auxiliary dataset and their labels were ignored in all the unsupervised transfer learning part of the experiments. This was repeated for 10 times with a different 30 target classes randomly drawn each time, giving 300 one-class object recognition tasks in total. For each target class, unless otherwise stated (as in Figure 2), 5 target images were randomly selected for training and the rest (always more than 80 images) were used for testing. The background set was randomly divided into 3 subsets: 20% of the total number of background images were used for learning the auxiliary kernels, 30% were used for learning the transfer kernel, and the rest 50% were for testing. For different target classes, the same training and testing background image sets were used. The recognition performance was measured using both equal error rate (EER) and receiver-operating characteristic (ROC) curve [16]. For object representation, we selected 300 feature points in each image according to their entropy saliency values [25]. In KAT, for clustering the auxiliary data using an ensemble of k-means, the value of k was set to {1, 5, 11, 23, 28, 45, 75} giving a total of 188 clusters. For λ in Eqn. (11), cross validation (CV) was performed when the number of training samples for target object (p) is larger than 1. Specifically, two-fold CV was used for p = 2 and three-fold CV was used for p = 5, 10, 15. Note that the training background images were also used for cross-validation. Then, when the target class training set contains a single sample (i.e. one shot learning), λ was set to 0 (i.e. all auxiliary knowledge is transferred) as cross validation becomes impossible on target object data. Kernel adaptation transfer vs. without transfer. We first compared the performance of our KAT model, learned using both target samples (up to 15) and the unlabelled auxiliary data, against a non-transfer model, namely a SVM
Fig. 2. KAT vs. the non-transfer model as the number of target training samples varies (1, 2, 5, 10, 15). From left to right: (a) Avg EER, (b) Pos Avg EER, (c) positive transfer (%) (#Pos), and (d) maximum reduced EER (see text for explanation).
Fig. 3. KAT model ROC performance curves (verification rate vs. false accept rate) for four example target classes: porcupine, sneaker, superman, and chess-board, comparing KAT, self-taught learning, and the non-transfer baseline.
classifier trained using only target samples. In this particular comparison, we varied the number of training target category samples from 1 to 15 in order to understand the effect of transfer learning when increasing number of target samples becomes available. The same image descriptor was used for both methods so the performance difference was solely due to using transferred knowledge. The results are shown in Figure 2, where “Avg EER” is the average EER over 300 trials, “Pos Avg EER” is the average EER over those trials when EER was reduced by positive transfer, “#Pos” is the number of trials with positive transfer, and “#Neg” is the number of trials with negative transfer. Figure 2(a) shows that on average the EER was reduced from 0.428 to 0.392 when only one target sample was used from each target class, representing a 8.4% improvement in recognition performance. However, when the number of available target samples increases, the contribution of transfer learning diminishes, shown by the narrowed EER gap between the KAT and the non-transfer model. This is expected as transferring auxiliary kernels has its greatest benefit when the available target training sample is minimal, e.g. with only 1 target sample. Figure 2(c) shows that positive transfer was achieved for 67.7% of the 300 trials for one-shot learning, and reduced to 37.3% when the target training samples was increased to 15 for each category. Interestingly although both the maximum EER reduction per class (Figure 2(d)) and the average EER decrease as the number of data samples increases, the average reduction for the classes with positive transfer was about the same. This suggests that the decrease in the number of positive transfer classes (Figure 2(c)) was the reason for the narrowed EER gap in Figure 2(a). Figure 3 shows the performance of our KAT on 4 target categories using only 5 target samples for training each. The randomly selected training samples can be seen in the left column of Figure 4 which clearly show that there are huge
variations in appearance and view angle for each class. Despite these variations, Figure 3 shows that very good performance was obtained, with significant improvement over the non-transfer model. Figure 4 gives some insight into which auxiliary categories have been extracted and transferred to the target classes by the KAT model. In particular, it can be seen that KAT can automatically select object images from related classes if they are present in the auxiliary data (e.g. a butterfly for superman, which flies), or from objects that share some similarity in shape or appearance (e.g. bonsai for porcupine, boat for shoe, pool table for chess board), for extracting transferrable knowledge. Kernel adaptation transfer vs. self-taught learning. We compared our KAT model with the most related unsupervised transfer learning method we are aware of, the Self-Taught Learning (STL) model [17]. For STL, we followed the PCA dimension reduction procedure in [17] on the high-dimensional descriptors and exactly the same procedure in [17] to train the self-taught learner. The number of sparse feature bases was set to be the same as the reduced dimensionality, and the sparsity weight was tuned in {0.005, 0.05, 0.1, 0.5} [17] due to the computational cost of the STL model, as discussed later. As STL is computationally expensive, it is not possible to select this parameter using cross-validation in practice. Instead, we tuned the parameter of STL using the test data to illustrate the best performance STL can possibly achieve. In other words, lower performance is expected for STL if cross-validation is used. The performances of the two unsupervised transfer learning methods are compared in Table 1. It shows that a majority (196) of the 300 object recognition tasks result in negative transfer using STL. As a result, the overall performance of STL was actually worse than the baseline non-transfer model (0.409 vs. 0.373). This is because STL cannot control how much auxiliary knowledge should be transferred for different target categories. In comparison, because it measures the usefulness of, and automatically selects, transferrable knowledge for different target classes, our KAT achieved far superior results, with only 84 out of 300 tasks leading to negative transfer. Table 1 also shows that even for those positive transfer tasks, KAT yields a higher improvement than STL (0.043 vs. 0.038 in EER reduction). This suggests that the knowledge transferred by our method, which aims to discriminate objects from background, is more suitable for the recognition task than that of the STL method, which is generative in nature and aims to capture how objects look in general. A more detailed comparison on specific object recognition tasks is presented in Figure 3.

Table 1. Comparing KAT with STL [17] and a generic non-background classifier (Aux) given 5 samples per class. The numbers in brackets are the performance of the non-transfer model for the target classes for which positive transfer was achieved.

Method        Avg EER  Pos Avg EER    #Pos  #Neg
KAT           0.357    0.332 (0.375)  187   84
STL           0.409    0.388 (0.426)  87    196
Aux           0.388    0.369 (0.419)  123   161
Non-Transfer  0.373    N/A            N/A   N/A
Fig. 4. Examples of target categories (left column: training images of the target category) and the auxiliary data that provide the most transferrable knowledge (right column: some of the selected auxiliary images) by KAT. The four target categories from top to bottom are porcupine, sneaker, superman, and chess-board, respectively.
Fig. 5. Supervised KAT vs. KAT. Columns from left to right: target category, manually selected auxiliary categories, non-transfer, KAT, and supervised KAT.
On computational cost, STL is very expensive compared to KAT for highdimensional data. This is mainly due to the costly l1 -norm sparsity optimization in STL. On a computer server with an Intel dual-core 2.93GHz CPU and 24GB RAM, it took between 4 to 30 hours to learn STL sparse bases depending on the sparsity weight and about 5-20 minutes per category for computing the coefficients of all training and testing data. In contrast, our KAT took on average 20 minutes to learn each category so is at least one magnitude faster. Recognition using only auxiliary knowledge. To evaluate the significance of selecting unlabelled auxiliary knowledge in KAT based on the limited target class samples, we compare KAT to a generic non-background classifier termed ‘Aux’ using auxiliary data only. Table 1 shows that based on the general knowledge extracted from all objects in the auxiliary dataset alone, the performance of Aux is much better than random guess (0.5), but worse than the non-transfer model and much worse than our KAT. This result confirms the necessity and usefulness of performing transfer learning provided that the extracted auxiliary knowledge is indeed useful, and is quantified and selected. Supervised KAT vs. KAT. To compare how unsupervised transfer learning fares against supervised transfer learning given a fully labelled auxiliary dataset of only related object categories to a target class, we selected related auxiliary categories for 6 target categories (car tire, giraffe, lathe, cormorant, gorilla, and fire truck) as shown in Figure 5. With these labelled and related auxiliary data, we implemented a supervised version of our KAT, that is, auxiliary mappings were constructed from each auxiliary class, instead of relying on clustering. As
Table 2. Further investigation on transferrable knowledge selection.

#Target                     #Pos / #Neg                                           Avg EER
Samples   KAT       KAT(λ=0)   KAT-Naive  KAT(full sparsity)   KAT    KAT(λ=0)  KAT-Naive  KAT(full sparsity)  Kernel Alignment  Non-Transfer
1         203 / 84  203 / 84   196 / 92   196 / 92             0.392  0.392     0.401      0.401               0.398             0.428
2         208 / 65  203 / 78   166 / 114  151 / 128            0.370  0.371     0.399      0.416               0.414             0.404
5         187 / 84  134 / 138  108 / 184  84 / 205             0.357  0.387     0.424      0.446               0.442             0.373
10        66 / 25   72 / 80    22 / 138   25 / 137             0.343  0.354     0.413      0.415               0.382             0.349
15        112 / 28  76 / 151   44 / 196   36 / 204             0.325  0.363     0.428      0.450               0.403             0.336
Fig. 6. Examples of object categories for which no improvement was obtained using KAT. From left to right: background, galaxy, and waterfall.
shown in Figure 5, with those labelled data from the related auxiliary categories, the supervised KAT does not always achieve better results compared to (unsupervised) KAT. This is because our KAT can select automatically those related samples for building auxiliary kernels when they are present, but more important also utilises any relevant and available information about how objects are distinguishable from background from all the irrelevant auxiliary data. Further investigation on transferrable knowledge selection. 1) With vs. without constraints. We validate the usefulness of the two constraints introduced in KAT: (1) a hypothesis constraint (Eqn. (8)) and (2) a partial sparsity penalty controlled by λ (Eqn. (9)), for alleviating negative transfer. Table 2 shows the performance of KAT without the second constraint (termed KAT(λ = 0)) and without both (termed KAT-Naive). Note that as aforementioned for one target sample, the λ in KAT as well as its variants was set to 0. It is evident that the usefulness of the extracted auxiliary knowledge is significantly weakened (more than 33% higher in average EER compared to KAT in some case) without these constraints and much less positive transfer is obtained. 2) KAT vs. Kernel Alignment. We compare KAT with the widely adopted kernel alignment method for kernel fusion in Table 2 [23]. As shown KAT performs much better, because KAT introduces our two proposed constraints to balance the effects of target and auxiliary kernels, but kernel alignment does not explicitly consider the important difference between those two effects. 3) KAT vs. KAT(full sparsity). We show that the most popular way for kernel selection using a full l1 norm sparsity on all kernel weights bi , i = 0, · · · , m cannot work well for our unsupervised cross-category transfer learning. This is demonstrated by the KAT(full sparsity) that uses the full sparsity as in previous MKL work [29] on all kernel weights for kernel fusion in Table 2. The results show that without differentiating target kernel from auxiliary kernels using the proposed constraints, the full sparsity penalty leads to much worse performance. The failure mode. As shown in table 1, some object categories cannot benefit from KAT. Figure 6 shows examples of three such categories. For these categories,
the sample images either contain a large portion of background (e.g. waterfall) or contain objects that do not have a clear contour but have textures similar to cluttered background (e.g. galaxy). Since the similarity to background is greater than that to other object classes in the auxiliary data (see Figure 4), the extracted transferrable knowledge does not help for these object categories.
6
Conclusion
We introduced a novel unsupervised selective transfer learning method using Kernel Adaptation Transfer (KAT), which utilises unlabelled auxiliary data from object classes that may be largely irrelevant to any target object category. The model quantifies and selects the most relevant transferrable knowledge for recognising any given target object class with very limited samples. For one-class recognition, KAT selects the most useful general knowledge about how objects are visually distinguishable from cluttered background. Our experiments demonstrate clearly that, owing to its transferrable knowledge selection capability, the proposed unsupervised KAT model significantly outperforms the Self-Taught Learning (STL) method. Although implemented here only for one-class recognition tasks, the proposed KAT framework is a general transfer learning method that can be readily formulated for other pattern recognition problems with large intra-class variation and sparse training data, provided that a set of auxiliary mappings f_s can be constructed. Our current work includes extending this model to multi-class object detection and one-to-many object verification. Acknowledgement. This research was partially funded by the EU FP7 project SAMURAI with grant no. 217899.
References 1. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001) 2. Pinker, S.: How the minds works. W. W. Norton, New York (1999) 3. Bart, E., Ullman, S.: Cross-generalization: Learning novel classes from a single example by feature replacement. In: CVPR (2005) 4. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: CVPR (2009) 5. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: CVPR (2009) 6. Stark, M., Goesele, M., Schiele, B.: A shape-based object class model for knowledge transfer. In: ICCV (2009) 7. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. PAMI 28, 594–611 (2006) 8. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 1345–1359 (2010) 9. Zweig, A., Weinshall, D.: Exploiting object hierarchy: Combining models from different category levels. In: ICCV (2007)
10. Pan, S.J., Kwok, J.T., Yang, Q.: Transfer learning via dimensionality reduction. In: AAAI, pp. 677–682 (2008) 11. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. In: IJCAI (2009) 12. Yang, J., Yan, R., Hauptmann, A.G.: Cross-domain video concept detection using adaptive svms. In: ACM Multimedia, pp. 188–197 (2008) 13. Duan, L., Tsang, I.W., Xu, D., Maybank, S.J.: Domain transfer svm for video concept detection. In: CVPR (2009) 14. Rosenstein, M.T., Marx, Z., Kaelbling, L.P., Dietterich, T.G.: To transfer or not to transfer. In: NIPS 2005 Workshop on Transfer Learning (2005) 15. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset, Technical Report UCB/CSD-04-1366, California Institute of Technology (2007) 16. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learnings. In: CVPR (2003) 17. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y.: Self-taught learning: transfer learning from unlabeled data. In: ICML, pp. 759–766 (2007) 18. Dai, W., Jin, O., Xue, G.R., Yang, Q., Yu, Y.: Eigentransfer: a unified framework for transfer learning. In: ICML, p. 25 (2009) 19. Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., Jordan, M.: Learning the kernel matrix with semidefinite programming. JMLR 5, 27–72 (2004) 20. Gehler, P., Nowozin, S.: On feature combination for multiclass object classfication. In: ICCV (2009) 21. Daum´e, H.: Frustratingly easy domain adaptation. In: ACL (2007) 22. Nocedal, J., Wright, S.: Numerical optimization, 2nd edn. Springer, Heidelberg (2006) 23. Cristianini, N., Kandola, J., Elisseeff, A., Shawe-Taylor, J.: On kernel-target alignment. In: NIPS (2002) 24. Hillel, A.B., Hertz, T., Weinshall, D.: Efficient learning of relational object class models. In: ICCV (2005) 25. Kadir, T., Brady, M.: Saliency and image description. IJCV 45, 83–105 (2001) 26. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 2, 91–110 (2004) 27. Crandall, D.J., Huttenlocher, D.P.: Weakly supervised learning of part-based spatial models for visual object recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 16–29. Springer, Heidelberg (2006) 28. Schnitzspan, P., Fritz, M., Schiele, B.: Hierarchical support vector random fields: Joint training to combine local and global features. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 527–540. Springer, Heidelberg (2008) 29. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: More efficiency in multiple kernel learning. In: ICML (2007)
A Heuristic Deformable Pedestrian Detection Method Yongzhen Huang, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation Chinese Academy of Sciences, Beijing, China {yzhuang,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. Pedestrian detection is an important application in computer vision. Currently, most pedestrian detection methods focus on learning one or multiple fixed models. These algorithms rely heavily on training data and do not perform well in handling various pedestrian deformations. To address this problem, we analyze the cause of pedestrian deformation and propose a method to adaptively describe the state of pedestrians' parts. This is valuable for resolving the pedestrian deformation problem. Experimental results on the INRIA human dataset and our pedestrian pose database demonstrate the effectiveness of our method.
1
Introduction
Pedestrian detection plays an important role in computer vision applications, e.g., surveillance, automatic navigation and human computer interaction. Many pedestrian detection algorithms have been developed. Haar wavelet-based cascade approaches [1] construct a degenerate decision tree of progressively more complex detector layers composed of a set of Haar wavelet features. Part-based methods [2] utilize orientation and position histograms to build part detectors, and then combine them into a whole description of pedestrians. Edge-based methods [3] design a set of enhanced edge features to capture the edge structure of pedestrians. The work on histograms of oriented gradients (HOG) [4] is a milestone in pedestrian detection. It is the first work that achieves impressive performance on pedestrian detection. The HOG algorithm uses a SIFT-like descriptor in each of a set of dense overlapping grids with a local normalization strategy to generate feature maps. Then, the sliding window scheme is adopted, where the trained model is matched with windows at all positions and scales of an image. The HOG feature is good at describing the spatial distribution of pedestrians' edges. In addition, it is insensitive to illumination change because of the local normalization, and resistant to position and scale variation due to the sliding window strategy. The success of HOG spurs more research on pedestrian detection. Many approaches have been developed, focusing on feature selection and combination [5], [6], [7], [8], [9], speed enhancement [10], [11], component learning and spatial relationship description [12], [13], combination with pose estimation [14] and machine learning applications, e.g., kernel tricks [15] and
Fig. 1. A demonstration showing the limitation of HOG in handling pedestrian deformation. (a) Six pedestrians with different poses and viewpoints. (b) The average gradient image of pedestrians. (c) The HOG feature map. (d) The positive model trained by the HOG algorithm. Obviously, one model cannot describe all poses of pedestrians.
different Boosting algorithms [16]. Comprehensive studies can be found in two recent surveys [17], [18]. However, these algorithms share all or some of the following problems:
– They rely heavily on training data. It takes much time and resource to collect and label pedestrian images. Moreover, the trained model usually cannot be directly used for other scenes; e.g., a model trained on outdoor datasets may perform badly in indoor scenes.
– They use fixed models to describe deformable pedestrians. For example, the HOG algorithm generates one template that best differentiates positive and negative samples in the training set. We can consider that it describes the average feature map of pedestrians. This strategy is limited in handling pedestrian deformation. Fig. 1 shows an example.
Recently, some researchers have combined pedestrians' part models and their spatial relationships [13]. In this framework, each part is still described by a fixed model. It ignores the deformation of the parts themselves, and thus cannot describe all shapes of parts. For example, a standing person's leg differs in shape from a running person's leg. To address the above problems, we propose a heuristic deformable pedestrian detection approach. It does not need any training samples and can adaptively describe the shapes of pedestrians' parts. This is because we use prior knowledge on pedestrian deformation, including: 1) most pedestrian deformations are caused by the variation of body parts, e.g., arms waving; 2) the components of a part, e.g., the upper and the lower arm, are rigid (limited by the rigidity of bones). Thus,
Fig. 2. An example showing person deformation and the “two lines” model
they can only rotate around joint points; 3) the joint points can only move within a reasonable range. Therefore, each key part can be described by the combination of two lines which can rotate around joint points. We call it the "two lines" model. Fig. 2 illustrates an example. In the next section, we will elaborate on how to adaptively estimate the locations of joint points and the angle between the two lines.
2
Heuristic Deformable Pedestrian Detection
In this section, we first briefly introduce our algorithm, and then elaborate on each stage. The framework is shown in Fig. 3. It consists of five stages:
1. An input image is convolved with eight Gabor filters with different orientations. The outputs are Gabor feature maps (termed S1 units in this paper).
2. The S1 units are convolved with a group of component filters and the outputs are the response maps of components (S2 units).
3. Two S2 units are combined to describe a part. The choice of S2 units is adaptive according to the maximum response of the component filters with respect to orientation. The output is the response map of parts (S3 units).
4. The most likely location of a part is adaptively calculated by searching for the maximum response of the S3 units in a predefined range, cf. the spatial constraints in Fig. 3. These locations indicate the key parts that cause pedestrian deformation.
5. Finally, the seven maxima are combined in each sliding window. The output is the detection window and the locations of the joint points (red points).
2.1 S1 Units
The S1 layer produces shape information of pedestrians. An input image is convolved by a group of Gabor filters with different orientations:

S1(θ, x, y) = I(θ, x, y) ∗ G(θ, x, y),    (1)

G(θ, x, y) = exp(−(x_0^2 + γ^2 y_0^2) / (2σ^2)) × sin(2π x_0 / λ),    (2)
Fig. 3. The framework of our system
x_0 = x cos θ + y sin θ,   y_0 = −x sin θ + y cos θ,    (3)
where I is the input image and G is the Gabor filter. The ranges of x and y are the abscissa and ordinate associated with the scale of the Gabor filter, and θ controls the orientation. The Gabor filter is similar to the receptive field profiles of mammalian cortical simple cells [19]. It has good orientation and frequency selectivity. From the viewpoint of computer vision, Gabor filters are more robust to noise than the gradient filters ([-1, 0, 1] and [-1, 0, 1]') adopted in the HOG algorithm. This is because they contain scale information and reflect region statistics similar to the edges of pedestrians' parts. An example is shown in Fig. 4.
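To make the S1 stage concrete, the sketch below builds the Gabor filters of Eqs. (1)–(3) with NumPy and convolves an image with them. It is an illustrative reimplementation under stated assumptions (filter size 11, eight orientations over 0–180°, and nominal σ, λ, γ values), not the authors' code.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, size=11, sigma=4.0, lam=8.0, gamma=0.5):
    """Gabor filter of Eq. (2): exp(-(x0^2 + gamma^2 y0^2) / (2 sigma^2)) * sin(2 pi x0 / lam)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x0 = x * np.cos(theta) + y * np.sin(theta)           # Eq. (3)
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(x0**2 + gamma**2 * y0**2) / (2 * sigma**2)) * np.sin(2 * np.pi * x0 / lam)

def s1_units(image, n_orientations=8, size=11):
    """S1(theta, x, y) = I * G(theta) for eight orientations in [0, pi) (Eq. (1))."""
    thetas = [np.pi * k / n_orientations for k in range(n_orientations)]
    return np.stack([convolve2d(image, gabor_kernel(t, size=size), mode='same')
                     for t in thetas])
```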
2.2 S2 Units
Most existing pedestrian detection algorithms train one or several fixed models. They are limited in describing pedestrian deformations. To solve this problem,
Fig. 4. Comparison between the Gabor filter and the gradient filter. (a) A test image. The left line is usually not from a pedestrian. The right object is more similar to components of pedestrians, which are homogeneous over a region. (b) The response of the Gabor filter. The Gabor filter produces a larger response if the object's scale is similar to the size of the Gabor filter. (c) The response of the gradient filter. The gradient filter does not contain scale information; thus all vertical edges produce the same response.
we use an adaptive deformable model. As we analyzed in Fig. 2, the components of a part, e.g., the upper and the lower arm, are rigid, and they can only rotate around joint points. Thus, a part can be described by the combination of two lines. In our system, we define 16 component filters (illustrated in Fig. 3). A component filter is a rectangular template defined in a binary array. The size of the component filter is double that of the Gabor filter. One component filter reflects the orientation information of one component of a part, and the combination of two component filters can describe a part. To generate S2 units, each Gabor feature map (S1 unit) is convolved with the corresponding¹ component filter:

S2(θ, x, y) = S1(θ, x, y) ∗ P(θ, x, y),    (4)
where P denotes the component filters, θ is the orientation of the filters, and S1 is defined in Eq. (1).
2.3 S3 Units
To describe parts' information (S3 units), we calculate the most likely orientation for each component by searching for the maximum response with respect to orientation, within the predefined range of each part:

S3(k, x, y) = max_{θ1} {S2(θ1, x, y)} + max_{θ2, θ2≠θ1} {S2(θ2, x, y)},    (5)
x, y ∈ R(k),    (6)
where k is the index of the part, θ is the orientation of the component filters, and R(k) is the predefined range of part k, which can be estimated from prior knowledge (cf. the spatial constraints in Fig. 3). This maximum operation adaptively finds the orientation information of the components of a part, i.e., the angle between the two components.
¹ Gabor filters and component filters have 8 (0–180°) and 16 (0–360°) orientations, respectively. Each Gabor filter corresponds to two component filters.
The final score of a search window is the sum of the maximum responses of all S3 units with respect to the spatial constraints:

score = Σ_{k=1}^{7} max_{x,y∈R(k)} S3(k, x, y).    (7)
The most likely location of a part is adaptively computed by searching for the maximum response of S3 within R(k). To the best of our knowledge, the proposed system is the first attempt at an adaptive (training-data-free) pedestrian detection system that handles deformations induced by pose and viewpoint variations. The subsequent section studies the system empirically.
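The following is a minimal sketch of how Eqs. (5)–(7) could be evaluated on top of precomputed component responses. Here `S2` is assumed to be an array of shape (16, H, W) holding the component-filter responses inside one sliding window, and `part_ranges` lists the seven spatial ranges R(k) as (y0, y1, x0, x1) boxes; both names are illustrative assumptions rather than the authors' interface.

```python
import numpy as np

def s3_units(S2, part_ranges):
    """S3(k, x, y): sum of the two largest component responses at (x, y), restricted to R(k) (Eqs. (5)-(6))."""
    top2 = np.sort(S2, axis=0)[-2:].sum(axis=0)        # max_{t1} S2 + max_{t2 != t1} S2 at each pixel
    return [top2[y0:y1, x0:x1] for (y0, y1, x0, x1) in part_ranges]

def window_score(S2, part_ranges):
    """score = sum over the seven parts of the maximum S3 response inside R(k) (Eq. (7))."""
    parts = s3_units(S2, part_ranges)
    joints = [np.unravel_index(np.argmax(p), p.shape) for p in parts]   # estimated joint-point locations
    return sum(p.max() for p in parts), joints
```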
3
Empirical Studies
In this section, we first test our algorithm on the INRIA human dataset [20]. In this experiment, we study the influence of the parameters on performance and compare our algorithm with some popular ones. Second, considering that the INRIA human dataset contains limited pose variation, we built a new pedestrian database that includes almost all possible poses of a normal pedestrian. Details are given in Section 3.2. In that experiment, we focus on the invariance to pose variations. Finally, we analyze the speed of our algorithm.
3.1 INRIA Human Dataset
The INRIA human dataset contains 1805 64*128 images of pedestrians cropped from a varied set of personal photos. Most pedestrians are bystanders with a wide variety of backgrounds. Besides, it provides 454 person-free images for testing. As our algorithm doesn’t need training sample, we directly test it on the testing set. There are three parameters in our algorithm: the size of Gabor filters, the number of component filters and the predefined range of parts in Eq. (6). We empirically find that the second and third parameters are insensitive to performance in a reasonable range. In all of our experiments, we set the number of component filters to 16, and the ranges of parts are shown in Fig. 3. To save page length and to keep the paper concise, we only analyze the influence of the Gabor size. Fig. 5 shows the performance. Our algorithm performs well when the Gabor size is 11 and 15, which is smaller than width of pedestrians’ head or limbs. The Gabor filters with this size produce desirable response in pedestrians’ edge. In this case, they can better describe the edge information of pedestrians. In the later experiments, we fix the size of Gabor filters to 11. Afterwards, we compare our algorithm with HOG [4], FtrMine [21], VJ’s method [1] and LSVM [13]. All results except ours are reported in [17]. The
Fig. 5. Performance of our method with different Gabor sizes on INRIA human dataset
The testing strategies include per-window and per-image evaluation [17]. We strictly follow the testing rule2 [20] adopted in the evaluation of HOG. Fig. 6 and Fig. 7 show the results. Our algorithm does not perform best, but it is comparable to HOG and LSVM, two state-of-the-art pedestrian detection algorithms. Considering that our algorithm does not need any training samples, we regard this as a good result. In the later part of this paper, we analyze the failure cases and discuss potential enhancements of our method.

3.2  Pedestrian Poses Database
The INRIA database mainly focuses on differentiating pedestrians from various backgrounds. Subsequently, we analyze the invariance to pedestrian deformations induced by pose and viewpoint variations. We built a large pedestrian pose database3. This database contains 1,440 images at three viewpoints (0°, 45°, 90°) and six different scenes. The database captures 30 persons, each with 48 poses. We also recorded videos of the subjects walking and running at a normal pace. Therefore, our database contains most poses of pedestrians.

2 For readers' convenience, we cite the rule here: the starting scale in the scale-space pyramid is one, and one more level is added to the pyramid till floor(ImageWidth/Scale) > 64 and floor(ImageHeight/Scale) > 128. The scale ratio between two consecutive levels in the pyramid is 1.2. The window stride (sampling distance between two consecutive windows) at any scale is 8 pixels.
3 The address of the database is currently being updated. Please contact the authors if you are interested in the database.
Fig. 6. Performance comparison on INRIA human dataset by the per-window testing rule
Fig. 7. Performance comparison on INRIA human dataset by the per-image testing rule
The evaluation criterion is the “per-image” rule used in [17]: the input is an image, and the output is a bounding box and a score for each detection window. A detected bounding box and a ground-truth bounding box are matched if their overlap exceeds 50% of the area of their union:

area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5,   (8)

where BB_dt is the detected bounding box and BB_gt is the ground-truth bounding box.
Fig. 8. Performance comparison between our method and the HOG algorithm
Fig. 9. Confidence distribution. (a) The score of each sample by HOG. (b) The score of each sample by our method. (c) The distribution of score by HOG. (d) The distribution of score by our method.
Unmatched BB_dt are counted as false positives and unmatched BB_gt as false negatives. We plot the miss rate against the number of false positives per image by varying the threshold on the detection confidence. In addition, we provide the performance of HOG. The curves are shown in Fig. 8.
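A minimal sketch of this matching and counting step is given below, assuming axis-aligned boxes given as (x1, y1, x2, y2) and a greedy, score-ordered one-to-one assignment; the paper only specifies the overlap criterion of Eq. (8), so the assignment strategy is an assumption.

```python
def overlap(dt, gt):
    # Eq. (8): intersection over union of two boxes (x1, y1, x2, y2)
    ix1, iy1 = max(dt[0], gt[0]), max(dt[1], gt[1])
    ix2, iy2 = min(dt[2], gt[2]), min(dt[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(dt) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def count_errors(detections, ground_truth, thr=0.5):
    """detections: list of {'box': ..., 'score': ...}; ground_truth: list of boxes.
    Unmatched detections are false positives, unmatched ground truth false negatives."""
    matched, false_pos = set(), 0
    for dt in sorted(detections, key=lambda d: d["score"], reverse=True):
        best, best_ov = None, thr
        for j, gt in enumerate(ground_truth):
            ov = overlap(dt["box"], gt)
            if j not in matched and ov > best_ov:
                best, best_ov = j, ov
        if best is None:
            false_pos += 1
        else:
            matched.add(best)
    false_neg = len(ground_truth) - len(matched)
    return false_pos, false_neg
```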
Fig. 10. Missed positive samples by our method. All of these samples were captured under poor contrast conditions.
Fig. 11. (Best viewed in color.) 1st row: missed positive samples by HOG and their scores; these samples exhibit large pose variations. 2nd row: detection results by our method. 3rd row: detection results on multi-pedestrian images by HOG. 4th row: detection results on multi-pedestrian images by our method.
When the miss rate is high, HOG performs slightly better than our method. When the miss rate is low, specifically lower than 0.25, our method outperforms HOG. In addition, we provide the confidence distribution of all samples for our method and the HOG algorithm in Fig. 9. The confidence scores of HOG are more widely distributed, which indicates that HOG is more sensitive to pose and viewpoint variations. Fig. 10 analyzes some samples missed by our method. Our method is not sufficiently robust to large contrast variations; this is one of the main problems we would like to address in future work. Finally, we show some detection results of HOG and our method in Fig. 11.

3.3  Efficiency Analysis
It takes about one second for our algorithm to process an image with a resolution of 320×240 in Matlab on a personal computer with a 2.4 GHz CPU (Intel Core 2) and 4 GB RAM. The main computation is the filtering operation in calculating the S2 and S3 units (more than 95% of the computation of the whole algorithm). According to our previous experience, if we use the IPP filtering function [22], which is about 20 times faster than the Matlab filtering function, the system can achieve real-time performance (less than 0.1 second per image).
4  Discussion
Our method can be viewed as a group of deformable models for parts, and pedestrian detection is transformed into searching for a locally optimal description of the parts. In fact, this strategy has been utilized by [23], which uses a series of positive examples to estimate appropriate weightings of features relative to one another and produces a score that is effective at estimating configurations of pedestrians. Earlier, it originates from the analysis of articulated objects [24]. The novelty of our work is that our method can adaptively describe the deformable parts, because we effectively use prior knowledge on pedestrian deformation (see the analysis in the introduction).
5  Conclusion
In this paper, we have analyzed the problems of current pedestrian detection algorithms and presented a heuristic deformable pedestrian detection method. It can handle pedestrian deformations induced by pose and viewpoint variations without training samples. Experiments on the INRIA human dataset and our pedestrian pose database have demonstrated that our method is comparable to some state-of-the-art algorithms while not requiring training samples. In the future, we will focus on enhancing the accuracy of our algorithm. The accuracy may be improved by further exploring the form of the deformable part filters, the discrimination of different key parts, and their spatial relationships. In addition, we intend to integrate the proposed algorithm with appearance-based methods to improve the performance under poor contrast conditions.
References 1. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: ICCV (2003) 2. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004) 3. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. International Journal of Computer Vision 75, 247–266 (2007) 4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 5. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: CVPR (2007) 6. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on riemannian manifolds. In: CVPR (2007) 7. Chen, Y.T., Chen, C.S.: Fast human detection using a novel boosted cascading structure with meta stages. IEEE Trans. on Image Processing 17, 1452–1464 (2008) 8. Wu, B., Nevatia, R.: Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. In: CVPR (2008) 9. Schwartz, W.R., Kembhavi, A., Harwood, D., Davis, L.S.: Human detection using partial least squares analysis. In: ICCV (2009) 10. Zhu, Q., Yeh, M.C., Cheng, K.T., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: CVPR (2006) 11. Zhang, W., Zelinsky, G., Samaras, D.: Real-time accurate object detection using multiple resolutions. In: CVPR (2007) 12. Doll´ ar, P., Babenko, B., Belongie, S., Perona, P., Tu, Z.: Multiple component learning for object detection. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 211–224. Springer, Heidelberg (2008) 13. Felzenszwalb, P., Mcallester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR (2008) 14. Andriluk, M., Roth, S., Schiele, B.: Pictorial structures revisited: People detection and articulated pose estimation. In: CVPR (2009) 15. Maji, S., Berg, A.C., Malik, J.: Classification using intersection kernel support vector machines is efficient. In: CVPR (2008) 16. Begard, J., Allezard, N., Sayd, P.: Real-time human detection in urban scenes: Local descriptors and classifiers selection with adaboost-like algorithms. In: CVPR Workshops (2008) 17. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR (2009) 18. Enzweiler, M., Gavrila, D.M.: Monocular pedestrian detection: Survey and experiments. IEEE Trans. on Pattern Analysis and Machine Intelligence 31, 2179–2195 (2009) 19. Jones, J.P., Palmer, L.A.: An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology 58, 1233– 1258 (1987) 20. http://pascal.inrialpes.fr/data/human/ 21. Dollar, P., Tu, Z., Tao, H., Belongie, S.: Feature mining for image classification. In: CVPR (2007) 22. http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm 23. Tran, D., Forsyth, D.: Configuration estimates improve pedestrian finding. In: NIPS (2008) 24. Ramanan, D.: Learning to parse images of articulated objects. In: NIPS (2006)
Gradual Sampling and Mutual Information Maximisation for Markerless Motion Capture

Yifan Lu1, Lei Wang1, Richard Hartley1,3, Hongdong Li1,3, and Dan Xu2

1 School of Engineering, CECS, Australian National University
2 Department of Computer Science and Engineering, SISE, Yunan University
3 Canberra Research Labs, National ICT Australia
{Yifan.Lu,Lei.Wang,Richard.Hartley,Hongdong.Li}@anu.edu.au, [email protected]
Abstract. The major issue in markerless motion capture is finding the global optimum in a multimodal setting where distinct gestures may have similar likelihood values. Instead of focusing only on effective search, as many existing works do, our approach resolves gesture ambiguity by designing a better-behaved observation likelihood. We extend Annealed Particle Filtering by a novel gradual sampling scheme that allows evaluations to concentrate on large mismatches of the tracking subject. Noticing the limitation of silhouettes in resolving gesture ambiguity, we incorporate appearance information in an illumination-invariant way by maximising the Mutual Information between an appearance model and the observation. This in turn strengthens the effectiveness of the better-behaved likelihood. Experiments on benchmark datasets show that our tracking performance is comparable to or higher than that of state-of-the-art studies, but with a simpler setting and higher computational efficiency.
1  Introduction
Recent advances in markerless motion capture have produced a number of new methods and approaches. Balan et al. [1] extended a deformation scheme to recover detailed human shape and pose from images. The work of Corazza, Mundermann et al. employs an articulated-ICP-based method to register body segments with a sequence of visual hulls. Meanwhile, the high-quality performance capture [2] presented by de Aguiar et al. proposes a new mesh-based framework which captures gestures of the subject and recovers small-scale shape details. A similar study [3] by Starck and Hilton adopts a graph-cut global optimisation for highly realistic surface capture. Most of the above works rest on the same underlying concept, shape-from-silhouette [4]. A bounding geometry of the original 3D shape, the so-called visual hull, is determined by intersecting generalised cones formed by back-projecting each multi-view silhouette using the camera parameters. Moreover, these studies employ a similar approach that adjusts a human template model to explicitly or implicitly fit a visual hull in order to find the best posture. Unfortunately, when only a limited number of camera views
and moderate-quality data are available, approaches in this category often suffer from low tracking accuracy and a lack of robustness, as pointed out in [5]. This is due to gesture ambiguities arising from two major facts: 1) With a limited number of camera views, the visual hull is the maximal volume of the tracking subject, a superset of the true volume; it does not determine a unique posture but multiple ones. 2) Silhouettes generated by image segmentation are not reliable but noisy; visual hulls from noisy silhouettes are often corrupted and inconsistent. As a consequence, optimisation algorithms are more likely to be trapped in local maxima, leading to incorrect gestures. Moreover, the global maximum of the observation likelihood can be shifted because of the corrupted data. In the Annealed Particle Filtering based tracking framework, a sufficiently slow annealing schedule can be taken as a measure to rescue particles from the attraction of local minima. With such an annealing schedule, the initial energy function is shaped in a way that local minima are flattened out whereas the global minimum becomes relatively more pronounced. Also, particles will have more evaluations and random perturbations to move into the neighbourhood of the global mode. However, a sufficiently slow annealing schedule is usually computationally intractable in practice. Besides designing more efficient annealing schedules, a better-behaved observation likelihood function is also vitally important to guarantee that the optimisation converges to the true posture. In this work, we propose an efficient and robust observation likelihood function for human motion tracking: 1) It employs an appearance-based likelihood evaluation with an illumination-invariant Mutual Information (MI) [6] criterion (Section 3.1) to supersede the silhouette-based evaluation. In doing so, the likelihood evaluation avoids the artifacts and noise introduced by silhouette segmentation. Also, with the extra appearance information introduced, the landscape of the observation likelihood function becomes well modelled. These make the true posture more likely to correspond to the global mode of the likelihood function. 2) More importantly, a gradual sampling scheme is developed (Section 3.2) for the Annealed Particle Filtering (APF) based tracking framework. It is a smart selection of the observed image data to be included in the evaluation of the observation likelihood function. The selection is based on the error distribution over the observed image in the previous exploration. This scheme works in an error-oriented fashion, allowing the evaluations of the observation likelihood to concentrate on major and relevant errors. It is able to produce better tracking performance and higher computational efficiency with a common annealing schedule. The excellent tracking performance obtained by our approach is verified by experiments on benchmark datasets and compared with that of state-of-the-art methods in the literature.
2  Our Appearance-Based Body Template
Our appearance-based body template incorporates colour texture information as a prior into tracking. This allows our approach to avoid the less robust silhouette-fitting procedure and to achieve better tracking performance. The textured
Fig. 1. From left to right: the articulated skeleton parameterised by 25 DOF, the visual hull of HumanEvaII Subject 4 constructed by Octree-based voxel colouring and the textured template model after manual refinements
body template in our work uses a standard articulated-joint parametrisation to describe the human pose, further leading to an effective representation of the human motion over time. The articulated skeleton consists of 10 segments and is parameterised by 25 degrees of freedom (DOF) in Figure 1. It is registered to a properly scaled template skin mesh by Skeletal Subspace Deformation (SSD)[7]. Then, shape details and texture are recovered by the following procedure: Initially, Octree-based volumetric reconstruction with photo consistency [8] is used to recover the textured visual hull as shown in Figure 1. The vertices of the skin mesh are then registered to voxels by using Iterative Closest Point (ICP) method [9] with manual interactions to adjust misalignments. Colour textures are also assigned by corresponding skin vertices to textured voxels. The colour of a vertex that is invisible from all views is assigned to the same colour as the nearest visible vertex. At last, the template model is imported to commercial software to be finalised according to the real subject. The example of the final template model is illustrated in Figure 1.
3  Proposed Illumination-Invariant and Error-Oriented Evaluation
The proposed approach resides on the APF framework that was first introduced to markerless motion capture by Deutscher et al. [10]. Markerless motion capture can be formulated as a recursive Bayesian filtering framework [11]. It estimates a probability distribution recursively over time using incoming observations and temporal dependencies. Mathematically, this process can be expressed as

p(x_t | y_{1:t}) ∝ p(y_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1},   (1)
where xt denotes a pose configuration at time t, yt an observation at time t, xt−1 the previous state, and y1:t the collection of observations from time 1 to time t. Because the above integral does not have a closed form solution in general, the posterior distribution is often approximated by Sampling techniques. Simulated
annealing [12] is incorporated into APF to maximise the observation likelihood p(y_t | x_t) that measures how well a particle (a pose configuration) x_t fits the observation y_t at time t. The observation likelihood is often formulated in a modified form of the Boltzmann distribution:

p(y_t | x_t) = exp{−λ E(y_t, x_t)},   (2)
where the annealing variable λ is defined as 1/(k_B T_t), the inverse of the product of the Boltzmann constant k_B and the temperature T_t at time t. Maximising the observation likelihood p(y_t | x_t) then becomes minimising an energy function E(y_t, x_t). The optimisation of APF is done iteratively according to a predefined M-phase annealing schedule {λ_1, ..., λ_M}, where λ_1 < λ_2 < ... < λ_M. At time t, considering a single phase m, the initial particles are the outcomes of the previous phase m − 1 or are drawn from the temporal model p(x_t | x_{t−1}). Then, all particles are weighted by the observation likelihood p(y_t | x_t) and resampled probabilistically to identify good particles that are likely to be near the global optimum. Subsequently, the particles are perturbed to new positions for the next iteration by Gaussian noise with a covariance matrix P_m. Eventually, the particles are expected to concentrate in the neighbourhood of the global optimum of the posterior distribution or, equivalently, of the energy function. As seen, the energy function E(y_t, x_t) plays a critical role. In the following sections, we improve the robustness, accuracy and computational efficiency in optimising E(y_t, x_t) through Mutual Information maximisation and error-oriented gradual sampling.
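For illustration, a minimal sketch of one annealing layer of such an APF is given below; `energy` stands for E(y_t, ·) of Eq. (2), and the diagonal diffusion covariance and the NumPy-based resampling are simplifications rather than the exact implementation used here.

```python
import numpy as np

def apf_layer(particles, energy, lam, diffusion_std, rng):
    """One annealing layer m of the Annealed Particle Filter.
    particles:     (N, D) array of pose configurations x_t
    energy:        callable returning E(y_t, x) for a single particle
    lam:           annealing variable lambda_m of this layer
    diffusion_std: (D,) standard deviations of the Gaussian perturbation (diagonal P_m)"""
    # weight the particles with the Boltzmann-style likelihood of Eq. (2)
    E = np.array([energy(x) for x in particles])
    w = np.exp(-lam * (E - E.min()))     # subtract the minimum for numerical stability
    w /= w.sum()

    # probabilistic resampling keeps particles that are likely near the global optimum
    idx = rng.choice(len(particles), size=len(particles), p=w)
    resampled = particles[idx]

    # Gaussian perturbation lets the next layer explore around the surviving particles
    return resampled + rng.normal(scale=diffusion_std, size=resampled.shape)
```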
3.1  Maximisation of Mutual Information
Given multi-view image observations, the estimated pose, and the pre-built template human model, the likelihood evaluation is performed by comparing the observed images against the synthesised images. A synthesised image is obtained by projecting the template human model onto a mean static background image based on camera calibration parameters and a given pose configuration. Figure 2 shows an observed image and a synthesised image from a tracked example in the HumanEvaII dataset [13]. A direct comparison of the two images
Fig. 2. From left to right: an observed image, and a synthesised image
in the commonly used RGB colour space will not be robust and is often affected by lighting conditions and appearance differences between the template model and the real subject. In this work, we employ robust image similarity metrics developed in the literature to overcome this problem. In particular, the work by Viola et al. [14] suggests that MI metrics are reliable for evaluating models with substantially different appearances and are even robust with respect to variations of illumination. Mutual Information is a quantity that measures the mutual dependence of two variables. Considering two images Ψ, Ω and their pixels ψ, ω as random variables, Mutual Information can be expressed in terms of entropy as:

I(Ψ; Ω) = H(Ψ) + H(Ω) − H(Ψ, Ω)
        = − Σ_{ψ∈Ψ} p(ψ) log p(ψ) − Σ_{ω∈Ω} p(ω) log p(ω) + Σ_{ω∈Ω} Σ_{ψ∈Ψ} p(ψ, ω) log p(ψ, ω)
        = Σ_{ω∈Ω} Σ_{ψ∈Ψ} p(ψ, ω) log [ p(ψ, ω) / (p(ψ) p(ω)) ],
where H(Ψ) and H(Ω) denote the marginal entropies, H(Ψ, Ω) the joint entropy, and p(·) the probability density function. In our case, p(·) is approximated by the Parzen window method with Gaussian kernels:

p(ψ) ≈ (1/N) Σ_{ψ_i∈W} (1/√(2πσ²)) exp(−(ψ − ψ_i)² / (2σ²)),
p(ψ, ω) ≈ (1/N) Σ_{ψ_i∈W_ψ, ω_i∈W_ω} (1/(2π√|Σ|)) exp(−½ [ψ − ψ_i, ω − ω_i] Σ⁻¹ [ψ − ψ_i, ω − ω_i]ᵀ),
where N denotes the number of samples in the window W or W_ψ, W_ω, and Σ is assumed to be a diagonal covariance matrix. To be more robust to lighting conditions, MI is computed in the CIELab colour space in our work. Overall, the energy function E(y_t, x_t) can be summarised as:

E(y_t, x_t) = (1/N_view) Σ_{i=1}^{N_view} 1 / [ k_L I_L(IM^i_{y_t}; IM^i_{x_t}) + k_a I_a(IM^i_{y_t}; IM^i_{x_t}) + k_b I_b(IM^i_{y_t}; IM^i_{x_t}) ],   (3)
where IM^i_{y_t} denotes the i-th view of the observed image y_t at time t, IM^i_{x_t} the i-th view of the synthesised image produced by projecting the estimated state x_t at time t, and I_L(·), I_a(·) and I_b(·) the MI criterion values calculated in the L, a and b channels, respectively. Furthermore, k_L, k_a and k_b denote the coefficients that control the weights of the L, a and b channels. Usually, k_L is set to a small value in order to suppress the influence of illumination.
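To illustrate Eq. (3), the sketch below computes a per-channel mutual information and averages the inverse weighted MI over the camera views; it uses a joint histogram as a simple stand-in for the Parzen-window density estimate, and the coefficient values are placeholders.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Histogram-based approximation of I(Psi; Omega) for one colour channel."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_joint = joint / joint.sum()
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    nz = p_joint > 0
    return float(np.sum(p_joint[nz] * np.log(p_joint[nz] / (p_a @ p_b)[nz])))

def energy(obs_views, syn_views, kL=0.1, ka=1.0, kb=1.0):
    """Eq. (3): each view is given as the (L, a, b) channels of the observed and the
    synthesised image; kL is kept small to suppress the illumination influence."""
    total = 0.0
    for (oL, oa, ob), (sL, sa, sb) in zip(obs_views, syn_views):
        mi = (kL * mutual_information(oL, sL)
              + ka * mutual_information(oa, sa)
              + kb * mutual_information(ob, sb))
        total += 1.0 / max(mi, 1e-12)
    return total / len(obs_views)
```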
3.2  Gradual Sampling Scheme for Annealed Particle Filter
The key ideas of our gradual sampling scheme are: 1) the precision of the energy function evaluations changes adaptively with the process of annealing. We blur
Fig. 3. The top diagram shows the error-oriented gradual sampling with 6 layers, and the bottom pictures illustrate that large errors among 200 particles gradually concentrate on the tracking subject through optimisation
observed and synthesised images at the early layers and use the blurred versions to evaluate the energy function. We observe that this is able to flatten the shape of the energy function, allowing a large number of particles to survive and encouraging a broader exploration. Then we gradually increase the resolution of the observed and synthesised images and use them in function evaluation at the later layers. This will provide more precise information to correctly single out those close to the global optimum; 2) With the increase of layers, the majority of large errors will progressively concentrate on the tracking subject and even only on some body areas where the observed and synthesised images have not been well aligned, as shown in the bottom of Figure 3. Taking this situation into account, we proposed a “smart” selection of the image areas (or extremely, the pixels) to be used in energy function evaluation. We weigh different image areas based on the magnitude of the error from them in the last layer. Two criteria are used: i) if the error in an area is small (it means that the observed and synthesised have been well aligned in this region), we lower its weight; ii) if the area is far from the centre of the projection of the human body, we lower its weight. For an area with lower weight, we will sample a smaller number of its pixels to be included in the energy function evaluation. This is where the name “Gradual Sampling” comes from. Following the above ideas, we partition the synthesised and observed images into a number of small-sized blocks, and compute the distribution of error over them. The error-oriented block selection and pixel sampling use the distribution from the last layer to guide the current function evaluation. This allows the evaluation to focus more on the large-error areas and at the same time reduce the number of pixels included in evaluation. The whole process relies on a measurement
Algorithm 1. Gradual Sampling for a typical frame at time t
Require: the survival rate α_m used in APF [10], a set of predefined β_m, the observation y_t, the total number of layers M, and the initial covariance matrix P_0.
for m = 1 to M do
  1: Initialise N particles x_t^1, ..., x_t^N from the last layer or from the temporal model;
  2: Evaluate E(y_t, x_t) for each particle with the NEP_i pixels sampled from each block of the blurred/original images; average the error statistics over all particles;
  3: Compute NEP_i for each block from the error statistics and Eq. (4);
  4: Calculate λ_m by solving α_m N Σ_{i=1}^{N} (w_{t,m}^i)² = (Σ_{i=1}^{N} w_{t,m}^i)², where w_{t,m}^i = exp{−λ_m E(y_t, x_{t,m}^i)} and N is the number of particles;
  5: Update the weights of all particles using exp{−λ_m E(y_t, x_t)};
  6: Resample N particles from the updated importance weight distribution;
  7: Perturb the particles with Gaussian noise with covariance P_m = P_{m−1} α_m.
end for
called the Number of Effective Pixels1 (NEP in short), defined for the i-th block. In layer m, the NEP of the i-th block is expressed as:

NEP_{m,i} = N_i · η_{m,i},   if ||c_i − C_bo||_2 < σ,
NEP_{m,i} = N_i · η_{m,i} · (1/√(2πσ²)) exp(−||c_i − C_bo||²_2 / (2σ²)),   if ||c_i − C_bo||_2 ≥ σ.   (4)

Here, η_{m,i} is a factor controlling the number of pixels sampled from the i-th block. It is proportional to the ratio of the average error of the i-th block (denoted by err_{m,i}) to the maximal average error over all blocks (denoted by max_i(err_{m,i})). In doing so, NEP_i decreases relatively fast for blocks with small error, whereas it remains relatively large and decreases more slowly for blocks with large error. This helps to gradually concentrate on misalignments. Also, considering that misalignments between the template model and the real subject are gradually reduced during the annealing, we progressively decrease the total number of pixels involved in the function evaluation, which speeds up the evaluation and thus the tracking. To realise this, η_{m,i} is also proportional to a predefined β_m, which decreases with the layer number m. Therefore, η_{m,i} is expressed as

η_{m,i} = β_m · err_{m,i} / max_i(err_{m,i}).   (5)
Equation (4) also takes the distance of the i-th block from the tracking subject into account. We use a Gaussian weighting scheme to exponentially decay the importance of the i-th block: the farther the centroid of the i-th block (denoted by c_i) is from the centroid of the human body template projected onto the image plane (denoted by C_bo), the less important the block is and the fewer of its pixels are sampled. This makes the energy function evaluation concentrate on the error around the tracking subject. σ denotes
1 Pixels with large errors often correspond to misalignments of the subject, and they contribute to “effective” measurements.
the standard deviation of this Gaussian distribution. It will be empirically set at the beginning of the tracking based on the scattering radius of the projection of the human body template in a “T” gesture. We illustrate a typical procedure of the error-oriented pixel sampling in the top of Figure 3. Our gradual sampling procedure at time t is outlined in Algorithm 1.
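A minimal sketch of the per-block pixel budget of Eqs. (4)–(5) is given below; the block centroids, the projected body centre C_bo and σ are assumed to be available from the projected template, and the rounding to integer pixel counts is an implementation choice.

```python
import numpy as np

def effective_pixels(block_errors, block_sizes, centroids, c_body, sigma, beta_m):
    """Number of Effective Pixels per block for annealing layer m (Eqs. 4-5).
    block_errors: (B,) average error of each block from the previous layer
    block_sizes:  (B,) number of pixels N_i in each block
    centroids:    (B, 2) block centroids c_i;  c_body: projected body centre C_bo"""
    # Eq. (5): error-proportional factor, scaled by the per-layer budget beta_m
    eta = beta_m * block_errors / block_errors.max()

    # Eq. (4): blocks far from the body centre are down-weighted by a Gaussian
    dist = np.linalg.norm(centroids - c_body, axis=1)
    gauss = np.exp(-dist ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    nep = np.where(dist < sigma, block_sizes * eta, block_sizes * eta * gauss)
    return np.round(nep).astype(int)
```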
4  Experiments and Discussion
Experiments are performed on the benchmark datasets [13]: HumanEvaI, which contains 4 grayscale and 3 colour calibrated video streams synchronised with Mocap data at 60 Hz, and HumanEvaII, which contains 4 colour calibrated image sequences synchronised with Mocap data at 60 Hz. The tracking results are evaluated against the ground-truth Mocap data to obtain the absolute mean joint centre position errors and standard deviations. The experimental results with 50 particles and 10 layers on the 443-frame trial 1 of the Subject 3 walking sequence of HumanEvaI are plotted in the top of Figure 4. Using only 3 colour video streams, the proposed method maintains 64.5 ± 8.2 mm (using both MI and GS), 69.4 ± 9.9 mm (without MI) and 68.6 ± 8 mm (without GS). The difference among them demonstrates the effect of incorporating MI and GS in human motion tracking. All three methods outperform the silhouette-based method2, which merely achieves 78 ± 12.8 mm using 7 video streams. To better verify the accuracy of our tracking result, we project the estimated poses onto another 4 novel views unused in our tracking algorithm. As shown in Figure 6, the projections match the figures of the real subject very well.

Table 1. Absolute mean joint position errors on HumanEvaII Subject 4 from different research groups

Study | Particles | Layers | Errors (ave ± std) mm | Comments
[15]  | 250       | 15     | 32 ± 4.5              | two-pass optimisation with smoothing
Ours  | 200       | 10     | 54.6 ± 5.2            | MI and gradual sampling
[16]  | 800       | 10     | 50–100                | hierarchical approach
[17]  | 200       | 5      | 80 ± 5                | bi-directional silhouette-based
[18]  | N/A       | N/A    | over 170              | mix of learning and tracking
Another experiment uses 200 particles and 10 layers for the longer, 1257-frame combo sequence of Subject 4 in HumanEvaII. The ground-truth Mocap data is withheld by the dataset owner and is only available through the online evaluation. We submitted our tracking result and obtained the online evaluation results as follows. As shown in the middle of Figure 4, the proposed method achieves 54.6 ± 5.2 mm. Without the support of MI and GS, the errors rise to 64.3 ± 12.2 mm and 59.38 ± 5.5 mm, respectively. In contrast, the silhouette-based method with the same settings only achieves 90.7 ± 16.7 mm, and the maximum error
2 The silhouette-based method uses only the silhouette feature.
Fig. 4. From the top to bottom, 1) tracking results on HumanEvaI the Subject 3 Walking Sequence, 2) the Subject 4 of HumanEvaII Combo Sequence, 3) computational time comparison for the proposed and the baseline method with different particles and layers, and 4) the convergence speed of Gradual Sampling versus the Common APF method. Note that the ground truth data is corrupted at 91-108 and 163-176 frames on HumanEvaI and at 298-335 frames on HumanEvaII.
Fig. 5. Accurate tracking results from HumanEvaII Subject 4. The average Euclidean error of 3D joint positions is less than 55mm.
reaches about 170 mm. This shows the advantage of our appearance-based likelihood evaluation over a silhouette-based one. Although the jogging from frames 400 to 800 is more difficult to track than the slow walking and balancing movements, the proposed method using MI still tracks stably, whereas the two methods without MI experience drastic fluctuations. As illustrated in Table 1, several research groups [15–18] have evaluated their results on Subject 4 of HumanEvaII. Our results achieve the second-best performance overall. The work in [18] proposed a learning-based approach with errors over 170 mm, which is relatively inaccurate compared with APF-based approaches. The work in [17] utilised a bi-directional silhouette-based evaluation and achieved 80 ± 5 mm; however, it relies on the quality of the silhouette segmentation. The work in [16] proposed a hierarchical approach that employs a relatively large number of evaluations to achieve errors within 50−100 mm. The method in [15] can be expected to perform better than ours because it utilises a two-pass optimisation in which the second pass smooths with respect to future frames. We do not adopt this in our approach because two-pass optimisation incurs more computational overhead and limits the applicability to real-time tracking. Moreover, as pointed out in [5, 16, 18], when the error is less than 50 mm, the actual tracking error is no longer measurable because of the limited precision of the joint centre positions estimated from the Mocap data, which is considered as ground truth, and the error between the human model and the real subject3. Hence, in this context, our performance of 54.6 ± 5.2 mm is almost the best possible. This context also explains why there is about 50 mm error for our initial pose although it is set accurately. More results are presented in Figure 5. The performance experiment is set up on a dual-core Windows system with a 2.8 GHz CPU and 4 GB RAM. The average computational time per frame is
3 Note that there are no markers corresponding to actual joint centres in the Mocap data. As a result, the joint centres’ positions cannot be recovered very accurately from the Mocap data.
Fig. 6. Tracking results from HumanEvaI Subject 3. It shows 4 novel camera views which are not used in tracking.
compared with that of the benchmark baseline algorithm4 [17], which is the only publicly available implementation for the HumanEva dataset. We ran both algorithms with different combinations of layers and particles; the results are shown in the bottom of Figure 4. Despite the extra computational overhead due to the Mutual Information criterion, our method with gradual sampling still runs almost 10 times faster than the baseline algorithm in [17].
5  Conclusion and Future Work
Our method introduces 1) a robust Mutual Information maximisation that utilises more visual information, reduces the influence of illumination variations and removes unreliable image segmentation from the tracking pipeline; and 2) a gradual sampling scheme for the APF-based tracking framework that gradually focuses on resolving the major mismatches around the tracking subject and effectively allocates the computational resources according to the error distribution and the hierarchical order of the human anatomy. Experiments on the benchmark datasets demonstrate that our method is very competitive with other state-of-the-art studies from different research groups. Markerless motion capture in our research can be further improved in future work. The acquisition of the appearance-based human body model from a few multi-view images currently relies heavily on manual interaction. Also, the global optimisation is one major bottleneck preventing markerless motion capture from real-time application. These open problems indicate the directions of our future research.

4 Available online via http://vision.cs.brown.edu/humaneva/baseline.html
Acknowledgement. The authors would like to thank National ICT Australia for its support and Leonid Sigal from Brown University for making the HumanEva dataset available.
References 1. Balan, A.O., Sigal, L., Black, M.J., Davis, J.E., Haussecker, H.W.: Detailed human shape and pose from images. In: CVPR (2007) 2. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. In: ACM SIGGRAPH, pp. 1–10 (2008) 3. Starck, J., Hilton, A.: Surface capture for performance based animation. IEEE Computer Graphics and Applications 27(3), 21–31 (2007) 4. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE TPAMI 16, 150–162 (1994) 5. Corazza, S., M¨ undermann, L., Gambaretto, E., Andriacchi, T.P.: Markerless motion capture through visual hull, articulated icp and subject specific model generation. International Journal of Computer Vision 87, 156–169 (2010) 6. Shannon, C.E.: A mathematical theory of communication. Bell Systems Technical Journal 27(4), 623–656 (1948); Continued 27(4), 623–656 (October 1948) 7. Magnenat-Thalmann, N., Laperri`ere, R., Thalmann, D.: Joint-dependent local deformations for hand animation and object grasping. In: Proceedings on Graphics Interface 1988. Canadian Information Processing Society, pp. 26–33 (1988) 8. Seitz, S.M., Dyer, C.R.: Photorealistic scene reconstruction by voxel coloring. In: CVPR, p. 1067. IEEE Computer Society, Los Alamitos (1997) 9. Besl, P.J., McKay, N.D.: A method for registration of 3-d shapes. IEEE TPAMI 14, 239–256 (1992) 10. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: CVPR, vol. 2, pp. 126–133 (2000) 11. Doucet, A., Godsill, S., Andrieu, C.: On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing 10, 197–208 (2000) 12. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983) 13. Sigal, L., Black, M.J.: Humaneva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical report, Brown University, Department of Computer Science (2006) 14. Viola, P., Wells III, W.M.: Alignment by maximization of mutual information. International Journal of Computer Vision 24, 137–154 (1997) 15. Gall, J., Rosenhahn, B., Brox, T., Seidel, H.P.: Optimization and filtering for human motion capture - a multi-layer framework. IJCV 87, 75–92 (2010) 16. Bandouch, J., Beetz, M.: Tracking humans interacting with the environment using efficient hierarchical sampling and layered observation models. In: IEEE ICCV Workshop on Human-Computer Interaction, HCI (2009) 17. Sigal, L., Balan, A., Black, M.: Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision (2010) 18. Cheng, S.Y., Trivedi, M.M.: Articulated human body pose inference from voxel data using a kinematically constrained gaussian mixture model. In: CVPR Workshop on EHUM2. IEEE, Los Alamitos (2007)
Temporal Feature Weighting for Prototype-Based Action Recognition

Thomas Mauthner, Peter M. Roth, and Horst Bischof

Institute for Computer Graphics and Vision, Graz University of Technology
{mauthner,pmroth,bischof}@icg.tugraz.at
Abstract. Prototype-based classification methods have recently become popular in action recognition. However, such methods, even though showing competitive classification results, are often limited by too simple and thus insufficient representations and require a long-term analysis. To compensate for these problems, we propose to use more sophisticated features and an efficient prototype-based representation allowing for a single-frame evaluation. In particular, we apply four feature cues in parallel (two for appearance and two for motion) and build a hierarchical k-means tree, where the obtained leaf nodes represent the prototypes. In addition, to increase the classification power, we introduce a temporal weighting scheme for the different information cues. Thus, in contrast to existing methods, which typically use global weighting strategies (i.e., the same weights are applied to all data), the weights are estimated separately for a specific point in time. We demonstrate our approach on standard benchmark datasets, showing excellent classification results. In particular, we give a detailed study of the applied features, the hierarchical tree representation, and the influence of temporal weighting, as well as a competitive comparison to existing state-of-the-art methods.
1  Introduction
Recently, human action recognition has been of growing interest in computer vision, with typical applications including visual surveillance, human–computer interaction, and monitoring systems for elderly people. Thus, a variety of approaches have been proposed, introducing new features, representations, or classification methods. Since actions can be described as chronological sequences, special attention has been paid to how to incorporate temporal information. In general, this can be realized either by keeping a very strict spatio-temporal relation on the features (e.g., by spatio-temporal volumes [1, 2] or descriptors [3–5]) or on the representation level [6, 7]. For such approaches the classification is typically performed on a single-frame basis, and the analysis of longer sequences is based on a simple majority voting or on averaging over multiple frames. If spatio-temporal consistency is ignored completely [8], only whole sequences can be analyzed. One effective way of directly describing temporal information that has shown great success in the past is the usage of prototypes, e.g., [6–9].
In general, prototype-based learning methods can be described by a prototype space X = {x_1, ..., x_n}, defined as a set of representative samples x_j describing the data (prototypes), and a distance function ρ [9]. In particular, for action recognition the data is split into a smaller set of reference templates referred to as prototypes [7], key-poses [8], or pose-primitives [6]. Weiland and Boyer [8] used foreground segmentations to create a set of silhouette exemplars, so-called key-poses, using a forward selection process. The final action description is obtained by comparing the occurrence frequencies of key-poses in a video. Although they presented excellent results, they completely neglected the temporal ordering of prototypes within a sequence and showed recognition results only on complete videos. Similarly, Elgammal et al. [9] modeled an action as a sequence of silhouette prototypes using an HMM to incorporate temporal constraints and to be more robust to small deviations. To incorporate temporal context in a prototype-based representation, Thurau and Hlavac [6] introduced n-gram models. They define sub-sequences of n frames and describe the transitions between prototypes by n-dimensional histograms. However, the number of samples required to fill the n-dimensional histograms is high, and the temporal ordering is very strict. Furthermore, the representation of n-grams becomes difficult if n > 3. Experimentally, they showed state-of-the-art results on sequences with lengths of around 30 frames. Since shape information clearly gives only limited information on a single-frame basis, Lin et al. [7] recently proposed to create prototypes in a joint shape–motion space using binary foreground silhouettes for shape and flow features as introduced in [10]. The prototypes are trained and represented efficiently using hierarchical k-means, leading to real-time evaluation performance. Finally, the temporal information is incorporated using Dynamic Time Warping (DTW). Although DTW is a powerful method for aligning temporal sequences, it only compares one sequence to another and cannot handle transitions between different actions. Again, only results at the sequence level are shown. Even though they show competitive recognition results, existing prototype-based action recognition methods are limited due to a required long-term analysis (i.e., of a whole sequence) and mainly rely on accurate segmentations, at least for learning the shape prototypes (e.g., [6–8]). In practice, however, perfect segmentations are often not available, and short and fast actions, such as in sports analysis, should be recognized. Hence, the goal for human action recognition should be to classify robustly on a short sequence length. Schindler and van Gool [5] showed that if a more sophisticated representation is used, human action recognition can also be performed from very short sequences (snippets). They use motion and appearance information in parallel, where both are processed in similar pipelines using scale and orientation filters. The obtained features are then learned using a Support Vector Machine. Their approach showed impressive results, reaching state-of-the-art performance even though only short sequences of 5–7 frames are used. Hence, the goal of this paper is to introduce an efficient action recognition approach working on short-frame level that takes advantage of prototype-based
representations such as fast evaluation, multi-class capability, and sequential information gain. In particular, we propose to use four information cues in parallel, two for appearance and two for motion, and to independently train a hierarchical k-means tree [11] for each of these cues. To further increase the classification power, we introduce a temporal weighting scheme derived from temporal co-occurrences of prototypes. Hence, in contrast to existing methods that use different cues (e.g., [5, 12]), we do not estimate global weights, which allows us to temporally adapt the importance of the used feature cues. Moreover, even when using temporal context, we can still run a frame-wise classification. The benefits of the proposed approach are demonstrated on standard benchmark datasets, where competitive results are shown both on a short-frame basis and when analyzing longer sequences. The remainder of the paper is organized as follows. First, in Section 2 we introduce our new action recognition approach consisting of an efficient prototype-based representation and a temporal feature weighting scheme. Next, in Section 3, we give a detailed analysis of our method and compare it to existing approaches on standard datasets. Finally, we summarize and conclude the paper in Section 4.
2  Temporal Action Learning
In the following, we introduce our new prototype-based action recognition approach, which is illustrated in Fig. 1. To gain different kinds of information, we apply four feature cues in parallel, two for appearance and two for motion (Section 2.1). For these cues we independently train hierarchical k-means trees [11], which provide several benefits such as very efficient frame-to-prototype matching and an inherent multi-class classification capability (Section 2.2). To incorporate temporal information, we further estimate temporal weights for the different feature cues. In particular, from temporal co-occurrences of prototypes we learn a temporal reliability measure providing an adaptive weight prior for the evaluation step (Section 2.3). In this way, during evaluation at a specific point in time the most valuable representations get higher temporal weights, increasing the overall classification power (Section 2.4).

2.1  Features
In contrast to existing prototype-based action recognition methods, which mainly use segmentation results to represent the data, we apply four more sophisticated feature cues in parallel, two describing the appearance and two describing the motion. In particular, for appearance these are the Histogram of Gradients (HOG) descriptor [13] and the Locally Binary Patterns (LBP) descriptor [14]. HOG estimates a robust local shape description using histogram binning over the gradient orientations and a local normalization, whereas LBPs, originally introduced for texture description, are valuable due to their invariance to monotonic gray-level changes and robustness to noise. Thus, both have shown to be
Fig. 1. Prototype-based action recognition: For each feature cue f a hierarchical k-means tree T f is estimated, where the leaf nodes of T f are treated as prototypes ϕ. In addition, to allow for a prototype-based classification, for each prototype ϕ the probabilities p(c|ϕ) are estimated, where c is the corresponding action class.
very valuable for human detection as well as for action recognition. To describe the motion information, we adapted both methods to describe a motion field obtained from an optical flow estimation: Histogram of Flow (HOF) and Locally Binary Flow Patterns (LBFP). In the following, we give the details of these descriptors given the image I_t ∈ R^{m×n} at time t.

HOG. As a first step in the HOG calculation, we have to estimate the gradients g_x(x, y) and g_y(x, y). For each position (x, y), the image I_t is filtered by the 1-dimensional masks [−1, 0, 1] in x and y direction [13]. Then, we calculate the magnitude m(x, y) and the signed orientation Θ_S(x, y):

m(x, y) = √(g_x(x, y)² + g_y(x, y)²),   (1)
Θ_S(x, y) = tan⁻¹(g_y(x, y) / g_x(x, y)).   (2)

To avoid problems due to intensity changes and to make the descriptor more robust, we transform the signed orientation Θ_S into an unsigned orientation:

Θ_U(x, y) = Θ_S(x, y) + π   if Θ_S(x, y) < 0,
Θ_U(x, y) = Θ_S(x, y)       otherwise.   (3)
To estimate the HOG descriptor, we divide the image It into non-overlapping cells of size 10 × 10. For each cell, the orientations ΘU are quantized into 9 bins and weighted by their magnitude m. Groups of 2 × 2 cells are combined in
overlapping blocks, and the histogram of each cell is normalized using the L2-norm of the block. The final descriptor is built by concatenation of all normalized blocks. For speed reasons we avoid the tri-linear interpolation.

HOF. In contrast to HOGs, which are estimated from only one frame, for estimating HOFs the motion has to be estimated from two frames. In particular, to estimate the dense optical flow field, we apply the method proposed in [15], which is publicly available via OFLib1. In fact, the GPU-based implementation allows a real-time computation of the features. Given I_t, I_{t−1}, the optical flow describes the shift from frame t−1 to t with the disparity D_t, where d_x(x, y) and d_y(x, y) denote the disparity components in x and y direction at location (x, y). We then compute the HOF descriptor similarly as described above by applying Eqs. (2) and (3). However, the gradients g_y(x, y) and g_x(x, y) are replaced by the disparity components d_y(x, y) and d_x(x, y). Moreover, to capture different motion directions for the same poses, we use the signed orientation Θ_S and quantize the orientation into 8 bins. The other parameters, such as the cell/block combination, are the same as described above.

LBP. An LBP pattern p is constructed by binarization of intensity differences between a center pixel and a number of n sampling points with radius r. The pattern p is assigned 1 if a sampling point has a higher intensity than the center pixel and 0 otherwise. The final pattern is formed by the 0−1 transitions of the sampling points in a given rotation order. To avoid ambiguities due to rotation and noise, we restrict the number of allowed 0−1 transitions to a maximum u, hence defining uniform patterns LBP^u_{n,r}. For our final description we build LBP^4_{8,1} pattern histograms for each cell and sum up the non-uniform patterns into one bin (see [14] for more details). To finally estimate the LBP descriptors, similar to [16], we keep the cell-based splitting of the HOGs and extract pattern histograms as described before for each cell.

LBFP. Motivated by the relation between HOG and HOF, we directly apply LBPs on the optical flow as well. Integrating direction information into LBP descriptors is quite difficult since, due to noise in the dense optical flow field, the orientation information is sometimes misleading. However, LBPs are known to be robust against such clutter, which also appears in texture, and therefore they are a reasonable choice for an additional, complementary flow feature. In particular, we keep the cell structure of the appearance LBP and compute the LBFP histograms on the flow magnitude m(x, y), which is computed using Eq. (1) from d_y(x, y), d_x(x, y). Although the description is slightly simpler compared to HOF, it is more robust in the presence of noise. In general, the same parametrization as for LBP is used. Please note that LBFP are not related to Local Trinity Patterns [17], which are computed on video volumes.
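To make the appearance and motion cues concrete, the following is a minimal sketch of the magnitude-weighted, per-cell orientation histogram underlying HOG (Eqs. 1–3); the HOF variant follows the same pattern with the flow components and signed orientations over 8 bins, and block normalisation is only indicated in the comment. Cell size and bin count follow the text; everything else is a simplified stand-in.

```python
import numpy as np

def hog_cells(img, cell=10, bins=9):
    """Unsigned-orientation cell histograms (Eqs. 1-3), weighted by the magnitude."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # [-1, 0, 1] filtering in x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # [-1, 0, 1] filtering in y

    mag = np.sqrt(gx ** 2 + gy ** 2)              # Eq. (1)
    theta = np.arctan2(gy, gx)                    # signed orientation, cf. Eq. (2)
    theta = np.where(theta < 0, theta + np.pi, theta)   # unsigned orientation, Eq. (3)

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    bin_idx = np.minimum((theta / np.pi * bins).astype(int), bins - 1)
    for i in range(rows):
        for j in range(cols):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)
    # blocks of 2x2 cells would then be L2-normalised and concatenated
    return hist
```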
2.2  Learning a Prototype-Based Representation
Given the feature descriptions discussed in Section 2.1, two main issues have to be considered with respect to a prototype-based representation. First, how to
1 http://gpu4vision.icg.tugraz.at/
select a representative set of prototypes. Second, if the number of prototypes increases, simple nearest-neighbor matching becomes infeasible and a more efficient method is required. We solve both problems by applying hierarchical k-means clustering, which is also known as a Vocabulary Tree [11]. Given the training set S, we first perform k-means clustering on all training samples s. According to the cluster indices, the data S is then split into subsets (branches of the vocabulary tree), and each subset is clustered again using k-means clustering. This process is repeated recursively until no samples are left in a branch of the tree or the maximum depth is reached. The thus obtained leaf nodes of the tree are then treated as prototypes ϕ. Hence, only two parameters are required: the split number k and a maximum hierarchy depth L, allowing a maximum number of k^L prototypes to be generated. During evaluation, a test sample is matched to a prototype by traversing down the tree, using depth-first search, until it reaches a leaf node. As illustrated in Fig. 1, we independently build a vocabulary tree T^f for each feature cue f. Thus, for each cue f we obtain prototypes ϕ^f_j, i.e., the leaf nodes of tree T^f, from which we can build the prototype sets Φ^f = {ϕ^f_1, ..., ϕ^f_N}. To enable a multi-class classification (i.e., one class per action), we have to estimate the class probability distribution p(c|ϕ) for all prototypes ϕ. Let S_c ⊂ S be the set of all samples belonging to class c and S_{ϕ,c} ⊂ S_c be the set of all samples belonging to class c that match the prototype ϕ. Then the probability that a sample s matching the prototype ϕ belongs to class c can be estimated as p(c|ϕ) = |S_{ϕ,c}| / |S_c|. If no samples from class c reached the prototype ϕ, i.e., |S_{ϕ,c}| = 0, the probability is set to p(c|ϕ) = 0. Illustrative classification results obtained in this way are shown in Fig. 2. The first row gives a color-coded view of different actions, whereas the second row visualizes (a) the corresponding prototypes and (b) the correct classifications. It can be seen that for correct classifications over time different prototypes are matched, leading to representative prototype sequences. This clearly shows that the variety in the data can be handled well by using our prototype-based description.
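As an illustration, a sketch of building such a vocabulary tree and the per-prototype class distributions p(c|ϕ) is given below; it uses scikit-learn's KMeans purely as a stand-in for the clustering step, so the tooling and parameter values are assumptions, not the configuration used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

class Node:
    def __init__(self):
        self.kmeans = None        # splitter for inner nodes
        self.children = []
        self.class_counts = None  # set for leaf nodes (prototypes)

def build_tree(X, y, n_classes, k=4, depth=0, max_depth=3):
    """Recursive hierarchical k-means; leaves store counts used for p(c | prototype)."""
    node = Node()
    if depth == max_depth or len(X) <= k:
        node.class_counts = np.bincount(y, minlength=n_classes)
        return node
    node.kmeans = KMeans(n_clusters=k, n_init=10).fit(X)
    for c in range(k):
        mask = node.kmeans.labels_ == c
        node.children.append(build_tree(X[mask], y[mask], n_classes,
                                        k, depth + 1, max_depth))
    return node

def class_distribution(node, x):
    """Traverse to a leaf (prototype) and return p(c | prototype) for sample x."""
    while node.kmeans is not None:
        node = node.children[int(node.kmeans.predict(x[None, :])[0])]
    counts = node.class_counts
    total = counts.sum()
    return counts / total if total > 0 else counts
```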
2.3  Learning Temporal Weights
However, from Fig. 2(b) it also can be recognized that the classification results for the single cues are very weak. Hence, the goal would be to combine these results to improve the classification results. The naive approach to fuse the results from different information cues would be to use majority voting or to estimate a mean over all decisions. Such approaches, however, totally neglect the information given by temporal constraints. Thus, in the following we introduce a more sophisticated information fusion strategy based on temporal weighting. The main idea is to exploit the knowledge which cues provided reliable results during training to assign temporal adaptive weights to the feature cues during evaluation.
Fig. 2. Single cue prototype-based classification: (a) sequences (color-coded prototype numbers) of matched prototypes for each feature cue and (b) classification results, where red indicates a correct classification (second row). The actions (first row) are color-coded in the range 1 − 10, respectively.
Given the prototype sets Φ^f, the key idea is to estimate the reliability of a feature cue for the prototype transitions ϕ^f_i → ϕ^f_j. This is similar to prototype frequencies or transition statistics as used in [6, 8], which, however, require long sequences to gather sufficient data to estimate discriminative votes. Instead, we consider these transitions only within a short time frame, introducing temporal bags, as illustrated in Fig. 3(a). A temporal bag2 b^t_{i,m} is defined as the set of m prototypes ϕ_j that followed the prototype ϕ_i at time t: b^t_{i,m} = {ϕ_{t+1}, ..., ϕ_{t+m}}. Once all bags b^t_{i,m} are estimated (i.e., for each occurrence of ϕ_i), they are combined into a global bag B_i = {b^1_{i,m}, ..., b^T_{i,m}}, where T is the number of temporal bags b^t_{i,m}. From B_i we can then estimate the temporal co-occurrences of ϕ_i and ϕ_j. In particular, we calculate a co-occurrence matrix C, where c_{i,j} accumulates all cases within B_i in which a prototype ϕ_i was followed by ϕ_j: c_{i,j} = Σ_{t=1}^{T} |ϕ_j ∈ b^t_{i,m}|. Having estimated the co-occurrence matrix C, we can now compute a temporal reliability measure w_{i,j}. Let n_{i,j} be the number of samples that were classified correctly by prototype ϕ_j ∈ B_i; then we set the reliability weight to w_{i,j} = n_{i,j} / c_{i,j}. This is illustrated in Fig. 3(a). The bag B_i contains 7 instances of ϕ_h and 8 instances of ϕ_j. While prototype ϕ_j classified all 8 frames correctly, ϕ_h provided the correct class for only two samples, yielding reliability weights of w_{i,h} = 2/7 and w_{i,j} = 1. If this procedure is repeated for all prototypes in all feature cues, it finally yields four co-occurrence matrices C^f and four reliability matrices W^f, which can then be used during the test stage as illustrated in Fig. 3(b).
² Since these calculations are performed for each cue f, we omit the superscript f from the notation in the following for readability.
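The training-time statistics of this section can be summarized in a few lines of code. The sketch below is illustrative only: it assumes, for one cue, the sequence of matched prototype indices per training frame and a flag indicating whether the prototype-based class vote was correct at that frame; the function and variable names are ours, not the authors'.

```python
import numpy as np

def reliability_weights(proto_seq, correct, n_proto, m=5):
    """
    Temporal-bag statistics for a single feature cue (sketch).
    proto_seq : matched prototype index per training frame
    correct   : True if the prototype-based class vote was correct at that frame
    Returns the co-occurrence matrix C and the reliability matrix W (w_ij = n_ij / c_ij).
    """
    C = np.zeros((n_proto, n_proto))
    N = np.zeros((n_proto, n_proto))
    for t in range(len(proto_seq) - 1):
        i = proto_seq[t]
        # temporal bag b^t_{i,m}: the m prototypes that follow phi_i at time t
        for s in range(t + 1, min(t + 1 + m, len(proto_seq))):
            j = proto_seq[s]
            C[i, j] += 1                  # phi_i was followed by phi_j
            if correct[s]:
                N[i, j] += 1              # ... and phi_j classified frame s correctly
    W = N / np.maximum(C, 1)              # reliability weights, 0 where never observed
    return C, W
```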
Fig. 3. Temporal weighting for feature cues: (a) during training the weights w_{i,j} for temporal co-occurrences of prototypes ϕ_i and ϕ_j of a feature cue are estimated; (b) during evaluation these weights are used to temporally change the importance of that feature cue
2.4
Recognition Using Temporal Weights
Once we have estimated the hierarchical trees T^f and the prototype reliability matrices W^f as introduced in Sections 2.2 and 2.3, we can perform action recognition using the following classification problem:

p(c|t) = Σ_{f=1}^{4} w_t^f p(c|ϕ_t^f) ,    (4)
where w_t^f is the weight of the feature cue f and ϕ_t^f the identified prototype for cue f at time t. The crucial step now is to estimate the weights w_t^f, which is illustrated in Fig. 3(b). For that purpose, we use the information given by the past, i.e., the identified prototypes per cue, to estimate temporal weights. In particular, considering a temporal bag of size m we collect the prototype transitions ϕ^f_i → ϕ^f_j, where i = t − m, . . . , t − 1 and j = t. Based on these transitions, we look up the m corresponding weights w_{i,j} in the reliability matrix W^f. Finally, the weight w_t^f is estimated by averaging these m weights w_{i,j} over the temporal bag. This recognition process is demonstrated in Fig. 4, where the first row illustrates three actions, the second row the identified prototypes, and the last row the corresponding weights. It can clearly be seen that the same action is characterized by different prototypes and also that the weights change over time.
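A hedged sketch of this recognition step follows: the per-cue weight w_t^f is obtained by averaging the learned reliability of the transitions from the prototypes seen in the last m frames to the currently matched prototype, and the class posteriors of the four cues are then fused as in Eq. (4). The variable names and the uniform fallback for the first frames are our own choices.

```python
import numpy as np

def classify_frame(t, history, matched, class_probs, W, m=5):
    """
    Fuse the cues with temporally adaptive weights (sketch of Eq. (4)).
    history[f]     : prototype indices matched by cue f at times 0..t-1
    matched[f]     : prototype index matched by cue f at the current time t
    class_probs[f] : per-prototype class distributions p(c | phi) for cue f
    W[f]           : reliability matrix of cue f learned as in Sect. 2.3
    """
    n_classes = class_probs[0].shape[1]
    p = np.zeros(n_classes)
    for f in range(len(W)):
        j = matched[f]
        past = history[f][max(0, t - m):t]             # temporal bag of the last m frames
        if len(past):
            w_t = np.mean([W[f][i, j] for i in past])  # average reliability of i -> j
        else:
            w_t = 1.0 / len(W)                         # no history yet: uniform weighting
        p += w_t * class_probs[f][j]                   # w_t^f * p(c | phi_t^f)
    return int(np.argmax(p)), p
```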
3
Experimental Results
In the following, we demonstrate our prototype-based action recognition approach, where we run several experiments on publicly available action recognition benchmark data sets, i.e., Weizmann and KTH. We first give a detailed analysis
Fig. 4. On-line adapted feature weights obtained from our temporal reliability measure: color-coded actions (first row), matched prototypes of each feature cue (second row), and estimated weights (third row)
Fig. 5. Examples from the Weizmann human action data set
of prototype-based action recognition and show that temporal information can be useful to improve the classification results. Then, we give a detailed comparison to recently published state-of-the-art action recognition approaches. In both cases, the given results were obtained by a leave-one-out cross-evaluation [18, 19] (i.e., we used all but one individual for training and evaluated the learned model on the missing one). 3.1
Benchmark Datasets
Weizmann Dataset. The Weizmann human action dataset [1] is a publicly available dataset that originally contains 81 low-resolution videos (180 × 144) of nine subjects performing nine different actions: running, jumping in place, bending, waving with one hand, jumping jack, jumping sideways, jumping forward, walking, and waving with two hands. Subsequently, a tenth action, jumping on one leg, was added [20]. Illustrative examples for each of these actions are shown in Figure 5. KTH Dataset. The KTH human action dataset, originally created by [3], consists of 600 videos (160 × 120), with 25 persons performing six human actions in four
Fig. 6. Examples from the KTH action data set
Fig. 7. Classification results on the Weizmann data set with different numbers of prototypes by varying parameters for hierarchical k-means tree and temporal bags: (a) single frame results and (b) bag averaged results
different scenarios: outdoors (s1 ), outdoors with scale variation (s2 ), outdoors with different clothes (s3 ), and indoors (s4 ). Illustrative examples for each of these actions are shown in Figure 6. 3.2
Analysis of Prototype-Based Learning
First of all, we give a detailed analysis of our proposed prototype-based action recognition approach, where we analyze the influence of the parameters of the hierarchical k-means tree and the bag size for the temporal weighting. For that purpose, we ran several experiments varying these parameters on the Weizmann data set. The corresponding results are given in Fig. 7. Three main trends can be recognized. First, increasing the temporal bag size, which was varied between 3 and 9, increases the classification accuracy. However, using a bag size greater than 5 has only little influence on the classification performance. Second, increasing the number of prototypes (using different tree parameters, i.e., split criteria and depth) increases the classification power. However, if the number of possible prototypes gets too large, i.e., too many leaf nodes are weakly populated, the classification power decreases; the optimum number is around 2^8. Third, it can be seen that using the proposed weighting scheme, the single-cue classification results as well as a naive combination can clearly be outperformed. In
addition, Fig. 7(b) shows that averaging the single cue classification results over the temporal bags almost reaches the classification result if the whole sequences are analyzed. Next, we compare different evaluation strategies in detail for the Weizmann as well as on the KTH data set: (a) single frame evaluation, (b) averaging the single frame results over temporal bags, (c) analyzing the whole sequence (using a majority voting). In addition, we show results (on single frame basis) without using the temporal weighting: (d) analyzing the best single feature cue and (e) a naive feature combination (majority voting). Based on the results in Fig. 7 for the remaining experiments we set the bag size to 5 and used a binary split criterion. The thus obtained results are summarized in Table 1 for the Weizmann data set and in Table 2 for the KTH data set.

Table 1. Overview of recognition results on the Weizmann-10 data set using 2-means clustering on a maximal depth of 8 and 12, respectively

        single frame   bag avg.   all video   best feature   comb. features   #proto.
2^8     92.4%          94.5%      97.8%       60.1%          84.6%            122
2^12    92.4%          94.2%      100.0%      70.2%          89.9%            614
From Table 1 the benefits of the proposed method can clearly be seen. In fact, considering a tree size of 2^8, the best single feature cue provides a classification result of approximately 60%. If the four cues are naively combined, the overall classification result can be improved to 85%. Using the proposed temporal weighting (with a bag size of 5), however, the classification rate improves by a further 7%, and by additionally averaging over the bag even by 9%. If the whole sequence is analyzed, we finally get a correct classification rate of 98%, which can further be improved to 100% if the tree depth is increased. The same trend can be recognized for the KTH data set in Table 2, where we split the results for the four sub-sets. In particular, there is a significant improvement using the proposed method compared to the best single feature cue and the naive combination. As can be seen, averaging over the whole bag improves the single-frame classification results only slightly; however, if the whole sequences are analyzed, a considerable improvement can still be recognized.

Table 2. Overview of recognition results on KTH data set using 2-means clustering on a maximal depth of 8
       single frame   bag avg.   all video   best feature   comb. features   #proto.
s1     92.7%          92.7%      97.3%       69.4%          90.1%            113
s2     89.1%          90.6%      94.7%       59.0%          82.0%            118
s3     93.4%          94.5%      98.7%       71.0%          88.0%            116
s4     91.6%          91.7%      98.7%       63.0%          91.9%            109
Table 3. Recognition rates and number of frames used for different approaches reported for the Weizmann data set. The best results are shown in bold-face, respectively.

(a) Weizmann-09
method                    rec.-rate   #frames
proposed                  94.9%       1/5
                          100.0%      all
Schindler & v. Gool [5]   93.5%       1/2
                          96.6%       3/3
                          99.6%       10/10
Blank et al. [21]         99.6%       all
Jhuang et al. [22]        98.8%       all

(b) Weizmann-10
method                    rec.-rate   #frames
proposed                  92.4%       1/5
                          94.2%       5/5
                          100.0%      all
Thurau & Hlaváč [6]       70.4%       1
                          94.4%       30/30
Gorelick et al. [20]      98.0%       all
Lin et al. [7]            100.0%      all
Fathi & Mori [23]         100.0%      all
Table 4. Recognition rates and number of required frames for different approaches reported for the KTH data set. The best results are shown in bold-face, respectively.

method                    s1      s2      s3      s4      average   #frames
proposed                  92.7%   89.1%   86.1%   91.3%   89.8%     1/5
                          92.6%   90.6%   94.5%   91.7%   92.4%     5/5
                          97.3%   94.7%   98.7%   98.7%   97.4%     all
Schindler & v. Gool [5]   90.9%   78.1%   88.5%   92.2%   87.4%     1/2
                          93.0%   81.1%   92.1%   96.7%   90.2%     7/7
Lin et al. [7]            98.8%   94.0%   94.8%   95.5%   95.8%     all (NN)
                          97.5%   86.2%   91.1%   90.3%   91.3%     all (proto.)
Yao and Zhu [24]          90.1%   84.5%   86.1%   91.3%   88.0%     all
Jhuang et al. [22]        96.0%   87.2%   91.7%   95.7%   92.7%     all

3.3 Comparison to State-of-the-Art
Next, we give a comparative study of our approach against state-of-the-art action recognition methods on the Weizmann and the KTH data set. Since different authors used different versions of the Weizmann data set, i.e., 9 vs. 10 actions, we split the Weizmann experiment into two parts. In particular, we compared our approach to Schindler & van Gool [5] and to Thurau & Hlaváč [6], which are the methods most similar to ours - also providing an analysis on a short-frame basis - and to recent methods reporting the highest classification results. The results obtained in this way are given in Tables 3–4. The best classification results when analyzing the whole sequence are set in boldface, respectively. From Table 3 it can be seen that we obtain competitive results on a short-frame basis as well as when analyzing the whole sequence. In fact, we obtain results comparable to Schindler & v. Gool and clearly outperform the approach of Thurau & Hlaváč on a short-frame basis. Moreover, when analyzing the whole sequence for both data sets we obtain classification results of 100%. Finally, we carried out the same experiments on the KTH data set, showing the results in Table 4. Again, it can be seen that we obtain competitive
results on a short-frame basis as well as when analyzing the whole sequence, even (significantly) outperforming most state-of-the-art methods on all four data sets. In particular, also on this data set we outperform the approach of Schindler & van Gool on a short-frame basis and yield the best overall performance for the full-sequence analysis.
4
Conclusion
In this paper, we addressed the problem of weighting different feature cues for action recognition. Existing approaches are typically limited by a fixed weighting scheme, where most of the benefits get lost. In contrast, we propose a temporal weighting, where the weights for the different feature cues can change over time, depending on the current input data. In particular, we use a prototype-based representation, which has previously been shown to provide excellent classification results. However, in contrast to existing methods using simple segmentation, we apply more sophisticated features to represent the data. We compute HOG and LBP descriptors to represent the appearance and HOF and LBFP descriptors to represent motion information. For each of these feature cues we estimate a prototype-based representation by using a hierarchical k-means tree, allowing for very efficient evaluation. These cues are evaluated using temporal weights, yielding improved performance, which was demonstrated on standard benchmark datasets, i.e., Weizmann and KTH. Acknowledgment. This work was supported by the Austrian Science Fund (FWF) under the project MASA (P22299), the Austrian Research Promotion Agency (FFG) under the project SECRET (821690), and the Austrian Security Research Programme KIRAS.
References
1. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: Proc. ICCV (2005)
2. Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: Proc. ICCV (2007)
3. Schueldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: Proc. ICPR (2004)
4. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Proc. PETS (2005)
5. Schindler, K., van Gool, L.: Action snippets: How many frames does human action recognition require? In: Proc. CVPR (2008)
6. Thurau, C., Hlaváč, V.: Pose primitive based human action recognition in videos or still images. In: Proc. CVPR (2008)
7. Lin, Z., Jiang, Z., Davis, L.S.: Recognizing actions by shape-motion prototype trees. In: Proc. ICCV (2009)
8. Weinland, D., Boyer, E.: Action recognition using exemplar-based embedding. In: Proc. CVPR (2008)
9. Elgammal, A., Shet, V., Yacoob, Y., Davis, L.S.: Learning dynamics for exemplar-based gesture recognition. In: Proc. CVPR, pp. 571–578 (2003)
10. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. ICCV (2003)
11. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Proc. CVPR (2006)
12. Ikizler, N., Cinbis, R., Duygulu, P.: Human action recognition with line and flow histograms. In: Proc. ICPR, pp. 1–4 (2008)
13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. CVPR (2005)
14. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29, 51–59 (1996)
15. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
16. Wang, H., Ullah, M.M., Laptev, A.K.I., Schmid, C.: An HOG-LBP human detector with partial occlusion handling. In: Proc. ICCV (2009)
17. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: Proc. ICCV (2009)
18. Thurau, C., Hlaváč, V.: Pose primitive based human action recognition in videos or still images. In: Proc. CVPR (2008)
19. Schindler, K., van Gool, L.: Action snippets: How many frames does human action recognition require? In: Proc. CVPR (2008)
20. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Trans. PAMI 29 (2007)
21. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: Proc. ICCV, pp. 1395–1402 (2005)
22. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: Proc. ICCV (2007)
23. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: Proc. CVPR (2008)
24. Yao, B., Zhu, S.C.: Learning deformable action templates from cluttered videos. In: Proc. ICCV (2009)
PTZ Camera Modeling and Panoramic View Generation via Focal Plane Mapping
Karthik Sankaranarayanan and James W. Davis
Dept. of Computer Science and Engineering, Ohio State University, Columbus, OH, USA
{sankaran,jwdavis}@cse.ohio-state.edu
Abstract. We present a novel technique to accurately map the complete field-of-coverage of a camera to its pan-tilt space in an efficient manner. This camera model enables mapping the coordinates of any (x, y) point in the camera’s current image to that point’s corresponding orientation in the camera’s pan-tilt space. The model is based on the elliptical locus of the projections of a fixed point on the original focal plane of a moving camera. The parametric location of this point along the ellipse defines the change in camera orientation. The efficiency of the model lies in the fast and automatic mapping technique. We employ the proposed model to generate panoramas and evaluate the mapping procedure with multiple PTZ surveillance cameras.
1
Introduction
Pan-tilt-zoom (PTZ) cameras are extensively used in wide-area surveillance applications. Tracking, monitoring, and activity analysis in such environments require continual coverage of the entire area by these cameras. While most computer vision algorithms such as object detection, tracking, etc. work on the Cartesian x-y space of images captured by these cameras, the movement and control of these cameras take place in an angular pan-tilt space. Therefore, in order to relay information across these spaces to comprehensively utilize the pan-tilt capabilities, a fast and accurate pixel to pan-tilt mapping model is required. For example, an active tracking system would require such a real-time mapping in order to continually track a target with a PTZ camera to keep the target centered in its view as it moves across the scene. Such a mapping would also need to be employed when constructing models for pedestrian activity patterns across an entire scene and not just in local views. In these applications, when processing images captured by the camera, there is a need to map the local image pixel coordinates to the corresponding pan-tilt orientations in world space. Although various complex camera models have been proposed for these tasks [1–5], we expand on the preliminary work of [6] and propose a real-time, reliable and completely automatic PTZ mapping model. We theoretically derive the minimum formulation necessary to map any arbitrary point on the focal plane
Fig. 1. High resolution spherical panorama generated using the proposed model showing complete field-of-coverage of a PTZ camera. Inset: A single view from the camera.
of a PTZ camera to its corresponding pan-tilt orientation, based on the current pan-tilt value of the camera. The principal idea that we exploit here is that the locus of a point on a fixed image plane, as the camera pans, forms an ellipse from which the pan and tilt can be recovered. Our formulation shows that an alternative (and simpler) linear field-of-view based mapping approach is not sufficient, while at the same time demonstrating that more complex methods are not required. Therefore, while the proposed model we derive is computationally simple, it is not simplistic in its functionality (as established in the experiments). Important contributions of the work here (beyond [6]) include 1) new extensions to the previous PT-only model to propose a complete pan-tilt-zoom mapping model, 2) studies and experiments analyzing the variations between focal length and zoom in the mapping, 3) quantitative experiments and discussion on the model (previous work was only qualitative), 4) new experiments demonstrating robustness with different brands of cameras (previous work was with one type of camera), and 5) new qualitative and quantitative experiments within a full pan-tilt-zoom active tracking application. Due to the nature of the proposed approach, extensive experiments are required and presented. Our resulting new system is also completely automatic (unlike the previous one), which is important for scalability in large distributed camera networks. The rest of the paper is organized as follows. In Sect. 2, we discuss previous approaches to the problem addressed. Section 3 describes the proposed model in detail and Sect. 4 presents the experiments. We discuss the contributions of this work in Sect. 5 and conclude with a summary in Sect. 6.
2
Related Work
Even though in commercially available cameras the pan-tilt motor center does not coincide with optical center of the camera, most existing camera models assume idealized mechanics where the optical center and geometric center of the camera are collocated, and the rotation axes are aligned [7–11]. A simple model
is also assumed in [12] and requires bundle adjustment to obtain global image alignment in the generated panoramas. In [2], a more complex model is proposed that employs a highly non-linear relationship between image coordinates and world coordinates, which is then solved by iterating until convergence. While [1] accounts for rotations around an arbitrary axis in space, in order to robustly calibrate all camera parameters it requires a controlled environment using LED marker data. Moreover, it employs an iterative minimization scheme that is computationally expensive. Similarly, [4] employs a search mechanism to minimize an SSD matching score function, which again is computationally expensive. In [3], a correction matrix is used to continuously update the camera parameters with the latest pan and tilt information as the camera moves. However, this information is not precise enough for pixel-level mapping. This paper extends the initial work introduced by [6] by proposing techniques for a fully automatic and complete pan-tilt-zoom mapping model. The main advantage and novelty of the proposed model is that it is independent of the aforementioned problems since it “simulates” a virtual camera located at the optical center. The pan-tilt space is oriented with respect to this virtual camera at the optical center, not the actual camera geometric center. Even though the geometric center is displaced, the variation/change of pan and tilt of this virtual camera at any instant is consistent with the pan and tilt of the actual camera. See supplemental slides. Since our model depends on changes in pan and tilt angles, the images could be considered to have been taken from this virtual camera. By presenting this compact model with thorough and robust results, we demonstrate that the excess computations in the previous work are unnecessary.
3
Proposed Camera Model
In this section, we describe the proposed mapping model (initial work is presented in [6]), techniques to automate the model parameter calculations, and an analysis of the model with variations in zoom. The main idea behind the mapping model is that the locus of a point along the principal axis on a fixed, infinitely-extended image plane, as the camera pans, is an ellipse. We wish to compute the changes in pan and tilt along this ellipse to map the image point (x, y) to its corresponding (pan, tilt) orientation. 3.1
X-Y to Pan-Tilt Mapping
The mapping model presented here is based on the camera geometry shown in Fig. 2. In Fig. 2(a), O represents the optical center of the camera whose principal axis (ray ON ) intersects the image plane I at its principal point R, at distance f (focal length). Notice that as the camera pans, the projections of successive principal points onto the original extended image plane I form the ellipse (e1 ) centered at C. This can also be seen as the bounded case of intersection of a conic by an oblique plane I. Similarly, the locus of an arbitrary point P on the
Fig. 2. Camera Geometry [6]. (a) Overall geometry of the camera model. (b) Calculation of change in pan δθ. (c) Calculation of minor axis b (c0 is a circle of radius b centered at C and parallel to the ground plane).
image plane I as the camera pans is also an ellipse (e2). We now wish to compute the change in pan (δθ) and the change in tilt (δφ) that would center the point P (after panning and tilting the camera), given the camera’s current (pan, tilt) location and P’s image coordinates (x, y). From Fig. 2(b), the desired change in pan δθ is ∠PMQ, which is the angle between planes MPC and MQC and is computed using

δθ = tan^−1 ( x / (y · sin φ + f · cos φ) )    (1)

Therefore, the change in pan is essentially a function of the image coordinates (x, y), the focal length (f), and the current tilt (φ) of the camera. Further, from Fig. 2(a), in order to calculate the change in tilt δφ, we need to compute ∠POR. In terms of the parametric equation of an ellipse, we can represent point E as (a cos(t), b sin(t)). Also note that since the two ellipses e1 and e2 are concentric, the values of this parameter t at E (for e1) and at P (for e2) are equal. If A and B are the major and minor axes of the outer ellipse e2, from Fig. 2(a) we see that
A · cos(t) = y + a    (2)
B · sin(t) = x    (3)

Using the relation A/B = a/b, we compute the value of t as

t = tan^−1 ( (a/b) · x / (y + a) )    (4)

Therefore, using Eqn.(2) and Eqn.(4), we get

A = (y + a) / cos ( tan^−1 ( (a/b) · x / (y + a) ) )    (5)

This gives the required angle ∠POR (δφ) as

δφ = tan^−1 ( (A − a) / f )    (6)
Again, we need only the (x, y), the tilt (φ), and the focal length f to compute the required change in tilt δφ. As seen from Eqn.(1) and Eqn.(6), the changes in pan and tilt depend on the current tilt (φ) and the pixel location (x, y). Therefore, a simple linear mapping of (x, y) to (θ, φ) based only on the proportional change in the sensor field-of-view is insufficient.
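A small sketch of this x-y to pan-tilt mapping (Eqs. (1)–(6)) is given below. The ellipse semi-axes a and b are treated as known inputs derived from the camera geometry of Fig. 2, angles are in radians, and arctan2 is used in place of tan^−1 for numerical robustness; this is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def xy_to_pan_tilt(x, y, f, tilt, a, b):
    """
    Change in pan/tilt that centers the image point (x, y).
    f    : focal length in pixels; tilt: current tilt angle (radians)
    a, b : semi-axes of the ellipse e1 on the extended image plane
           (obtained from the camera geometry of Fig. 2; treated as inputs here)
    """
    # Eq. (1): change in pan
    d_pan = np.arctan2(x, y * np.sin(tilt) + f * np.cos(tilt))
    # Eq. (4): ellipse parameter t, Eq. (5): major axis A of the outer ellipse e2
    t = np.arctan2((a / b) * x, y + a)
    A = (y + a) / np.cos(t)
    # Eq. (6): change in tilt
    d_tilt = np.arctan2(A - a, f)
    return d_pan, d_tilt
```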
3.2 Automatic Calculation of Focal Length
For this model, the focal length of the PTZ camera f (in pixels) is needed. Moving beyond the manual method of [6], we propose an automatic technique to calculate f based on SIFT keypoint matching across adjacent overlapping images and using RANSAC for elimination of outliers. We begin by pointing the camera to two locations which have overlapping coverage and whose pan-tilt locations are [(θ1 ,φ1 ) and (θ2 ,φ2 )]. We then match SIFT keypoints [13] in the image overlap area (see Fig. 3(a)). For each such matched keypoint in the first image, we have Eqn.(1) to compute its change in pan (δθ1 ) from θ1 (in terms of f ). For each corresponding matched keypoint in the second image, we again have Eqn.(1) to compute its change in pan (δθ2 ) from θ2 (also in terms of f ). Since we know the actual pan locations of the center of the two images (θ1 and θ2 ), we can enforce the constraint (with θ2 > θ1 ) that δθ1 + |δθ2 | = θ2 − θ1
(7)
This gives us an equation from which f can be solved. We repeat the above technique for other such pairs of images obtained by moving the camera to different overlapping pan-tilt locations. However, since the SIFT keypoint finding and matching may not be perfect, we use the well known RANSAC algorithm [14] to perform a robust estimation of the focal length and eliminate outliers using multiple matched keypoints in the images. This process allows us to automatically calculate a robust estimate of the focal length.
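This estimation can be prototyped as a one-dimensional root-finding problem per matched keypoint pair. The sketch below assumes the SIFT matching has already been done, uses SciPy's brentq for the root finding, and replaces the RANSAC step by a simple median over the per-match solutions; the bracket limits f_lo/f_hi and all names are our own choices.

```python
import numpy as np
from scipy.optimize import brentq

def estimate_focal_length(matches, view1, view2, f_lo=100.0, f_hi=5000.0):
    """
    matches : list of ((x1, y1), (x2, y2)) matched keypoint coordinates
    view1/2 : (pan, tilt) of the two overlapping views, in radians, pan2 > pan1
    A median over per-match solutions stands in for the RANSAC step of the paper.
    """
    (pan1, tilt1), (pan2, tilt2) = view1, view2
    delta = pan2 - pan1

    def d_pan(x, y, f, tilt):                          # Eq. (1)
        return np.arctan2(x, y * np.sin(tilt) + f * np.cos(tilt))

    def residual(f, p1, p2):                           # Eq. (7) as a root-finding problem
        return d_pan(p1[0], p1[1], f, tilt1) + abs(d_pan(p2[0], p2[1], f, tilt2)) - delta

    estimates = []
    for p1, p2 in matches:
        try:
            estimates.append(brentq(residual, f_lo, f_hi, args=(p1, p2)))
        except ValueError:                             # no sign change: discard this match
            pass
    return float(np.median(estimates)) if estimates else None
```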
Fig. 3. (a) SIFT keypoints matched across overlapping regions of an adjacent pair of images. (b) Variation of focal length with zoom for Sony and Pelco cameras.
3.3
Variation with Zoom
A key component of the mapping model is the PTZ camera’s focal length and in order to generalize the model to all zoom levels, we need to know the interrelationship between focal length (f ) and zoom (z). Therefore, we next model the effect of zoom and analyze the variation of f with z. We perform the automatic focal length calculation described in Sect. 3.2 at different zoom levels of the cameras. Here, we choose the range of zoom=1 (wide field-of-view) to zoom=12 as the upper limit of zoom=12 has been observed to be more than sufficient for standard surveillance applications. Once the focal length values for all the zoom levels are calculated, we compute a piecewise cubic fit to this data. Figure 3(b) shows the variation of focal length with zoom for two different brands of PTZ cameras. This function permits the use of an arbitrary zoom level in our model by mapping any zoom value to its corresponding focal length so that it can then be utilized in the (x, y) to (pan,tilt) mapping. To determine the smoothness and accuracy of the learned f -z function, we employ an object size preservation metric. A uniform colored rectangular object is placed in the center of the camera’s view and the object’s motion is restricted to be along the line-of-sight of the camera. Keeping the camera’s pan-tilt (θ,φ) fixed, the camera’s zoom is automatically adjusted so that the size of the object at any instant during its motion remains the same as in the first frame (w × h pixels). To do this, the object’s upper-right corner point (x,y) is detected at every frame (using for example, Canny edge detection) and the point’s difference in pan (δθ) from the home pan (θ) is calculated using Eqn.(1). Since the size of the object is to remain the same and the object is always centered, the locations of all of its corners should also remain constant (±w/2, ±h/2). Therefore, in order to scale the location of the upper-right corner point from (x,y) back to (w/2, h/2) by adjusting the zoom, the new focal length needed for the camera is calculated using the values δθ, φ, w/2, and h/2 in Eqn.(1) and then reformulated to solve for f . This focal length value is now mapped to a corresponding zoom level using the f -z function, thereby adjusting the zoom to this new value. To demonstrate this concept, Fig. 4 shows a red block remaining fairly constant in size as the block is moved forward and the zoom level is adjusted
(a) zoom=1 n = 235 (0.96%)
(b) zoom=2.72 n = 421 (1.73%)
(c) zoom=3.64 n = 156 (0.64%)
(d) zoom=4.95 n = 393 (1.61%)
Fig. 4. Size of the red block in the image (24, 400 pixels) remains almost constant due to the automatic zoom update. Note the increase in size of the background object.
automatically for the PTZ camera. At each of these levels the number of error pixels (n ) is obtained by finding the number of red pixels outside the target box and the number of non-red pixels inside the target box using color segmentation. Figure 4 also presents the low number of error pixels (and percentage error) at different zoom levels. Note that since the block is moved manually, there may be some minimal translation involved. The values obtained for the error pixels can be used to evaluate the precision of the mapping with variations in zoom for particular application needs.
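A minimal sketch of the f–z fit and its numerical inversion follows; the calibration numbers below are invented placeholders, and SciPy's CubicSpline stands in for the piecewise cubic fit described above.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative calibration data: focal length (pixels) measured at integer zoom
# levels with the procedure of Sect. 3.2 (these particular numbers are made up).
zoom_levels   = np.arange(1, 13)
focal_lengths = np.array([380, 560, 760, 990, 1250, 1540,
                          1870, 2240, 2660, 3130, 3650, 4230], dtype=float)

f_of_z = CubicSpline(zoom_levels, focal_lengths)       # piecewise cubic f-z fit

def zoom_for_focal_length(f_target, z_grid=np.linspace(1, 12, 1101)):
    """Invert the f-z curve numerically: pick the zoom whose focal length is closest."""
    return float(z_grid[np.argmin(np.abs(f_of_z(z_grid) - f_target))])

# e.g. for the object-size experiment: given the new focal length required by
# Eq. (1), move the camera to zoom_for_focal_length(new_f).
```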
4
Experiments
In order to evaluate the complete PTZ mapping model proposed in this work, we performed multiple experiments using synthetic data and real data captured from commercial off-the-shelf PTZ cameras. All the experiments covered in this section are important contributions of this work (beyond [6]) because they comprehensively tested the model both qualitatively and quantitatively in the following way. The first set of experiments examined the accuracy of the model by mapping individual pixels from images across the entire scene to their corresponding pan-tilt locations to generate spherical panoramas. We then performed experiments using synthetic data to quantify the accuracy of the mapping model. We next studied the pan-tilt deviations of the cameras and their effects on the mapping model. We then demonstrate the validity of the model across various zoom levels. The final experiment demonstrates the complete pan-tilt-zoom capabilities of the model within a real-time active tracking system. All the experiments involving real data were performed using two models of commercial security PTZ cameras: 1) Pelco Spectra III SE surveillance camera (analog) and 2) Sony IPELA SNC-RX550N camera (digital). The accuracy of pan-tilt recall for these cameras were evaluated by moving the camera from arbitrary locations to a fixed home pan-tilt location and then matching SIFT feature points across the collected home location images. The mean (μ) and standard deviation (σ) of the pan-tilt recall error (in pixels) measured for the Sony and Pelco cameras were (2.231, 0.325) and (2.615, 0.458) respectively. Similarly, we also calculated the mean and variance of the zoom recall error for the Pelco and
Sony cameras as (0.0954, 0.0143) and (0.0711, 0.0082) respectively, by changing the zoom from arbitrary zoom levels to a fixed home zoom level and then polling the camera’s zoom value (repeated for multiple home zoom levels). The radial distortion parameter (κ) values obtained for the Sony and Pelco cameras were 0.017 and 0.043 respectively, determined using the technique from [15] [16]. These values can be used by the reader to evaluate the quality of the cameras used in the experiments. 4.1
Panorama Generation
PTZ cameras generally have a large pan-tilt range (e.g., 360 × 90 degrees). By orienting the camera to an automatically calculated set of overlapping pan-tilt locations, we collected a set of images such that they cover the complete view space. The number of images required for complete coverage vary with the zoom level employed, a study of which follows later. We then took images which are pairwise adjacent from this set and used them to automatically calculate a robust estimate of the focal length f (using the technique described in Sect. 3.2). Next, we mapped each pixel of every image to its corresponding pan-tilt orientation in the 360 × 90 degree space (at a resolution of 0.1 degree). The pan-tilt RGB data were plotted on a polar coordinate system, where the radius varies linearly with tilt and the sweep angle represents the pan. Bilinear interpolation was used to fill any missing pixels in the final panorama. The result is a spherical (linear fisheye) panorama which displays the entire pan-tilt coverage of the camera in a single image.
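An illustrative (and deliberately unoptimized) sketch of this accumulation step is shown below: each pixel of each captured view is pushed through an x-y to pan-tilt mapping, passed in as a callable assumed to return offsets in degrees, and written into a 360 × 90 degree grid; bilinear interpolation of missing cells and the polar display are omitted.

```python
import numpy as np

def build_panorama(images, orientations, f, xy_to_pan_tilt, res=0.1):
    """
    images       : list of H x W x 3 arrays captured at the given orientations
    orientations : list of (pan, tilt) in degrees for each image
    xy_to_pan_tilt(x, y, f, tilt_deg) -> (d_pan_deg, d_tilt_deg)  (assumed contract)
    """
    pano = np.zeros((int(90 / res), int(360 / res), 3))
    for img, (pan0, tilt0) in zip(images, orientations):
        h, w, _ = img.shape
        for v in range(h):
            for u in range(w):
                x, y = u - w / 2.0, h / 2.0 - v         # image-centered coordinates
                d_pan, d_tilt = xy_to_pan_tilt(x, y, f, tilt0)
                pan  = (pan0 + d_pan) % 360.0
                tilt = np.clip(tilt0 + d_tilt, 0.0, 90.0 - res)
                pano[int(tilt / res), int(pan / res)] = img[v, u]
    return pano
```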
(a) Ground truth
(b) Constructed panorama
Fig. 5. SIFT keypoint matching quantifying the accuracy of the mapping model
To quantitatively evaluate the accuracy of the above technique, we performed an experiment using a virtual environment synthetically generated in Maya. The ground truth data (Fig. 5(a)) was obtained by setting up a camera with a wide field-of-view1 directly above the environment so that it captures the entire field-of-coverage. We then generated a panorama (Fig. 5(b)) using the synthetic 1
Maya camera parameters: Film Gate-Imax, Film aspect ratio-1.33, Lens squeeze ratio-1, camera scale-1, focal length-2.5.
Fig. 6. Panoramas for different shifts show illumination differences before intrinsic image generation
data captured from different pan-tilt locations using our technique. By using SIFT to match keypoints between the ground truth image and the generated panorama, we calculated the distance between the matched locations for the top 100 keypoints. The mean and standard deviation values were obtained as 7.573 pixels and 0.862 pixels respectively. For images of size 900×900, we believe these values are reasonably small and within the error range of SIFT matching since the interpolation does not reproduce the fine textures very accurately in the scale space. These low values for the matching distance statistics quantitatively demonstrate the accuracy of the panorama generation technique. While performing the above experiments using real cameras, the automatic gain control (AGC) on these cameras could not be fully disabled as the camera moved to new locations. Consequently there were illumination gain differences among the different component images captured across the scene and these differences were reflected in the panorama. Therefore, we used a technique of deriving intrinsic images [17] and computed a reflectance-only panoramic image to overcome this problem. We generated multiple panoramas of the same scene (see Fig. 6) by shifting the pre-configured pan and tilt coverage locations in multiples of 5 and 1 degree intervals respectively. These small shifts in pan and tilt displace the location of the illumination seams in the panoramas across multiple passes, thus simulating the appearance of varying shadows. We then separated the luminance and chrominance channels from the RGB panoramas by changing the color space from RGB to YIQ. After running the intrinsic image algorithm on the Y (luminance) channel alone, we combined the result with the mean I and Q channels of the panorama set, and converted back from YIQ to the RGB color space for display. As shown in Fig. 7(d), the result provides a more uniform panorama as compared to Fig. 6. The above process was tested on multiple Pelco cameras mounted at varying heights (on 2, 4, and 8 story buildings) and a Sony camera. The final results after intrinsic image generation are shown in Fig. 7, demonstrating the applicability of the model to real surveillance cameras at different heights. See high resolution version of Fig. 7(a) in supplementary material.
Fig. 7. Panoramas from Pelco (a-c) and Sony (d) cameras mounted on buildings at different heights
Fig. 8. Panoramas with simulated pan-tilt error values of (a) 1.5 and (b) 2.5 degrees. Note the alignment problems in the panorama with an error of 2.5 degrees.
4.2
Pan-Tilt Recall
In order to study the effect of inaccuracies in the pan-tilt motor on the proposed model, we simulated varying degrees of error in the data capturing process and then compared the resulting panoramas. This was performed by introducing either a positive or negative (randomly determined) fixed shift/error in the pan and tilt values at each of the pre-configured pan-tilt locations and then capturing the images. This data was then used to generate a spherical panorama using the above-mentioned technique. We repeated this process to generate panoramas for different values of simulated pan-tilt shift/error in the range 0.1 to 5 degrees in steps of 0.1 degrees. Figure 8 shows the results of this experiment for error values of 1.5 and 2.5 degrees (before intrinsic analysis). The alignment errors for panoramas with error values less than 1.5 degrees were visually negligible, but were exacerbated for errors greater than this value. 4.3
Validity Across Zoom Levels
The next set of experiments were aimed towards verifying that the model works at varying zoom levels. We performed the panorama generation process at different zoom levels in the range of zoom=1 to zoom=12. At each zoom level,
(a) 76 images
(b) 482 images
(c) 1681 images
(d) 2528 images
Fig. 9. (a)-(d) Panoramas (pre-intrinsic) generated at zoom levels 1, 5, 10, and 12 respectively with their corresponding number of component images
the appropriate focal length was chosen by employing the f -z mapping function learned using the technique described in Sect. 3.3. As the zoom level increases, the field-of-view of each component image reduces and consequently the number of component images required to cover the complete pan-tilt space increases. Figure 9 shows representative panoramas (before intrinsic analysis) generated at zoom levels 1, 5, 10, and 12 for the Sony camera (and presents the number of component images needed to construct the panoramas). In each of the panoramas, it was observed that the small component images registered correctly and no noticeable misalignments were seen with any change in zoom. This demonstrates that the (x, y) to (pan, tilt) mapping model works irrespective of the zoom level employed. 4.4
Use in Tracking with Active PTZ Cameras
Combining the pan, tilt, and zoom mapping capabilities of the proposed model, we developed a real-time active camera tracking application with a wide-area PTZ surveillance camera. The active camera system continually follows the target as it moves throughout the scene by controlling the pan-tilt of the camera to keep the target centered in its view. In addition, the camera’s zoom is continually adjusted so that the target being tracked is always of constant size irrespective of target’s distance from the camera. We used the appearance-based tracking algorithm from [18] to demonstrate the applicability of our PTZ model. Note that our model is indifferent to the tracker itself and various other frame-to-frame tracking algorithms could be employed to handle strong clutter and occlusion situations if needed. In the system, we model the target using a covariance of features and matched it in successive frames to track the target. To build the appearance-based model of the target, we selected a vector of position, color, and gradient features fk = [x y R G B Ix Iy ]. The target window was represented as a covariance matrix of these features. Since in our application, we desire to track targets over long distances across the scene, the appearance of targets undergo considerable change. To adapt to this and to overcome noise, a model update method from [18] was used.
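The covariance representation used here can be sketched as follows; this is a generic region-covariance implementation under our own simplifications (mean-gray gradients, and a log-based dissimilarity standing in for the exact metric of [18]), not the authors' code.

```python
import numpy as np

def covariance_descriptor(patch):
    """
    Region covariance of a target window: for each pixel of the H x W x 3 RGB
    patch build f_k = [x, y, R, G, B, Ix, Iy] and return the 7 x 7 covariance.
    """
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gray = patch.mean(axis=2)
    Iy, Ix = np.gradient(gray)                          # simple image gradients
    feats = np.stack([xs, ys,
                      patch[..., 0], patch[..., 1], patch[..., 2],
                      Ix, Iy], axis=-1).reshape(-1, 7)
    return np.cov(feats, rowvar=False)                  # 7 x 7 covariance matrix

def covariance_distance(C1, C2, eps=1e-6):
    """Log-based dissimilarity between two covariance descriptors (a
    simplification of the generalized-eigenvalue metric used in [18])."""
    lam = np.linalg.eigvals(np.linalg.solve(C1 + eps * np.eye(7), C2 + eps * np.eye(7)))
    return float(np.sqrt(np.sum(np.log(np.abs(lam) + eps) ** 2)))
```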
zoom=2   zoom=2.81   zoom=3.64   zoom=4.82   zoom=5.12
zoom=3   zoom=3.72   zoom=4.94   zoom=5.81   zoom=6.77
Fig. 10. Active camera tracking results from two sequences
The tracker is initialized manually by placing a box on the torso of the target. As the target moves, the best match (xmatch , ymatch ) is found by the tracker in successive frames and its distance dmatch is stored. In this system, we restricted our search for the best match to a local search window of 40 × 40 (in frames of size 320 × 240). Using the proposed camera model, the (xmatch , ymatch ) coordinates of the best matching patch are converted to the change in pan and tilt angles required to re-center the camera to the target. The tracker then checks to determine if an update in zoom is necessary. To do so, it checks the distances of a larger patch and a smaller patch at (xmatch , ymatch ). If either of these distances are found to be better than dmatch , the zoom is adjusted to the new size using the same technique as described in the red block experiment in Sect. 4.4. While moving the camera, the system polls the motor to see if it has reached the desired pan-tilt location. Once it has reached this position, the next frame is grabbed and the best match is again found. This process is repeated to continually track the target. Figure 10 shows a few frames from two sequences with a target being automatically tracked using the proposed technique. In the first sequence, the target walks a total distance of approximately 400 feet away from the camera and the zoom level varies from 2 (in the first frame) to 5.12 (in the last frame). In the second sequence, the target covers a distance of around 650 feet along a curved path and the camera’s zoom varies in the range from 3 to 6.77 during this period. See supplemental tracking movie. The ground truth locations of the target were obtained by manually marking the left, right, top, and bottom extents of the target’s torso and calculating its center in each frame. By comparing these with the center of the tracking box (of size 25 × 50), the tracking error statistics (mean, standard deviation) in pixels were obtained as (2.77, 1.50) and (3.23, 1.63) respectively for the two sequences. In addition, the accuracy of the zoom update function was evaluated by comparing the ground truth height of the target’s torso with the height of the tracking box. The error in this case was obtained to be 6.23% and 8.29% for the two sequences. This application demonstrates a
comprehensive exploitation of the pan-tilt-zoom capabilities of the camera using the proposed model. Again, other trackers could also be employed with our PTZ model for different tracking scenarios/requirements.
5
Key Contributions and Advantages
To summarize the key contributions of this work, we presented 1) a theoretical derivation of a complete pan-tilt-zoom mapping model for real-time mapping of image x-y to pan-tilt orientations, 2) a SIFT-based technique to automatically learn the focal length for commercial-grade PTZ cameras, 3) an analysis of the variations between focal length and zoom in the model along with its evaluation using an object size preservation metric, 4) quantitative experiments and discussion on the PTZ mapping model with panorama generation, 5) experiments demonstrating robustness with different brands of cameras, and 6) qualitative and quantitative experiments within an example PTZ active tracking application. The proposed camera model is a straight-forward and closed-form technique. Since previous work in this area have been based on more complex methods (explained in Sect. 2), by showing our technique with thorough and robust results, we demonstrate that any additional modeling and computation (as shown in previous work) is in fact not necessary for the demonstrated capabilities. We believe our simplification compared to other over-complicated methods is a strength, as this makes our technique more practically applicable to a wide variety of real-time problems.
6
Conclusion and Future Work
We proposed a novel camera model to map the x-y focal plane of a PTZ camera to its pan-tilt space. The model is based on the observation that the locus of a point on a fixed image plane, as the camera pans, is an ellipse using which we solve for the desired change in pan and tilt angles needed to center the point. We proposed techniques to automatically calculate the focal length, analyzed the variation of focal length with zoom, tested the model by generating accurate panoramas, and validated its usability at varying zoom levels. In future work, we plan to utilize the model to map the camera’s pan-tilt-zoom space to groundplane information for registration of multiple cameras. Acknowledgement. This research was supported in part by the National Science Foundation under grant No. 0236653.
References
1. Davis, J., Chen, X.: Calibrating pan-tilt cameras in wide-area surveillance networks. In: Proc. ICCV (2003)
2. Jain, A., Kopell, D., Wang, Y.: Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention. In: Proc. IEEE CVPR (2006)
3. Bernardin, K., van de Camp, F., Stiefelhagen, R.: Automatic person detection and tracking using fuzzy controlled active cameras. In: Proc. IEEE CVPR (2007)
4. Jethwa, M., Zisserman, A., Fitzgibbon, A.: Real-time panoramic mosaics and augmented reality. In: Proc. BMVC (1998)
5. Tordoff, B., Murray, D.: Reactive control of zoom while fixating using perspective and affine cameras. IEEE TPAMI 26, 98–112 (2004)
6. Sankaranarayanan, K., Davis, J.: An efficient active camera model for video surveillance. In: Proc. WACV (2008)
7. Basu, A., Ravi, K.: Active camera calibration using pan, tilt and roll. IEEE Trans. Sys., Man and Cyber. 27, 559–566 (1997)
8. Collins, R., Tsin, Y.: Calibration of an outdoor active camera system. In: Proc. IEEE CVPR, vol. I, pp. 528–534 (1999)
9. Fry, S., Bichsel, M., Muller, P., Robert, D.: Tracking of flying insects using pan-tilt cameras. Journal of Neuroscience Methods 101, 59–67 (2000)
10. Woo, D., Capson, D.: 3D visual tracking using a network of low-cost pan/tilt cameras. In: Canadian Conf. on Elec. and Comp. Engineering, vol. 2, pp. 884–889 (2000)
11. Barreto, J., Araujo, H.: A general framework for the selection of world coordinate systems in perspective and catadioptric imaging applications. IJCV 57 (2004)
12. Sinha, S., Pollefeys, M.: Pan-tilt-zoom camera calibration and high-resolution mosaic generation. Comp. Vis. and Image Understanding 103, 170–183 (2006)
13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
14. Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting. Communications of the ACM Archive 24, 381–395 (1981)
15. Tordoff, B., Murray, D.: The impact of radial distortion on the self-calibration of rotating cameras. Comp. Vis. and Image Understanding
16. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) ISBN: 0521623049
17. Weiss, Y.: Deriving intrinsic images from image sequences. In: Proc. ICCV (2001)
18. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on means on Riemannian manifolds. In: Proc. IEEE CVPR (2006)
Horror Image Recognition Based on Emotional Attention
Bing Li¹, Weiming Hu¹, Weihua Xiong², Ou Wu¹, and Wei Li¹
¹ National Lab of Pattern Recognition, Institute of Automation, CAS, Beijing, China
² OmniVision Technologies, Sunnyvale, CA, USA
[email protected]
Abstract. Along with the ever-growing Web, people benefit more and more from sharing information. Meanwhile, harmful and illegal content, such as pornography, violence, and horror, permeates the Web. Horror images, whose threat to children’s health is no less than that of pornographic content, are currently neglected by existing Web filtering tools. This paper focuses on horror image recognition, which may further be applied to Web horror content filtering. The contributions of this paper are two-fold. First, the emotional attention mechanism is introduced into our work to detect the emotionally salient region in an image, and a top-down emotional saliency computation model is proposed based on color emotion and color harmony theories. Second, we present an Attention based Bag-of-Words (ABoW) framework for representing an image’s emotion by combining the emotional saliency computation model and the Bag-of-Words model. Based on ABoW, a horror image recognition algorithm is presented. Experimental results on diverse real images collected from the Internet show that the proposed emotional saliency model and horror image recognition algorithm are effective.
1
Introduction
In the past decades, the explosive growth of on-line information and resources has given us the privilege of sharing pictures, images, and videos from geographically disparate locations very conveniently. Meanwhile, more and more harmful and illegal content, such as pornography, violence, horror, and terrorism, permeates the Web across the world. Therefore, it is necessary to make use of a filtering system to prevent computer users, especially children, from accessing inappropriate information. Recently, a variety of horror material has become a big issue, as it enters children’s daily life easily. A number of psychological and physiological studies indicate that too many horror images or words can seriously affect children’s health. Rachman’s research [1] shows that horror information is one of the most important factors for phobias. To address this problem, many governments have also taken a series of measures to keep horror films away from children. Over the last years, a number of high-quality algorithms have been developed for Web content filtering and the state of the art is improving rapidly. Unfortunately,
in contrast to either pornographic content filtering [2][3][4] or violent content filtering [5], there has been comparatively little work done on horror information filtering. In this paper, we aim to rectify this imbalance by proposing a horror image recognition algorithm, which is the key ingredient of horror content filtering. Horror image recognition can benefit from image emotion analysis, to which many researchers have recently been paying attention. For example, Yanulevskaya et al. [6] classify images into 10 emotional categories based on holistic texture features. Solli et al. [7] propose a color-based Bag-of-Emotions model for image retrieval from an emotional viewpoint. Following this line, we also introduce an emotional aspect into horror image recognition. In particular, we present a novel emotional saliency map based on emotional attention analysis, which is then fed into a modified Bag-of-Words model, the Attention based Bag-of-Words (ABoW) model, to determine an image’s emotional content. Experiments on a real image database show that our method can efficiently recognize horror images.
2
Image and Emotion Introduction
Images not only provide visual information but also evoke emotional perceptions, such as excitement, disgust, and fear. Horror invariably arises when one experiences an awful realization or a deeply unpleasant occurrence [8]. Published research in experimental psychology [9] shows that emotions can be measured and quantified, and that common emotional response patterns exist among people with widely varying cultural and educational backgrounds. Therefore, it is possible to establish a relationship between an image and the emotions it evokes. Generally, a horror image includes two parts: the region with highly emotional stimuli, named the emotional salient region (ESR), and a certain background region. The feeling of horror depends on the interaction between them. Therefore, accurate detection and separation of the emotional salient region is important for horror image recognition. Intuitively, a visual saliency scheme may be a proper tool for this task. Unfortunately, existing visual saliency algorithms are not well suited to this task. To this end, we introduce the emotional attention mechanism into our work and propose an emotional saliency map method.
3
Emotional Attention
It is well known that the primate visual system employs an attention mechanism to limit processing to important information which is currently relevant to behaviors or visual tasks. Many psychophysical and neurophysiological experiments show that there exist two ways to direct visual attention: one uses bottom-up visual information, including color, orientation, and other low-level features; the other uses information relevant to the current visual behavior or task [10]. Recent psychophysical research further reveals that visual attention can be modulated and improved by the affective significance of stimuli [11]. When simultaneously facing emotionally positive, emotionally negative, and nonemotional
pictures, human automatic attention is always captured in the following order: negative, both negative and positive, and nonemotional stimuli. Therefore, modulatory effects implement specialized mechanisms of “emotional attention” that might not only supplement but also compete with other sources of top-down control on attention. 3.1
Emotional Saliency Map Framework
Motivated by the emotional attention mechanism, we propose a new saliency map of an image, called the ‘emotional saliency map’, which differs from the traditional visual saliency map in three major aspects: (1) it is based on high-level emotional features, such as color emotion and color harmony, which contain more emotional semantics [12][13]; (2) it takes both contrast information and the isolated emotional saliency property into account, which is inspired by psychological experiments indicating that isolated pixels with various colors have different saliency to the human visual system; and (3) it uses both contrast and color harmony values to represent the relationship between pixels. The emotional saliency map computation framework is given in Fig. 1, and we discuss its implementation in more detail in the next section. 3.2
Emotional Saliency Map Implementation
Color Emotion based Saliency Map. As a high-level perception, emotion is evoked by many low-level visual features, such as color, texture, shape, etc. However, how these features affect human emotional perception is a difficult issue in current research. Color emotion refers to the emotion caused by a single color and has long been of interest to both artists and scientists. Recent research of Ou et al gives a 3D computational color emotion model, each dimension representing color activity (CA), color weight (CW ), and color heat (CH) respectively.
Fig. 1. Emotional saliency map computation Framework
The transformation equations between color space and color emotion space are defined as follows [12]:

CA = −2.1 + 0.06 [ (L* − 50)² + (a* − 3)² + ((b* − 17)/1.4)² ]^{1/2}
CW = −1.8 + 0.04 (100 − L*) + 0.45 cos(h − 100°)
CH = −0.5 + 0.02 (C*)^{1.07} cos(h − 50°)    (1)
where (L*, a*, b*) and (L*, C*, h*) are the color values in the CIELAB and CIELCH color spaces, respectively. Given a pixel I(x, y) at coordinates (x, y) of image I, we transform its RGB color to the CIELAB and CIELCH color spaces and then apply Eq. (1); its color emotion value is denoted as CE(x, y) = [CA(x, y), CW(x, y), CH(x, y)]. We define a single pixel’s emotional saliency EI as:

EI(x, y) = [CA(x, y) − minCA]² + [CH(x, y) − minCH]²    (2)

where minCA and minCH are the minimal color activity value and the minimal color heat value in the image, respectively. This definition is based on the observation that pixels with high color activity (CA) and color heat (CH) are more salient, especially in horror images, while color weight (CW) has less effect on emotional saliency. Center-surround contrast is also a very important operator in saliency models. Considering that the contrast in uniform image regions is close to 0, we propose a multi-scale center-surround operation for color emotion contrast computation:
L l=1
Θl (CE(x, y)),
Θl (CE(x, y)) =
x+d
y+d
i=x−d, j=y−d
(3) |CEl (x, y) − CEl (i, j)|
where Θ is center-surround difference operator that computes the color emotion difference between the central pixel (x, y) and its neighbor pixels inside a d × d window in lth scale image from L different levels in a Gaussian image pyramid. EC(x, y) is the multi-scale center-surround color emotion difference which combines the output of Θ operations. In this paper, we always set d = 3 and L = 5. Another factor is emotional difference between the local and global EL(x, y), as defined in Eq(4), where CE is the average color emotion of the whole image. (4) EL(x, y) = CE(x, y) − CE Finally, we define the color emotion based emotional saliency map CS(x, y) as Eq.(5), where N orm() is the Min-Max Normalization operation in an image. CS(x, y) =
1 [N orm(EI(x, y)) + N orm(EC(x, y)) + N orm(EL(x, y))] 3
(5)
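To make Eqs. (1)-(5) concrete, the following Python/NumPy sketch computes a color-emotion-based saliency map. It is only an illustrative sketch under stated assumptions: the use of scikit-image for the CIELAB conversion and resizing, the handling of Eq. (4) as a Euclidean norm over the color-emotion vector, the wrap-around border handling, and all function names are our own choices and are not taken from the paper.

import numpy as np
from skimage import color, transform

def color_emotion(lab):
    # Eq (1): color activity CA, color weight CW and color heat CH from CIELAB/LCH values.
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    C = np.hypot(a, b)                      # chroma C*
    h = np.degrees(np.arctan2(b, a))        # hue angle h
    CA = -2.1 + 0.06 * np.sqrt((L - 50)**2 + (a - 3)**2 + ((b - 17) / 1.4)**2)
    CW = -1.8 + 0.04 * (100 - L) + 0.45 * np.cos(np.radians(h - 100))
    CH = -0.5 + 0.02 * C**1.07 * np.cos(np.radians(h - 50))
    return np.stack([CA, CW, CH], axis=-1)

def center_surround(ce, d=3):
    # Theta_l of Eq (3): sum of absolute color-emotion differences inside a d x d window
    # (np.roll wraps around at the image border, which is a simplification).
    out = np.zeros(ce.shape[:2])
    r = d // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.abs(ce - np.roll(ce, (dy, dx), axis=(0, 1))).sum(axis=-1)
    return out

def norm01(x):
    # Min-max normalization Norm() used in Eq (5).
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def color_emotion_saliency(rgb, d=3, levels=5):
    # rgb: float image in [0, 1]; returns the map CS(x, y) of Eq (5).
    ce = color_emotion(color.rgb2lab(rgb))
    h, w = ce.shape[:2]
    CA, CH = ce[..., 0], ce[..., 2]
    EI = (CA - CA.min())**2 + (CH - CH.min())**2                    # Eq (2)
    EC = np.zeros((h, w))                                           # Eq (3) over pyramid scales
    for l in range(levels):
        s = 2**l
        small = transform.resize(ce, (max(h // s, 1), max(w // s, 1), 3), anti_aliasing=True)
        EC += transform.resize(center_surround(small, d), (h, w))
    EL = np.linalg.norm(ce - ce.mean(axis=(0, 1)), axis=-1)         # Eq (4)
    return (norm01(EI) + norm01(EC) + norm01(EL)) / 3.0             # Eq (5)

The per-pixel map returned by color_emotion_saliency corresponds to CS(x, y); the harmony-based map HS(x, y) described next would be fused with it as in Eq. (11).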
Color Harmony based Saliency Map. Judd et al. [13] pointed out: "When two or more colors seen in neighboring areas produce a pleasing effect, they are said to produce a color harmony." Simply speaking, color harmony refers to the emotion caused by a color combination. A computational model HA(P1, P2) that quantifies the harmony between two colors P1, P2 was proposed by Ou et al. [13]:

HA(P1, P2) = HC(P1, P2) + HL(P1, P2) + HH(P1, P2)    (6)
where HC(), HL(), and HH() are the chromatic harmony component, the lightness harmony component, and the hue harmony component, respectively. Since the equations for the three components are rather complex, we skip their details due to space limitations; the reader is referred to [13]. HA(P1, P2) is the harmony score of two colors: the higher HA(P1, P2) is, the more harmonious the two-color combination is. Similarly, we define a multi-scale center-surround color harmony:

HC(x, y) = Σ_{l=1}^{L} Ψ_l(x, y),    Ψ_l(x, y) = Σ_{i=x−d}^{x+d} Σ_{j=y−d}^{y+d} HA(P_l(x, y), P_l(i, j))    (7)
where Ψ is a center-surround color harmony operator that computes the color harmony between the central pixel's color P(x, y) and its neighboring pixels' colors P(i, j) in a d × d window of the l-th scale image, taken from L levels of a Gaussian image pyramid. Next, we define the color harmony score between a pixel and the global image as:

HL(x, y) = HA(P(x, y), P̄)    (8)

where P̄ is the average color of the whole image. A high color harmony score indicates a calm or tranquil emotion [14]; in other words, the higher the color harmony score is, the weaker the emotional response is. Therefore we can transform the color harmony score into an emotional saliency value using a monotonically decreasing function. Because the range of the color harmony function HA(P1, P2) is [−1.24, 1.39] [13], a linear transformation function T(v) is given as:

T(v) = (1.39 − v)/2.63    (9)
Finally, the color harmony based emotional saliency is computed as:

HS(x, y) = (1/2) [T(HC(x, y)) + T(HL(x, y))]    (10)
Emotional Saliency Map. The color emotion based and color harmony based emotional saliency maps are linearly fused into the final emotional saliency map:

ES(x, y) = λ × HS(x, y) + (1 − λ) × CS(x, y)    (11)

where λ is the fusing weight and ES(x, y) indicates the emotional saliency of the pixel at location (x, y). The higher the emotional saliency is, the more likely this pixel is to capture emotional attention.
4 Horror Image Recognition Based on Emotional Attention
Bag-of-Words (BoW) is widely used for visual categorization. In the BoW model, all the patches in an image are represented by visual words with the same weights. The limitation of this scheme is that the large number of background image patches seriously confuses the image's semantic representation. To address this problem, we extend BoW to integrate the emotional saliency map and call the result Attention based Bag-of-Words (ABoW). The framework is shown in Fig. 2.
Fig. 2. Framework for Attention based Bag-of-Emotions
4.1 Attention based Bag-of-Words
In the ABoW model, two emotional vocabularies, a Horror Vocabulary (HV) and a Universal Vocabulary (UV), together with their histograms, are defined to emphasize the horror emotion character. The former is for image patches in the ESR while the latter is for the background. The combination of these two histograms is used to represent the image's emotion. The histogram for each vocabulary is defined as:

Hist(w_m, VOC) = Σ_{k=1}^{M} p(I_k ∈ w_m) / Σ_{k=1}^{M} Σ_{m=1}^{|VOC|} p(I_k ∈ w_m)    (12)
where VOC ∈ {HV, UV} and w_m is the m-th emotional word in VOC. p(I_k ∈ w_m) is the probability that the image patch I_k belongs to the word w_m. Because the words in the two vocabularies are independent, p(I_k ∈ w_m) can be defined as

p(I_k ∈ w_m) = p(I_k ∈ VOC) p(I_k ∈ w_m | VOC)    (13)
Here p(I_k ∈ VOC) is the probability of selecting VOC to represent I_k. Because the image patches in the ESR are expected to be represented by the HV histogram, we assume that

p(I_k ∈ HV) = p(I_k ∈ ESR) ∝ ES_k;    p(I_k ∈ UV) = p(I_k ∉ HV)    (14)
where ES_k is the average emotional saliency of image patch I_k, computed by the emotional saliency computation model. The higher ES_k is, the more likely it is that I_k belongs to the ESR. Consequently, Eq. (13) can be rewritten as:

p(I_k ∈ w_m) = { ES_k × p(I_k ∈ hw_m | HV),        for horror words w_m = hw_m
              { (1 − ES_k) × p(I_k ∈ uw_m | UV),   for universal words w_m = uw_m    (15)

where hw_m and uw_m are the emotional words in HV and UV, respectively. The conditional probability p(I_k ∈ w_m | VOC) is defined by a multivariate Gaussian function:

p(I_k ∈ w_m | VOC) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp{ −(1/2) (F_k − c_m)^T Σ^(−1) (F_k − c_m) }    (16)

where F_k is a D-dimensional feature vector of I_k, c_m is the feature vector of the emotional word w_m, Σ is the covariance matrix, and |·| denotes the determinant. The nearer F_k is to c_m, the higher the corresponding probability p(I_k ∈ w_m | VOC).
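The soft-assignment step of Eqs. (12)-(16) can be sketched in Python/NumPy as follows. The isotropic covariance, the small epsilon terms, and all function and variable names are assumptions made for illustration; the paper itself uses a general covariance matrix Σ.

import numpy as np

def word_probs(F, centers, var=1.0):
    # p(I_k in w_m | VOC), Eq (16), here with an isotropic covariance var * I.
    d2 = ((F[:, None, :] - centers[None, :, :])**2).sum(-1)      # squared distances, K x M
    D = F.shape[1]
    return np.exp(-0.5 * d2 / var) / (2 * np.pi * var)**(D / 2)

def abow_histogram(F, ES, hv_centers, uv_centers):
    # Attention-based Bag-of-Words representation, Eqs (12)-(15).
    # F: K x D patch features, ES: K patch saliencies in [0, 1].
    p_hv = ES[:, None] * word_probs(F, hv_centers)               # horror words, Eq (15)
    p_uv = (1.0 - ES)[:, None] * word_probs(F, uv_centers)       # universal words, Eq (15)
    hist_hv = p_hv.sum(0) / (p_hv.sum() + 1e-12)                 # Eq (12) for HV
    hist_uv = p_uv.sum(0) / (p_uv.sum() + 1e-12)                 # Eq (12) for UV
    return np.concatenate([hist_hv, hist_uv])                    # image representation of Sect. 4.3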
4.2 Feature Extraction
In the ABoW framework, feature extraction for each image patch is another key step. To bridge the gap between low-level image features and high-level image emotions, we select three types of features for each image patch: Color, Color Emotion, and Weibull Texture.
Color Feature: We consider the image color information in HSV space. Two feature sets are defined as [f_1^k, f_2^k, f_3^k] = [H_k, S_k, V_k] and [f_4^k, f_5^k, f_6^k] = [H_k − H̄, S_k − S̄, V_k − V̄]. The former is the average over all pixels in the patch, and the latter is the difference between the average value of the patch and that of the whole image.
Texture Feature: Geusebroek et al. [15] report a six-stimulus basis for stochastic texture perception. Fragmentation of a scene by a chaotic process causes the spatial scene statistics to conform to a Weibull distribution:

wb(x) = (γ/β) (x/β)^(γ−1) e^(−(x/β)^γ)    (17)
The parameters of the distribution can completely characterize the spatial structure of the texture. The contrast of an image can be represented by the width
of the distribution β, and the grain size is given by γ, the peakedness of the distribution. So [f_7^k, f_8^k] = [γ_k, β_k] represents the local texture feature of the patch I_k, and the texture difference between I_k and the whole image is [f_9^k, f_10^k] = [|γ_k − γ̄|, |β_k − β̄|].
Emotion Feature: The emotion feature is also extracted based on the color emotion and color harmony theories. The average color emotion of image patch I_k and the difference between I_k and the whole image are [f_11^k, f_12^k, f_13^k] = [CA_k, CW_k, CH_k] and [f_14^k, f_15^k, f_16^k] = [|CA_k − CA̅|, |CW_k − CW̅|, |CH_k − CH̅|]. We use the color harmony score between the average color P_k of I_k and the average color P̄ of the whole image to indicate the harmony relationship between them: [f_17^k] = [HA(P_k, P̄)].
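A possible per-patch feature extraction, following Section 4.2, is sketched below. Fitting the Weibull distribution of Eq. (17) to gradient magnitudes with SciPy is an assumption (the paper does not state which image quantity is fitted), and all helper names are hypothetical.

import numpy as np
from scipy.stats import weibull_min

def weibull_params(patch_gray):
    # Eq (17): estimate the grain-size parameter gamma and the contrast parameter beta.
    gy, gx = np.gradient(patch_gray.astype(float))
    mag = np.hypot(gx, gy).ravel() + 1e-6            # fitted quantity: gradient magnitudes (assumption)
    gamma, _, beta = weibull_min.fit(mag, floc=0)    # shape, (fixed) location, scale
    return gamma, beta

def patch_features(hsv_patch, hsv_img, gamma, beta, gamma_img, beta_img,
                   ce_patch, ce_img, harmony_score):
    # Assemble the 17-dimensional descriptor [f1 ... f17] of Sect. 4.2; the *_patch
    # arguments are patch averages and the *_img arguments whole-image averages.
    hsv_patch, hsv_img = np.asarray(hsv_patch, float), np.asarray(hsv_img, float)
    ce_patch, ce_img = np.asarray(ce_patch, float), np.asarray(ce_img, float)
    f_color = np.concatenate([hsv_patch, hsv_patch - hsv_img])                         # f1-f6
    f_texture = np.array([gamma, beta, abs(gamma - gamma_img), abs(beta - beta_img)])  # f7-f10
    f_emotion = np.concatenate([ce_patch, np.abs(ce_patch - ce_img), [harmony_score]]) # f11-f17
    return np.concatenate([f_color, f_texture, f_emotion])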
4.3 Emotional Histogram Construction
The feature vectors are quantized into visual words for the two vocabularies, HV and UV, independently, using the k-means clustering technique. For HV, horror words are derived from the manually annotated regions of interest in each horror training image. All image patches of both horror and non-horror images in the training set are used to construct the words of UV. Each image is then represented by combining the horror word histogram Hist(w_m, HV) and the universal emotional word histogram Hist(w_m, UV) into [Hist(w_m, HV), Hist(w_m, UV)], and a Support Vector Machine (SVM) is applied as the classifier to determine whether a given image is a horror one or not.
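A hedged sketch of this vocabulary construction and classification step, using scikit-learn, could look as follows. The 200-word vocabularies match the setting reported in Section 5.3, while the RBF kernel and the remaining SVM parameters are assumptions, since the paper does not specify them.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabularies(salient_horror_feats, all_train_feats, n_words=200, seed=0):
    # HV from features of manually annotated salient regions of horror training images,
    # UV from all training patches (horror and non-horror), each obtained via k-means.
    hv = KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(salient_horror_feats)
    uv = KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(all_train_feats)
    return hv.cluster_centers_, uv.cluster_centers_

def train_classifier(histograms, labels):
    # SVM on the concatenated [Hist(HV), Hist(UV)] image representation.
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(np.asarray(histograms), np.asarray(labels))
    return clf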
5 Experiments
To evaluate the performance of the proposed scheme, we conducted two separate experiments: one on emotional attention and the other on horror image recognition.
5.1 Data Set Preparation and Error Measure
Because there is no large open data set for horror image recognition, we collected and created one. A large number of candidate horror images were collected from the Internet. Then 7 Ph.D. students in our lab were asked to label each of them with one of three categories: Non-horror, A little horror, and Horror. They were also required to draw a bounding box around the most emotionally salient region (according to their understanding of saliency) of each image. The provided annotations are used to create a saliency map S = {s_X | s_X ∈ [0, 1]} as follows:

s_X = (1/N) Σ_{n=1}^{N} a_X^n    (18)
where N is the number of labeling users and a_X^n ∈ {0, 1} is a binary label given by the n-th user indicating whether or not the pixel X belongs to the salient region. We
select 500 horror images from these candidates to create the horror image set, each labeled as 'Horror' by at least 4 users. On the other hand, we also collect 500 non-horror images with different scenes, objects, or emotions. Specifically, the non-horror images include 50 indoor images, 50 outdoor images, 50 human images, 50 animal images, 50 plant images, and 250 images with different emotions (adorable, amusing, boring, exciting, irritating, pleasing, scary, and surprising) downloaded from the image retrieval system ALIPR (http://alipr.com/). Finally, the 500 horror images, along with their emotional saliency maps S, and the 500 non-horror images are put together to construct the horror image recognition image set (HRIS) used in the following experiments. Given the ground truth annotation s_X and the emotional saliency value e_X obtained with the different algorithms, the precision, recall, and Fα measure defined in Eq. (19) are used to evaluate the performance. As in much previous work [16], α is set to 0.5 here.

Precision = Σ_X s_X e_X / Σ_X e_X;    Recall = Σ_X s_X e_X / Σ_X s_X
Fα = (1 + α) × Precision × Recall / (α × Precision + Recall)    (19)
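The evaluation measure of Eq. (19) is straightforward to implement; a small sketch (with a small epsilon added to avoid division by zero, which is our own choice) is given below.

import numpy as np

def saliency_prf(s, e, alpha=0.5):
    # Precision, Recall and F_alpha of Eq (19) for a ground-truth map s and an estimated map e.
    s, e = np.asarray(s, float), np.asarray(e, float)
    precision = (s * e).sum() / (e.sum() + 1e-12)
    recall = (s * e).sum() / (s.sum() + 1e-12)
    f_alpha = (1 + alpha) * precision * recall / (alpha * precision + recall + 1e-12)
    return precision, recall, f_alpha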
5.2 Experiments for Emotional Saliency Computation
To test the performance of the emotional saliency computation, we first need to select the optimal fusing weight λ in Eq. (11). The F0.5 curve as a function of λ on the 500 horror images is shown in Fig. 3(A). It shows that the best F0.5 value, 0.723, is achieved when λ = 0.2, so λ is always set to 0.2 in the following. From the optimal λ selection we can see that using only the color emotion based saliency map (λ = 0) is better than using only the color harmony based saliency map (λ = 1), but the color harmony based saliency map still improves the color emotion based saliency map; both are important for the final emotional saliency map. We also compare our emotional saliency computation method with existing visual saliency algorithms: Itti's method (ITTI) [17], Hou's method (HOU) [18], the graph-based visual saliency algorithm (GBVS) [19], and the frequency-tuned salient region detection algorithm (FS) [20].
Fig. 3. (A) F0.5 change curve as a function of λ. (B) Comparison with other methods.
The precision (Pre), recall (Rec), and F0.5 values for each method are calculated and shown in Fig. 3(B). In addition, in order to validate the effectiveness of color emotion, we also apply our saliency framework in RGB space, denoted as 'Our(RGB)' in Fig. 3. The outputs of our method are Pre = 0.732, Rec = 0.748, and F0.5 = 0.737, showing that it outperforms all the other visual saliency computation methods. This is because those visual saliency computations are based solely on local feature contrast, without considering any pixel's own salient property or any emotional factors, so none of them can detect emotional saliency accurately. In contrast, our solution adds emotional factors that integrate the single pixel's saliency value, which improves performance. In addition, the proposed method also outperforms Our(RGB), which indicates that the color emotion and color harmony theories are effective for emotional saliency detection.
5.3 Horror Image Recognition
In this experiment, an image is divided into 16 × 16 image patches with an overlap of 8 pixels. We have discussed many features in Section 4.2; now we need to find the feature or feature combination that best fits horror image recognition. We divide these features into 3 subsets: Color Feature ([f1 − f6]), Texture Feature ([f7 − f10]), and Emotion Feature ([f11 − f17]). Seven different selections (or combinations) are tried based on the ABoW model. In order to simplify the parameter selection, we set the word numbers in HV and UV equal, NH = NU = 200. We equally divide the image set into two subsets A and B; each subset includes 250 horror images and 250 non-horror images. We then use A for training and B for testing and vice versa, and the combined results of the two experiments are used as the final performance. The Precision, Recall, and F1 values, which are widely used for visual categorization evaluation, are adopted to evaluate the performance of each feature combination. Different from the computation of Precision and Recall in the emotional saliency evaluation, the annotation label for horror image recognition is binary: '1' for horror images and '0' for non-horror images. The experimental results are shown in Table 1.

Table 1. Comparison with different feature combinations based on ABoW

Feature                      Prec    Rec     F1
Color                        0.700   0.742   0.720
Texture                      0.706   0.666   0.685
Emotion                      0.755   0.742   0.748
Color + Texture              0.735   0.772   0.753
Color + Emotion              0.746   0.780   0.763
Texture + Emotion            0.769   0.744   0.756
Color + Texture + Emotion    0.768   0.770   0.769
Table 2. Comparison with other emotional categorization methods

Method                                   Pre     Rec     F1
EVC                                      0.760   0.622   0.684
BoE                                      0.741   0.600   0.663
Proposed (Without Emotional Saliency)    0.706   0.698   0.702
Proposed (With Emotional Saliency)       0.768   0.770   0.769
Table 1 shows that the Emotion feature outperforms the other single features, meaning that the color emotion and color harmony features are more useful for horror image recognition. Regarding combined features, the combination of the color, texture, and emotion features performs best, with F1 = 0.769. In order to check the effectiveness of the emotional saliency map, we also construct the combined histogram without emotional saliency for horror image recognition, in which the emotional saliency value of each image patch is set to 0.5; that is, the weights for the horror vocabulary and the universal vocabulary are equal. The same feature combination (Color + Texture + Emotion) is used for horror image recognition both with and without emotional saliency. In addition, we also compare our method with the Emotional Valence Categorization (EVC) algorithm proposed by Yanulevskaya et al. [6] and the color based Bag-of-Emotions (BoE) model [7]. The experimental results are shown in Table 2. The results show that the proposed horror image recognition algorithm based on emotional attention outperforms all the other methods. The comparison with and without emotional saliency shows that the emotional saliency is effective for horror image recognition. The potential reason is that EVC and BoE describe an image's emotion only from a global perspective, whereas the horror emotion is usually generated by local image regions. This again shows that the emotional saliency is useful for finding the horror regions of images.
6 Conclusion
This paper has proposed a novel horror image recognition algorithm based on the emotional attention mechanism. In summary, we draw the following conclusions: (1) the paper has presented a new and promising research topic, Web horror image filtering, which is a supplement to existing Web content filtering; (2) the emotional attention mechanism, an emotion-driven attention, has been introduced and an initial emotional saliency computation model has been proposed; (3) we have presented an Attention based Bag-of-Words (ABoW) framework, in which the combined histogram of the horror vocabulary and the universal vocabulary is used for horror image recognition. Experiments on real images have shown that our methods can recognize horror images effectively and can serve as a new basis for horror content filtering in the future.

Acknowledgement. This work is partly supported by the National Natural Science Foundation of China (No. 61005030, 60825204, and 60935002), the China Postdoctoral Science Foundation, and the K. C. Wong Education Foundation, Hong Kong.
References

1. Rachman, S.: The conditioning theory of fear acquisition. Behaviour Research and Therapy 15, 375–378 (1997)
2. Forsyth, D., Fleck, M.: Automatic detection of human nudes. IJCV 32, 63–77 (1999)
3. Hammami, M.: Webguard: A web filtering engine combining textual, structural, and visual content-based analysis. IEEE T KDE 18, 272–284 (2006)
4. Hu, W., Wu, O., Chen, Z.: Recognition of pornographic web pages by classifying texts and images. IEEE T PAMI 29, 1019–1034 (2007)
5. Guermazi, R., Hammami, M., Ben Hamadou, A.: WebAngels filter: A violent web filtering engine using textual and structural content-based analysis. In: Perner, P. (ed.) ICDM 2008. LNCS (LNAI), vol. 5077, pp. 268–282. Springer, Heidelberg (2008)
6. Yanulevskaya, V., van Gemert, J.C., Roth, K.: Emotional valence categorization using holistic image features. In: Proc. of ICIP, pp. 101–104 (2008)
7. Solli, M., Lenz, R.: Color based bags-of-emotions. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702, pp. 573–580. Springer, Heidelberg (2009)
8. http://en.wikipedia.org/wiki/Horror_and_terror
9. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International affective picture system (IAPS): Technical manual and affective ratings. Tech. Rep., Centre for Research in Psychophysiology, Gainesville (1999)
10. Sun, Y., Fisher, R.: Object-based visual attention for computer vision. AI 146, 77–123 (2003)
11. Vuilleumier, P.: How brains beware: neural mechanisms of emotional attention. Trends in Cognitive Sciences 9, 585–594 (2005)
12. Ou, L., Luo, M., Woodcock, A., Wright, A.: A study of colour emotion and colour preference. Part I: Colour emotions for single colours. Color Res. & App. 29, 232–240 (2004)
13. Ou, L., Luo, M.: A colour harmony model for two-colour combinations. Color Res. & App. 31, 191–204 (2006)
14. Fedorovskaya, E., Neustaedter, C., Hao, W.: Image harmony for consumer images. In: Proc. of ICIP, pp. 121–124 (2008)
15. Geusebroek, J., Smeulders, A.: A six-stimulus theory for stochastic texture. IJCV 62
16. Liu, T., Sun, J., Zheng, N.N., Tang, X., Shum, H.Y.: Learning to detect a salient object. In: Proc. of CVPR, pp. 1–8 (2007)
17. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE T PAMI 20, 1254–1259 (1998)
18. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach. In: Proc. of CVPR, pp. 1–8 (2007)
19. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Proc. of NIPS, pp. 545–552 (2006)
20. Achanta, R., Hemami, S., Estrada, F., Süsstrunk, S.: Frequency-tuned salient region detection. In: Proc. of CVPR, pp. 1597–1604 (2009)
Spatial-Temporal Affinity Propagation for Feature Clustering with Application to Traffic Video Analysis

Jun Yang(1,2), Yang Wang(1,2), Arcot Sowmya(1), Jie Xu(1,2), Zhidong Li(1,2), and Bang Zhang(1,2)

(1) School of Computer Science and Engineering, The University of New South Wales
(2) National ICT Australia
{jun.yang,yang.wang,jie.xu,zhidong.li,bang.zhang}@nicta.com.au,
[email protected] Abstract. In this paper, we propose STAP (Spatial-Temporal Affinity Propagation), an extension of the Affinity Propagation algorithm for feature points clustering, by incorporating temporal consistency of the clustering configurations between consecutive frames. By extending AP to the temporal domain, STAP successfully models the smooth-motion assumption in object detection and tracking. Our experiments on applications in traffic video analysis demonstrate the effectiveness and efficiency of the proposed method and its advantages over existing approaches.
1 Introduction
Feature points clustering has been exploited extensively in object detection and tracking, and some promising results have been achieved [1–10]. There are roughly two classes of approaches: one is from a top-down graph-partitioning perspective [1, 2, 6], the other features a bottom-up multi-level hierarchical clustering scheme [4, 5, 7, 8]. Du and Piater [3] instantiate and initialize clusters using the EM algorithm, followed by merging overlapping clusters and splitting spatially disjoint ones. Kanhere and Birchfield [9] find stable features heuristically and then group unstable features by a region growing algorithm. Yang et al. [10] pose the feature clustering problem as a general MAP problem and find the optimal solution using MCMC. Very few existing methods consider spatial and temporal constraints simultaneously in a unified framework. Affinity Propagation [11] is a recently proposed data clustering method which simultaneously considers all data points as potential exemplars (cluster centers), and recursively exchanges real-valued messages between data points until a good set of exemplars and corresponding clusters gradually emerges. In this paper, we extend the original Affinity Propagation algorithm for feature points clustering by incorporating temporal clustering consistency constraints into the message-passing procedure, and apply the proposed method to the analysis of real-world traffic videos. The paper is organized as follows. First we revisit the Affinity Propagation algorithm in section 2, and then propose the Spatial-Temporal Affinity
Propagation algorithm for feature points clustering in section 3. We give experimental results on traffic video analysis in section 4 and conclude the paper in section 5.
2 Affinity Propagation
Affinity Propagation [11] is essentially a special case of the loopy belief propagation algorithm, specifically designed for data clustering. Given n data points and a set of real-valued similarities s(i, k) between data points, Affinity Propagation simultaneously considers all data points as potential exemplars (cluster centers), and works by recursively exchanging real-valued messages between data points until a good set of exemplars and corresponding clusters gradually emerges. The similarity s(i, k), i ≠ k, indicates how well data point k is suited to be the exemplar for data point i, while s(k, k) is referred to as the "preference"; data points with higher preference values are more likely to be chosen as exemplars. The preference thus serves as a parameter for controlling the number of identified clusters.

Algorithm 1. Affinity Propagation
Input: similarity values s(i, k), maxits, convits, λ
Output: exemplar index vector c
  iter ← 0, conv ← 0, a ← 0, r ← 0, c ← −∞
  repeat
    iter ← iter + 1, oldc ← c, oldr ← r, olda ← a
    forall i, k do r(i, k) ← s(i, k) − max_{k': k' ≠ k} [a(i, k') + s(i, k')]
    r ← (1 − λ) * r + λ * oldr
    for i ≠ k do
      a(i, k) ← min{0, r(k, k) + Σ_{i': i' ∉ {i,k}} max(0, r(i', k))}
    forall k do a(k, k) ← Σ_{i': i' ≠ k} max(0, r(i', k))
    a ← (1 − λ) * a + λ * olda
    foreach k do if r(k, k) + a(k, k) > 0 then c_k ← k
    foreach k do if c_k ≠ k then c_k ← arg max_{i: c_i = i} s(k, i)
    if c = oldc then conv ← conv + 1 else conv ← 1
  until iter > maxits or conv > convits

If all data points are equally likely to be exemplars a priori, the
preferences should be set to a common value, usually the median of the input similarities. Two types of messages are exchanged between data points, each representing a different kind of competition. Point i sends point k the "responsibility" r(i, k), which reflects the accumulated evidence for the suitability of point i choosing point k as its exemplar, considering potential exemplars other than k. Point k sends point i the "availability" a(i, k), which reflects the accumulated evidence for the suitability of point k serving as the exemplar for point i, considering the support from other points that choose point k as their exemplar. To avoid numerical oscillations, messages are usually damped based on their values from the current and previous iterations. At any iteration of the message-passing procedure, the positivity of the sum r(k, k) + a(k, k) can be used to identify exemplars, and the corresponding clusters can be calculated. The algorithm terminates after a predefined number of iterations, or if the decided clusters have stayed constant for some number of iterations. A summary of the Affinity Propagation algorithm is given in Algorithm 1.
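For reference, a vectorised NumPy sketch of the message updates of Algorithm 1 is given below. It follows the damping and termination rules described above; the exemplar-assignment shortcut at the end of each iteration and all identifier names are implementation choices, not prescribed by the paper.

import numpy as np

def affinity_propagation(S, maxits=1000, convits=10, lam=0.7):
    # S: n x n similarity matrix, with the preferences s(k, k) on the diagonal.
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    c_old, conv = None, 0
    for _ in range(maxits):
        # responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx].copy()
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = (1 - lam) * R_new + lam * R                      # damping
        # availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))                     # keep r(k,k) itself in the column sum
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = np.diag(A_new).copy()                           # a(k,k) = sum_{i' != k} max(0, r(i',k))
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)
        A = (1 - lam) * A_new + lam * A                      # damping
        # exemplar identification and cluster assignment
        exemplars = np.flatnonzero(np.diag(A + R) > 0)
        if exemplars.size == 0:
            c = np.arange(n)
        else:
            c = exemplars[np.argmax(S[:, exemplars], axis=1)]
            c[exemplars] = exemplars
        conv = conv + 1 if c_old is not None and np.array_equal(c, c_old) else 1
        c_old = c
        if conv > convits:
            break
    return c_old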
3 Spatial-Temporal Affinity Propagation

3.1 Motivation
Expressed in Affinity Propagation terminology, the feature points clustering problem can be described as: given a set of tracked feature points {f_1^{τ1:t}, f_2^{τ2:t}, ..., f_n^{τn:t}} and a set of similarity values s(i, k) derived from spatial and temporal cues, find the optimal configuration of hidden labels c = (c_1, ..., c_n), where c_i = k if data point i has chosen k as its exemplar. For general object detection and tracking, the temporal constraint models the smooth-motion assumption. For feature clustering specifically, it reduces to the requirement that the clustering configurations of two consecutive frames should be consistent and stable. To incorporate this temporal constraint into the Affinity Propagation framework, we introduce a temporal penalty function for the labeling configuration regarding k chosen as an exemplar:

δ_k^T(c) = { −∞, if c_k = k but ∃ i ∈ G_{t−1}(k) : c_i ≠ k
           { 0,   otherwise                                      (1)

where G_{t−1}(k) denotes the cluster at time t−1 that feature k belongs to. G_{t−1}(k) can be calculated because each feature point is tracked individually. The temporal penalty poses the constraint that if k is chosen as an exemplar at time t, then the feature points which were in the same cluster as k at time t−1 should all choose k as their exemplar at time t. In other words, δ_k^T(c) implements a relaxed version of the strong constraint that feature points that were clustered together at time t−1 should still be clustered together at time t. Because of the dynamic nature of feature points (they are tracked, appear, or disappear), satisfying δ_k^T(c) does not guarantee the satisfaction of the strong constraint, for example when newly-born features are selected as exemplars. Nevertheless, the benefits
of defining δ_k^T(c) by (1) are twofold: on the one hand it is more flexible than the strong constraint, allowing new feature points to contribute to the re-clustering of old feature points; on the other hand it keeps the optimization problem simple and tractable, and can be seamlessly integrated into the original Affinity Propagation framework, as we will show shortly.
3.2 Derivation
Following the same derivation method as the original Affinity Propagation algorithm [11], Spatial-Temporal Affinity Propagation can be viewed as searching over valid configurations of c so as to maximize the net similarity S:

S(c) = Σ_{i=1}^{n} s(i, c_i) + Σ_{k=1}^{n} δ_k(c)    (2)

where

δ_k(c) = δ_k^S(c) + δ_k^T(c)    (3)

δ_k^S(c) = { −∞, if c_k ≠ k but ∃ i : c_i = k
           { 0,   otherwise                      (4)
δkS (c) is the spatial penalty that enforces the validity of the configuration c with regard to k being chosen as an exemplar. δkT (c) is the temporal penalty defined in (1) that enforces the temporal constraint.
Fig. 1. Left: factor graph for Affinity Propagation. Middle: the calculation of responsibilities. Right: the calculation of availabilities.
Function (2) can be represented as a factor graph as shown in Fig. 1, and the optimization can be done using the max-sum algorithm (the log-domain version of the sum-product algorithm). There are two kinds of messages: the message sent from c_i to δ_k(c), denoted as ρ_{i→k}(j), and the message sent from δ_k(c) to c_i, denoted as α_{i←k}(j). Both messages are vectors of n real numbers, one for each possible value j of c_i, and they are calculated according to (5) and (6) as follows:

ρ_{i→k}(c_i) = s(i, c_i) + Σ_{k': k' ≠ k} α_{i←k'}(c_i)    (5)
with

α_{i←k}(c_i) = [best possible configuration satisfying δ_k given c_i]
             = max_{j_1,...,j_{i−1},j_{i+1},...,j_n} [ δ_k(j_1, ..., j_{i−1}, c_i, j_{i+1}, ..., j_n) + Σ_{i': i' ≠ i} ρ_{i'→k}(j_{i'}) ]

which, for the penalty δ_k(c) defined above, evaluates to four cases:

α_{i←k}(c_i) =
  Σ_{i': i' ∉ G_{t−1}(k)} max_{j'} ρ_{i'→k}(j') + Σ_{i': i' ∈ G_{t−1}(k)\{k}} ρ_{i'→k}(k),   for c_i = k = i   (k is an exemplar)
  Σ_{i': i' ≠ k} max_{j': j' ≠ k} ρ_{i'→k}(j'),   for c_i ≠ k = i   (best configuration with no cluster k)
  ρ_{k→k}(k) + Σ_{i': i' ∉ {i} ∪ G_{t−1}(k)} max_{j'} ρ_{i'→k}(j') + Σ_{i': i' ∈ G_{t−1}(k)\{k,i}} ρ_{i'→k}(k),   for c_i = k ≠ i
  max{ max_{j': j' ≠ k} ρ_{k→k}(j') + Σ_{i': i' ∉ {i,k}} max_{j': j' ≠ k} ρ_{i'→k}(j'),
       ρ_{k→k}(k) + Σ_{i': i' ∉ {i} ∪ G_{t−1}(k)} max_{j'} ρ_{i'→k}(j') + Σ_{i': i' ∈ G_{t−1}(k)\{k,i}} ρ_{i'→k}(k) },   for c_i ≠ k ≠ i
                                                                                                   (6)

If we assume that each message consists of a constant part and a variable part, i.e. ρ_{i→k}(c_i) = ρ̃_{i→k}(c_i) + ρ̄_{i→k} and α_{i←k}(c_i) = α̃_{i←k}(c_i) + ᾱ_{i←k}, and define ρ̄_{i→k} = max_{j: j ≠ k} ρ_{i→k}(j) and ᾱ_{i←k} = α_{i←k}(c_i : c_i ≠ k), then we have

max_{j: j ≠ k} ρ̃_{i→k}(j) = 0    (7)
max_j ρ̃_{i→k}(j) = max{0, ρ̃_{i→k}(k)}    (8)
α̃_{i←k}(c_i) = 0,   for c_i ≠ k    (9)
Σ_{k': k' ≠ k} α̃_{i←k'}(c_i) = { α̃_{i←c_i}(c_i),   for c_i ≠ k
                                { 0,                 for c_i = k    (10)

By substituting the above equations into (5) and (6), we get simplified message calculations:

ρ_{i→k}(c_i) = { s(i, k) + Σ_{k': k' ≠ k} ᾱ_{i←k'},                        for c_i = k
               { s(i, c_i) + α̃_{i←c_i}(c_i) + Σ_{k': k' ≠ k} ᾱ_{i←k'},    for c_i ≠ k      (11)
α_{i←k}(c_i) =
  Σ_{i': i' ∉ G_{t−1}(k)} max{0, ρ̃_{i'→k}(k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k}} ρ̃_{i'→k}(k) + Σ_{i': i' ≠ k} ρ̄_{i'→k},   for c_i = k = i
  Σ_{i': i' ≠ k} ρ̄_{i'→k},   for c_i ≠ k = i
  ρ̃_{k→k}(k) + Σ_{i': i' ∉ {i} ∪ G_{t−1}(k)} max{0, ρ̃_{i'→k}(k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k,i}} ρ̃_{i'→k}(k) + Σ_{i': i' ≠ i} ρ̄_{i'→k},   for c_i = k ≠ i
  max{0, ρ̃_{k→k}(k) + Σ_{i': i' ∉ {i} ∪ G_{t−1}(k)} max{0, ρ̃_{i'→k}(k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k,i}} ρ̃_{i'→k}(k)} + Σ_{i': i' ≠ i} ρ̄_{i'→k},   for c_i ≠ k ≠ i
                                                                                                   (12)

If we solve for ρ̃_{i→k}(c_i = k) = ρ_{i→k}(c_i = k) − ρ̄_{i→k} and α̃_{i←k}(c_i = k) = α_{i←k}(c_i = k) − ᾱ_{i←k}, we obtain simple update equations in which the ρ̄ and ᾱ terms cancel:

ρ̃_{i→k}(k) = ρ_{i→k}(k) − max_{j: j ≠ k} ρ_{i→k}(j) = s(i, k) − max_{j: j ≠ k} [ s(i, j) + α̃_{i←j}(j) ]    (13)

α̃_{i←k}(k) = α_{i←k}(k) − α_{i←k}(j : j ≠ k)
  = { Σ_{i': i' ∉ G_{t−1}(k)} max{0, ρ̃_{i'→k}(k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k}} ρ̃_{i'→k}(k),   for k = i
    { min{0, ρ̃_{k→k}(k) + Σ_{i': i' ∉ {i} ∪ G_{t−1}(k)} max{0, ρ̃_{i'→k}(k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k,i}} ρ̃_{i'→k}(k)},   for k ≠ i
                                                                                                   (14)

The second case of (14) follows from x − max(0, x) = min(0, x). ρ̃_{i→k}(c_i) and α̃_{i←k}(c_i) for c_i ≠ k are not used in (13) and (14) because α̃_{i←k}(c_i ≠ k) = 0. Now the original vector messages are reduced to scalar ones, with responsibilities defined as r(i, k) = ρ̃_{i→k}(k) and availabilities as a(i, k) = α̃_{i←k}(k), and the message update rules finally become:

r(i, k) = s(i, k) − max_{j: j ≠ k} [ s(i, j) + a(i, j) ]    (15)

a(i, k) = { Σ_{i': i' ∉ G_{t−1}(k)} max{0, r(i', k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k}} r(i', k),   for k = i
          { min{0, r(k, k) + Σ_{i': i' ∉ {i} ∪ G_{t−1}(k)} max{0, r(i', k)} + Σ_{i': i' ∈ G_{t−1}(k)\{k,i}} r(i', k)},   for k ≠ i
                                                                                                   (16)
612
J. Yang et al.
k corresponding to newly-born features, the r(i , k) term in (16) vanishes and the availability update rule reduces to the original version. Everything else including the label estimation equation and algorithm termination criteria remain unchanged from the original algorithm. The STAP algorithm is summarized in Algorithm 2.
4
Experiments
To validate the effectiveness and efficiency of the proposed STAP algorithm, we test it with two applications in traffic video analysis. For vehicle detection and tracking, we compare it with the the Normalized Cut approach [2], as well as the original Affinity Propagation. For virtual loop detector, we compare it with a background subtraction based approach. All the methods are implemented in unoptimized C code and run on a Pentium IV 3.0 GHz machine. 4.1
Vehicle Detection and Tracking
Implementation Details. We adopt a similarity function which incorporates spatial and temporal cues from the feature point level: s(i, k) = −γs ds (i, k) − γm dm (i, k) − γb db (i, k)
(17)
ds (i, k) = ||Fit − Fkt || measures the spatial proximity between point i and point k, where Fit is the real-world position of fit calculated from camera calibration. dm (i, k) = maxtj=max{τi ,τk } ||(Fit −Fij )−(Fkt −Fkj )|| models the motion similarity fkt between the two trajectories. db (i, k) = 1 B t (f ) = 1 calculates the f =fit number of background pixels between the two points, where B t is the background mask at time t. For temporal association, we use a novel model based approach. First we define for a cluster Gti the model fitness function f it(Gti ) based on the overlap relationship between the convex hull mask spanned by Gti and a typical vehicle shape mask generated by projecting a 3D vehicle model at the corresponding with the clusterGt−1 from position of the cluster. A cluster Gti is associated k t−1 t the previous frame if k = arg maxj |Gj | f it(Gi ∪ Gt−1 ) > β . For each a j unassociated Gti , we assign a new vehicle label to it if it satisfies f it(Gti ) > βn and |Gti | > βs . We employ a few preprocessing and postprocessing steps: (1) calibrate camera offline by assuming a planar road surface. (2) build a reference background image using simple temporal averaging method. (3) detect and track KLT [12] feature points, and prune the ones falling in the background region or out of ROI. (4) dropping small clusters whose sizes are below a predefined threshold βd . For the Normalized Cut implementation: (1) we use the same spatial and temporal cues, and the pairwise affinity between point i and point k is defined as e(i, k) = exp s(i, k). (2) for the recursive top-down bipartition process, we adopt a model based termination criteria rather than the original one which
Spatial-Temporal Affinity Propagation for Feature Clustering
613
is based on the normalized cut value. The partitioning of graph G terminates if f it(G) = 1. (3) The same temporal association scheme, preprocessing and postprocessing steps are applied. For STAP and AP, we set maxits to 1000, convits to 10, λ to 0.7, s(k, k) to the median of similarities s(i, k), i = k. All the other parameters of the three methods are set by trial-and-error to achieve the best performance. Experiment Setup. We use three diversified traffic video sequences for evaluation purposes. The three sequences are from different locations, different camera installations and different weather conditions. city1 and city2 are two challenging sequences of a busy urban intersection with city1 recorded with a low-height off-axis camera on a cloudy day and city2 with a low-height on-axis camera on a rainy day. They both contain lots of complicated traffic events during traffic light cycles such as the typical process of vehicles decelerate, merge, queue together, then accelerate and separate. The low height of the cameras brings extra difficulties to the detection task. city3 is recorded near a busy urban bridge with a medium-height off-axis camera in sunny weather. For each sequence, the
Algorithm 2. Spatial-Temporal Affinity Propagation
1 2 3 4 5 6 7
Input: similarity values s(i, k), maxits, convits, λ, Gt−1 (k) Output: exemplar index vector c iter ← 0, conv ← 0, a ← 0, r ← 0, c ← −∞ repeat iter ← iter + 1, oldc ← c, oldr ← r, olda ← a forall i, k do r(i, k) ← s(i, k) − max a(i, k ) + s(i, k ) k : k =k
r ← (1 − λ) ∗ r + λ ∗ oldr for i = k do
a(i, k) ← min 0, r(k, k) +
10 11 12 13 14 15 16 17 18 19 20
max 0, r(i , k) + r(i , k)
t−1 (k) i :i ∈{i}∪G /
8 9
forall k do a(k, k) ←
i :i ∈Gt−1 (k)/{k,i}
max 0, r(i , k) + r(i , k)
i :i ∈G / t−1 (k)
i :i ∈Gt−1 (k)/{k}
a ← (1 − λ) ∗ a + λ ∗ olda foreach k do if r(k, k) + a(k, k) > 0 then ck ← k foreach k do if ck = k then ck ← arg maxi: ci =i s(k, i) if c = oldc then conv ← conv + 1 else conv ← 1 until iter > maxits or conv > convits
614
J. Yang et al. Table 1. Sequence information sequence camera camera weather sequence labeled labeled name position height condition length frames vehicles city1 city2 city3
off-axis low cloudy on-axis low rainy off-axis medium sunny
2016 2016 1366
200 200 136
2035 2592 1509
Table 2. Performance information. TPR stands for true positive rate, and FPR stands for false positive rate. sequence name
TPR at 10% FPR run time (ms/frame) NCut AP STAP NCut AP STAP v1 v2 v1 v2 v1 v2 v1 v2 v1 v2 v1 v2
city1 city2 city3
0.65 0.61 0.66 0.68 0.67 0.70 85 136 82 112 83 117 0.67 0.66 0.73 0.73 0.78 0.76 108 244 101 182 104 189 0.42 0.40 0.45 0.44 0.48 0.49 64 80 63 76 64 77
groundtruth is manually labeled at approximately ten frames apart. Detailed information for each sequence is given in Table 1. To facilitate the analysis, we designed two versions for each of the three methods. We label the individual vehicle lanes for each sequence beforehand. The first version works on each lane separately, denoted as v1. The second version works on all the lanes combined, denoted as v2. The performance calculation is based on the comparison with hand-labeled groundtruth on a strict frame-by-frame basis. Experimental Result. The ROC curves of the three methods, each with two versions on the three sequences, are shown in Fig. 2, and detailed performance comparison can be found in Table 2. We can see that both STAP and AP achieve higher performance and run faster than Normalized Cut, and STAP performs better than AP at the cost of a little extra run time. For all the methods, the performance on city3 is the lowest among the three sequences because of the much lower video quality of city3 and specific difficulties in sunny weather video such as shadow cast by vehicles, trees, etc, while the performance on city1 is worse than that on city2 because of the more frequent inter-lane occlusions for off-axis cameras. For STAP and AP, v1 on city1 does not offer an improvement over v2 in performance probably because in off-axis cameras, feature points from one vehicle usually spill over two or even more lanes, thus making lane separation a disadvantage rather than an advantage. For each method, the two versions give us an idea of how the algorithm scales with problem size. We see from Table 2 that the run time of Normalized Cut increases much faster with problem size than STAP and AP do. Normalized Cut performs reasonably well in our experiments because of two improved components in our implementation which are different from the original approach: one is the model based temporal association scheme which can
1
Fig. 2. ROCs on the three sequences. From left to right: city1, city2 and city3.
Fig. 3. AP vs. STAP. From left to right: clustering results from the previous frame, current frame clustering by AP, current frame clustering by STAP, and the convergence of net similarity for STAP. Identified exemplars are represented by triangles.
avoid undesirable associations that could lead to cluster drift, dilation, and other meaningless results; the other is the model based termination criterion for recursive bipartitioning, which introduces object-level semantics into the process. The advantage of STAP over the original AP algorithm in feature clustering is illustrated in Fig. 3. Without the temporal constraint, AP can end up with fragmented results caused by factors such as unstable image evidence, even though the clustering configuration from the previous frame is good. STAP yields a much more desirable solution than AP by enforcing temporal consistency. From the figure we can also see that, despite the good clustering result, the identified exemplars usually do not have a physical meaning. The net similarity value for the STAP example is also plotted in Fig. 3, which shows the fast convergence of STAP. Examples of detection results of STAP on the three sequences are shown in Fig. 4, and examples of generated vehicle trajectories are given in Fig. 5.
4.2 Virtual Loop Detector
Accurate object level information is not always available for traffic state estimation because of the inherent difficulties in the task of vehicle detection and tracking. In such situations, traffic parameters estimated from low-level features are still very useful for facilitating traffic control. A loop detector is a device buried underneath the road surface which can output simple pulse signals when vehicles are passing from above. Loop detectors have long been used in traffic monitoring and control, specifically for estimating traffic density, traffic flow, etc. Because of the high installation and maintenance costs, many computer
Fig. 4. STAP example detections on the three sequences. From left to right: city1, city2 and city3.
Fig. 5. STAP example vehicle trajectories on the three sequences. From left to right: city1, city2 and city3.
vision programs, called virtual loop detectors, have been proposed to simulate the functions of a physical loop device.

Implementation Details. We use the occupancy ratio of a virtual loop mask M_v to estimate the "on" and "off" status of loop v. For the STAP virtual loop detector, we generate an object mask M_o^t, which consists of the convex hulls of the feature clusters at time t. The status of loop v is estimated as

l_v^s = { on,  if |M_v ∩ M_o^t| / |M_v| > β_l^s
        { off, otherwise                            (18)

For the background subtraction virtual loop detector, the status of loop v is estimated as

l_v^b = { on,  if |M_v ∩ F^t| / |M_v| > β_l^b
        { off, otherwise                            (19)

where F^t is the foreground mask at time t. The parameters of STAP are set in the same way as in the previous application. For background subtraction, we use a simple image averaging method and the parameters are fine-tuned manually.
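A minimal sketch of the occupancy test in Eqs. (18)/(19) is given below; the threshold value is illustrative only, as the paper tunes β_l^s and β_l^b experimentally.

import numpy as np

def loop_status(loop_mask, object_mask, beta=0.3):
    # Eqs (18)/(19): 'on' if the occupied fraction of the virtual loop exceeds beta.
    occupancy = np.logical_and(loop_mask, object_mask).sum() / float(loop_mask.sum())
    return occupancy > beta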
Fig. 6. From left to right: ROCs and example STAP virtual loop detectors on city1 and city2 respectively.
Experiment Setup. We use the two sequences city1 and city2 from the previous application for evaluation. For each sequence, the groundtruth status of each virtual loop is manually recorded for each frame by a human observer. The performance calculation is done by a frame-by-frame comparison between detection and groundtruth.

Experimental Result. The ROC curves of the two methods on the two sequences are shown in Fig. 6, indicating the superiority of STAP over background subtraction in simulating loop detectors. Background subtraction works at the pixel level and is known to be sensitive to difficulties such as illumination change, camera shake, shadows, etc., while STAP outputs feature clusters at the object or semi-object level, thus avoiding some of the problems that background subtraction is subject to. Some example detections of the STAP virtual loop detector are also given in Fig. 6.
5 Conclusion
In this paper we have discussed an extension of the Affinity Propagation algorithm to the scenario of feature points clustering by incorporating a temporal consistency constraint on the clustering configurations of consecutive frames. Applications in traffic video analysis demonstrate the advantages of the proposed method over the original algorithm and over state-of-the-art traffic video analysis approaches. In future work, we will explore more applications of STAP feature clustering in traffic video analysis, and investigate the possibility of fusing feature points and other informative features in the STAP framework.

Acknowledgement. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References

1. McLauchlan, P.F., Beymer, D., Coifman, B., Malik, J.: A real-time computer vision system for measuring traffic parameters. In: CVPR, pp. 495–501 (1997)
2. Kanhere, N.K., Pundlik, S.J., Birchfield, S.: Vehicle segmentation and tracking from a low-angle off-axis camera. In: CVPR, pp. 1152–1157 (2005)
3. Du, W., Piater, J.: Tracking by cluster analysis of feature points using a mixture particle filter. In: AVSS, pp. 165–170 (2005)
4. Brostow, G.J., Cipolla, R.: Unsupervised Bayesian detection of independent motion in crowds. In: CVPR, pp. 594–601 (2006)
5. Antonini, G., Thiran, J.P.: Counting pedestrians in video sequences using trajectory clustering. TCSVT 16, 1008–1020 (2006)
6. Rabaud, V., Belongie, S.: Counting crowded moving objects. In: CVPR, pp. 705–711 (2006)
7. Li, Y., Ai, H.: Fast detection of independent motion in crowds guided by supervised learning. In: ICIP, pp. 341–344 (2007)
8. Kim, Z.: Real time object tracking based on dynamic feature grouping with background subtraction. In: CVPR (2008)
9. Kanhere, N., Birchfield, S.: Real-time incremental segmentation and tracking of vehicles at low camera angles using stable features. TITS 9, 148–160 (2008)
10. Yang, J., Wang, Y., Ye, G., Sowmya, A., Zhang, B., Xu, J.: Feature clustering for vehicle detection and tracking in road traffic surveillance. In: ICIP, pp. 1152–1157 (2009)
11. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
12. Tomasi, C., Shi, J.: Good features to track. In: CVPR, pp. 593–600 (1994)
Minimal Representations for Uncertainty and Estimation in Projective Spaces

Wolfgang Förstner

University Bonn, Department of Photogrammetry
Nussallee 15, D-53115 Bonn
[email protected]

Abstract. Estimation using homogeneous entities has to cope with obstacles such as singularities of covariance matrices and redundant parametrizations, which do not allow an immediate definition of maximum likelihood estimation and lead to estimation problems with more parameters than necessary. The paper proposes a representation of the uncertainty of all types of geometric entities, and estimation procedures for geometric entities and transformations, which (1) only require the minimum number of parameters, (2) are free of singularities, (3) allow for a consistent update within an iterative procedure, (4) enable exploiting the simplicity of homogeneous coordinates to represent geometric constraints, and (5) allow handling geometric entities which are at infinity or at least very far away, avoiding concepts like the inverse depth. Such representations are already available for transformations such as rotations, motions (Rosenhahn 2002), homographies (Begelfor 2005), or the projective correlation with the fundamental matrix (Bartoli 2004), all being elements of some Lie group. The uncertainty is represented in the tangent space of the manifold, namely the corresponding Lie algebra. However, to our knowledge no such representations have been developed for the basic geometric entities such as points, lines and planes, since in addition to using the tangent space of the manifolds we need transformations of the entities such that they stay on their specific manifold during the estimation process. We develop the concept, discuss its usefulness for bundle adjustment, and demonstrate (a) its superiority compared to simpler methods for vanishing point estimation, (b) its rigour when estimating 3D lines from 3D points, and (c) its applicability for determining 3D lines from observed image line segments in a multi-view setup.
Motivation. Estimation of entities in projective spaces, such as points or transformations, has to cope with the scale ambiguity of these entities, resulting from the redundancy of the projective representations, and with the definition of proper metrics which, on the one hand, reflect the uncertainty of the entities and, on the other hand, lead to estimates which are invariant to possible gauge transformations, i.e., changes of the reference system. The paper shows how to consistently perform maximum likelihood estimation for an arbitrary number of geometric entities in projective spaces, including elements at infinity, under realistic assumptions and without the need to impose constraints resulting from the redundant representations.
The scale ambiguity of homogeneous entities results from the redundant representation: two elements, say the 2D points x(x) and y(y), are identical in case their representations with homogeneous coordinates, here x and y, are proportional. This ambiguity is regularly avoided by a proper normalization of the homogeneous entities. Mostly one applies either Euclidean normalization, say x^e = x/x_3 (cf. [13]), then accepting that no elements at infinity can be represented, or spherical normalization, say x^s = x/|x|, then accepting that the parameters to be estimated sit on a non-linear manifold, here the unit sphere S^2, cf. [7, 11]. The sign ambiguity usually does not cause difficulty, as the homogeneous constraints used for reasoning are independent of the chosen sign. The singularity of constrained observations has also been pointed out in [5].

The uncertainty of an observed geometric entity can, in many practical cases, be represented sufficiently well by a Gaussian distribution, say N(μ_x, Σ_xx). The distribution of a derived entity, y = f(x), resulting from a non-linear transformation can also be approximated by a Gaussian distribution, using a Taylor expansion at the mean μ_x and omitting higher order terms. The degree of approximation depends on the relative accuracy and has been shown to be negligible in most cases, cf. [8, p. 55].

The invariance of the estimates w.r.t. the choice of the coordinate system of the estimated entities is usually achieved by minimizing a function in the Euclidean space of observations, in the context of bundle adjustment the reprojection error, leading to the optimization function

Ω = Σ_i (x_i − x̂_i)^T Σ_{x_i x_i}^{−1} (x_i − x̂_i).

This at the same time is the Mahalanobis distance between the observed and estimated entities and can be used to evaluate whether the model fits the data.

This situation becomes difficult in case one wants to handle elements at infinity and therefore wants to use spherically normalized homogeneous vectors, or at least normalized direction vectors when using omnidirectional cameras, as their covariance matrices are singular. The rank deficiency is at least one, due to the homogeneity. In case further constraints need to be taken into account, such as the Plücker constraint for 3D lines or the singularity constraint for fundamental matrices, the rank deficiency increases with the number of these constraints. Therefore, in case we want to use these normalized vectors or matrices as observed quantities, already the formulation of the optimization function based on homogeneous entities is not possible and requires a careful discussion about estimable quantities, cf. [17]. Also, the redundant representation requires additional constraints, which lead to Lagrangian parameters in the estimation process. As an example, one would need four parameters to estimate a 2D point, three for the homogeneous coordinates and one Lagrangian for the constraint, i.e., two parameters more than the degrees of freedom. It remains open how to arrive at a minimal representation for the uncertainty and at estimates of all types of geometric entities in projective spaces which are free of singularities and allow handling entities at infinity.
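As a small illustration, the optimization function Ω (the summed Mahalanobis distances) can be evaluated as follows; the function name and the use of a per-point covariance list are our own choices.

import numpy as np

def mahalanobis_sum(x_obs, x_est, covariances):
    # Omega = sum_i (x_i - xhat_i)^T Sigma_{x_i x_i}^{-1} (x_i - xhat_i)
    omega = 0.0
    for x, xh, C in zip(x_obs, x_est, covariances):
        d = np.asarray(x, float) - np.asarray(xh, float)
        omega += d @ np.linalg.solve(np.asarray(C, float), d)
    return omega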
Related work. This problem has been addressed successfully for geometric transformations. Common to these approaches is the observation that all types of transformations form differentiable groups, called Lie groups. Starting from an approximate transformation, the estimation can exploit the corresponding Lie algebra, being the tangent space at the unit element of the manifold of the group. Take as an example the group SO(3) of rotations: starting from an approximate rotation R^a, a close-by rotation can be represented by

R = R(ΔR) R^a    (1)

with a small rotation with rotation matrix R(ΔR) depending on the small rotation vector ΔR. This small rotation vector can be estimated in the approximate model R ≈ (I_3 + S(ΔR)) R^a, where the components of the rotation vector ΔR appear linearly, building the three-dimensional space of the Lie algebra so(3) corresponding to the Lie group SO(3). Here S(·) is the skew-symmetric matrix inducing the cross product. The main difference to a Taylor approximation, say R ≈ R^a + ΔR, which is additive, is the multiplicative correction in (1), which guarantees that the corrected matrix is a proper rotation matrix, based on the exponential representation R(ΔR) = exp(S(ΔR)) of a rotation using skew-symmetric matrices.

This concept can be found in all proposals for a minimal representation for transformations: based on the work of Bregler et al. [6], Rosenhahn et al. [19] used the exponential map for modelling spatial Euclidean motions, the special Euclidean group SE(3) being composed of rotations SO(3) and translations in IR^3. Bartoli and Sturm [2] used the idea to estimate the fundamental matrix with a minimal representation F = R_1 Diag(exp(λ), exp(−λ), 0) R_2, twice using the rotation group and once the multiplication group IR^+. Begelfor and Werman [4] showed how to estimate a general 2D homography with a minimal representation in a statistically rigorous manner, namely using the special linear group SL(3) of 3 × 3 matrices with determinant 1 and its Lie algebra sl(3) consisting of all matrices with trace zero, building an eight-dimensional vector space, correctly reflecting the number of degrees of freedom.

To our knowledge, the only attempts to use minimal representations for geometric entities other than transformations have been given by Sturm [20] and Åström [1]. Sturm suggested a minimal representation of conics, namely C = R D R^T with D = Diag(a, b, c). Using a corresponding class of homographies H = R Diag(d, e, f), R being a rotation matrix, which map any conic into the unit circle Diag(1, 1, −1), he determines updates for the conic in this reduced representation and at the end undoes the mapping. Another way to achieve a minimal representation for homogeneous entities is given by Åström [1] in the context of structure from motion. He proposes to use the Cholesky decomposition of the pseudo-inverse of the covariance matrix of the spherically normalized homogeneous 2D point coordinates x to arrive at a whitened and reduced observation. For 3D lines he uses a special double-point representation with minimal parameters. He provides no method to update the approximate values x^a guaranteeing the estimate x̂ ∈ S^2. His 3D line representation is also not linked to the projective Plücker representation, and he cannot estimate elements at infinity.

Our proposal is similar in flavour to the idea of Sturm and the method of Åström to represent uncertain 2D points. However, it is simpler to handle, as it directly works on the manifold of the homogeneous entities.
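A hedged sketch of the multiplicative rotation update discussed above, using the Rodrigues form of exp(S(Δr)), is given below; the function names are illustrative only.

import numpy as np

def skew(v):
    # S(v): skew-symmetric matrix with S(v) w = v x w (cross product).
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def update_rotation(R_approx, delta_r):
    # Multiplicative update R = exp(S(delta_r)) R_approx via the Rodrigues formula;
    # the result stays on SO(3), unlike the additive Taylor update.
    delta_r = np.asarray(delta_r, float)
    theta = np.linalg.norm(delta_r)
    if theta < 1e-12:
        R_delta = np.eye(3)                                   # negligible rotation
    else:
        K = skew(delta_r / theta)
        R_delta = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    return R_delta @ R_approx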
Notation. We name objects with calligraphic letters, say a point x, in order to be able to change representation, e.g. using Euclidean coordinates denoted with a slanted letter x or homogeneous coordinates with an upright letter x. Matrices are denoted with sans serif capital letters, again slanted in the Euclidean and upright in the homogeneous case. This transfers to the indices of covariance matrices, the covariance matrix Σ_xx for a Euclidean vector and Σ_xx for a homogeneous vector. The operator N(.) normalizes a vector to unit length. We adopt the Matlab syntax to denote the stack of two vectors or matrices, e.g. z = [x; y] = [x^T, y^T]^T. Stochastic variables are underscored, e.g. x.
1 Minimal Representation of Uncertainty
The natural spaces of homogeneous entities are the unit spheres S^n, possibly constrained to a subspace. Spherically normalized homogeneous coordinates of 2D points (x^s) and lines (l^s), and of 3D points (X^s) and planes (A^s), live on S^2 and S^3, respectively, while 3D lines, represented by Plücker coordinates (L^s), live on the Klein quadric Q. The points on S^3 build the Lie group of unit quaternions, which could be used for an incremental update q = Δq q^a of spherically normalized 3D point or plane vectors. However, there exist no Lie groups for points on the 2-sphere or on the Klein quadric, see [18]. Therefore we need to develop an update scheme which guarantees the updates to lie on the manifold, without relying on some group concept. We will develop the concept for unit vectors on S^2, representing 2D points and lines, and generalize it to the other geometric entities.
1.1 Minimal Representation for Uncertain Points in 2D and 3D
Let an uncertain 2D point x be represented with its mean, the 2-vector μ_x, and its 2 × 2 covariance matrix Σ_xx. It can be visualized by the standard ellipse (x − μ_x)^T Σ_xx^{-1} (x − μ_x) = 1. Spherically normalizing the homogeneous vector x = [x; 1] = [u, v, w]^T yields

x^s = x / |x|,   Σ_{x^s x^s} = J Σ_xx J^T,   with the singular 3 × 3 covariance Σ_xx = [Σ_xx, 0; 0^T, 0] of x and J = (I_3 − x^s x^{sT}) / |x|,   (1)
with rank(Σ_{x^s x^s}) = 2 and null(Σ_{x^s x^s}) = x^s. Taking the smallest eigenvalue to be an infinitely small positive number, one sees that the standard error ellipsoid is flat and lies in the tangent space of S^2 at x. In the following we assume all point vectors x to be spherically normalized and omit the superscript s for simplicity of notation. We now want to choose a coordinate system J_x(x) = [s, t] in the tangent space ⊥ x and represent the uncertainty by a 2 × 2 matrix in that coordinate system. This is easily achieved by using

J_x(x) = null(x^T),   (2)
Fig. 1. Minimal representation for an uncertain point x(x) on the unit sphere S^2, representing the projective plane IP^2, by a flat ellipsoid in the tangent plane at x. The uncertainty has only two degrees of freedom in the tangent space, spanned by two basis vectors s and t of the tangent space, being the null space of x^T. The uncertainty should not be too large, such that the distribution on the sphere and the distribution on the tangent plane do not differ too much, as would be the case at point y.
assuming it to be an orthonormal matrix fulfilling J^T_x(x) J_x(x) = I_2, see Fig. 1. We define a stochastic 2-vector x_r ∼ M(0, Σ_{x_r x_r}) with mean 0 and covariance Σ_{x_r x_r} in the tangent space at x. In order to arrive at a spherically normalized random vector x with mean μ_x we need to spherically normalize the vector x_t = μ_x + J_x(μ_x) x_r in the tangent space and obtain

x(μ_x, x_r) = N(μ_x + J_x(μ_x) x_r)   (3)   with   J_x(μ_x) = ∂x/∂x_r |_{x = μ_x}.   (4)
We thus can identify J_x(μ_x) with the Jacobian of this transformation. The inverse transformation can be achieved using the pseudo inverse of J_x(x), which due to the construction is the transpose, J^+_x(x) = J^T_x(x). This leads to the reduction of the homogeneous vector to its reduced counterpart

x_r = J^T_x(μ_x) x.   (5)
As J^T_x(μ_x) μ_x = 0, the mean of x_r is the zero vector, μ_{x_r} = 0. This allows establishing the one-to-one correspondence between the reduced covariance matrix Σ_{x_r x_r} and the covariance matrix Σ_xx of x:

Σ_xx = J_x(μ_x) Σ_{x_r x_r} J^T_x(μ_x),   Σ_{x_r x_r} = J^T_x(μ_x) Σ_xx J_x(μ_x).   (6)
We use (5) to derive reduced observations and parameters and, after estimating corrections Δx̂_r, then apply (3) to find corrected estimates x̂ = x(x^a, Δx̂_r). A similar reasoning leads to the representation of 3D points. Again, the Jacobian J_X is the null space of X^T and spans the 3-dimensional tangent space of S^3 at X. The relations between the singular 4 × 4 covariance matrix of the spherically normalized vector X and the reduced 3 × 3 covariance matrix Σ_{X_r X_r} are equivalent to (6). Homogeneous 3-vectors l representing 2D lines and homogeneous 4-vectors A representing planes can be handled in the same way.
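As a concrete illustration of this construction, the following numpy sketch implements the spherical normalization (1), the tangent-space basis (2) and the reduction and update (3)-(6) for a 2D point. It is not code from the paper; all function names are illustrative.

```python
import numpy as np

def spherically_normalize(x_euclidean, Sigma_xx):
    """Euclidean 2D point -> spherically normalized homogeneous vector, cf. (1)."""
    x = np.append(x_euclidean, 1.0)                 # homogeneous coordinates [x; 1]
    Sigma_h = np.zeros((3, 3))
    Sigma_h[:2, :2] = Sigma_xx                      # singular 3x3 covariance
    n = np.linalg.norm(x)
    xs = x / n
    J = (np.eye(3) - np.outer(xs, xs)) / n          # Jacobian of the normalization
    return xs, J @ Sigma_h @ J.T

def tangent_basis(xs):
    """Orthonormal basis J_x(x) = null(x^T) of the tangent space, cf. (2)."""
    _, _, Vt = np.linalg.svd(xs.reshape(1, 3))      # null space of the 1x3 matrix x^T
    return Vt[1:].T                                 # 3x2 matrix [s, t]

def reduce_covariance(xs, Sigma_xsxs):
    """Reduced, regular 2x2 covariance in the tangent space, cf. (6)."""
    J = tangent_basis(xs)
    return J.T @ Sigma_xsxs @ J

def update(xs_approx, delta_xr):
    """Update on the manifold: normalize mu + J_x(mu) x_r, cf. (3)."""
    J = tangent_basis(xs_approx)
    xt = xs_approx + J @ delta_xr
    return xt / np.linalg.norm(xt)

# toy example
mu = np.array([3.0, 4.0])
Sigma = np.diag([0.01, 0.02])
xs, Sxs = spherically_normalize(mu, Sigma)
print(reduce_covariance(xs, Sxs))                   # regular 2x2 covariance
print(update(xs, np.array([1e-3, -2e-3])))          # stays on S^2
```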
1.2 Minimal Representation for 3D Lines
We now generalize the concept to 3D lines. Lines L in 3D are represented by their Plücker coordinates L = [L_h; L_0] = [Y − X; X × Y] in case they are
derived by joining two points X(X) and Y(Y). Line vectors need to fulfill the quadratic Plücker constraint L_h^T L_0 = 0 and span the Klein quadric Q consisting of all homogeneous 6-vectors fulfilling the Plücker constraint. The dual line L̄(L̄) has Plücker coordinates L̄ = [L_0; L_h], exchanging the first and second 3-vector. We also will use the Plücker matrix Γ(L) = XY^T − YX^T of a line and will assume 3D line vectors L to be spherically normalized. As in addition a 6-vector needs to fulfill the Plücker constraint in order to represent a 3D line, the space of 3D lines is four dimensional. The transfer of the minimal representation of points to 3D lines requires some care. The four dimensional tangent space is perpendicular to L, as L^T L − 1 = 0 holds, and perpendicular to L̄, as L̄^T L = 0 holds. Therefore the tangent space is given by the four columns of the 6 × 4 matrix
J_L(L) = null([L, L̄]^T),   (7)

again assuming this matrix to be orthonormal. Therefore for random perturbations L_r we have the general 6-vector

L_t(μ_L, L_r) = μ_L + J_L(μ_L) L_r   (8)
in the tangent space. In order to arrive at a random 6-vector which is both spherically normalized and fulfills the Plücker constraint also for finite random perturbations, we need to normalize L_t = [L_th; L_t0] accordingly. The two 3-vectors L_th and L_t0 in general are not orthogonal. Following the idea of Bartoli and Sturm [3] we therefore rotate these vectors in their common plane such that they become orthogonal. We use a simplified modification, as the normalization within an iteration sequence will have decreasing effect. We use linear interpolation of the directions D_h = N(L_th) and D_0 = N(L_t0). With the distance d = |D_h − D_0| and the shortest distance r = √(1 − d²/4) of the origin to the line joining D_h and D_0 we have

M_{h,0} = ((d ± 2r) D_h + (d ∓ 2r) D_0) |L_th| / (2d).   (9)
The 6-vector M = [M_h; M_0] now fulfills the Plücker constraint but needs to be spherically normalized. This finally leads to the normalized stochastic 3D line coordinates

L = N(L_t(μ_L, L_r)) = M / |M|,   (10)

which guarantees L to sit on the Klein quadric, thus to fulfill the Plücker constraint. In the following we assume all 3D line vectors to be normalized in this way, omitting the superscript e for clarity. The inverse relation to (10) is

L_r = J^T_L(μ_L) L,   (11)
as J_L(μ_L) is an orthonormal matrix. The relations between the covariances of L and L_r therefore are

Σ_LL = J_L(μ_L) Σ_{L_r L_r} J^T_L(μ_L),   Σ_{L_r L_r} = J^+_L(μ_L) Σ_LL J^{+T}_L(μ_L).   (12)
Conics and quadrics can be handled in the same manner. A conic C = [a, b, c; b, d, e; c, e, f] can be represented by the vector c = [a, b, c, d, e, f]^T, normalized to 1, and – due to the homogeneity of the representation – is a point in IP^5 with the required Jacobian being the null space of c^T. In contrast to the approach of Sturm [20], this setup would allow for singular conics.
Using the minimal representations introduced in the last section, we are able to perform ML-estimation for all entities. We restrict the model to contain constraints between observed and unknown entities only, not between the parameters only; a generalization is easily possible. We start with the model where the observations have regular covariance matrices and then reduce the model, such that also observations with singular covariance matrices can be handled. Finally, we show how to arrive at a Mahalanobis distance for uncertain geometric entities, where we need the inverse of the covariance matrix.
The optimization problem. We want to solve the following optimization problem:

minimize   Ω(v) = v^T Σ_ll^{-1} v   subject to   g(l + v, x) = 0,   (13)
where the N observations l, their N × N covariance matrix Σ_ll and the G constraint functions g are given, and the N residuals v and the U parameters x are unknown. The number G of constraints needs to be larger than the number of parameters U. Also it is assumed that the constraints are functionally independent. The solution yields the maximum likelihood estimates, namely the fitted observations l̂ and parameters x̂, under the assumption that the observations are normally distributed with covariance matrix Σ_ll = D(l) = D(v), and the true observations l̃ fulfill the constraints given the true parameters x̃.
Example: Bundle adjustment. Bundle adjustment is based on the projection relation x_ij = λ_ij P_j X_i between the scene points X_i, the projection matrices P_j and the image points x_ij of point X_i observed in camera j. The classical approach eliminates the individual scale factors λ_ij by using Euclidean coordinates for the image points. Also the scene points are represented by their Euclidean coordinates. This does not allow for scene or image points at infinity. This may occur when using omnidirectional cameras, where a representation of the image points in a projection plane is not possible for all points, or in case scene points are very far compared to the length of the motion path of a camera, e.g. at the horizon. Rewriting the model as x_ij × P_j X_i = 0 eliminates the scale factor without constraining the image points not to be at infinity. Taking the homogeneous coordinates of all image points as observations and the parameters of all cameras and the homogeneous coordinates of all scene points as unknown parameters shows this model to have the structure of (13). The singularity of the covariance matrix of the spherically normalized image points and the necessity to represent the scene points also with spherically normalized homogeneous vectors requires the use of the corresponding reduced coordinates.
For solving the generally non-linear problem, we assume approximate values x̂^a and l̂^a for the fitted parameters and observations to be available. We thus search for corrections Δl̂ and Δx̂ for the fitted observations and parameters using l̂ = l̂^a + Δl̂ and x̂ = x̂^a + Δx̂, such that v̂ = l̂^a − l + Δl̂. With these assumptions we can rephrase the optimization problem: minimize Ω(Δl̂) = (l̂^a − l + Δl̂)^T Σ_ll^{-1} (l̂^a − l + Δl̂) subject to g(l̂^a + Δl̂, x̂^a + Δx̂) = 0. The approximate values are iteratively improved by finding best estimates for Δl̂ and Δx̂ using the linearized constraints

g(l̂^a, x̂^a) + A Δx̂ + B^T Δl̂ = 0   (14)

with the corresponding Jacobians A = ∂g/∂x and B^T = ∂g/∂l of g to be evaluated at the approximate values.
Reducing the model. We now want to transform the model in order to allow for observations with singular covariance matrices. For simplicity we assume the vectors l and x of all observations and unknown parameters can be partitioned into I and J individual and mutually uncorrelated observational vectors l_i, i = 1, ..., I, and parameter vectors x_j, j = 1, ..., J, referring to points, lines, planes or transformations. We first introduce the reduced observations l_ri, the reduced corrections of the observations Δl̂_ri and the reduced corrections Δx̂_rj:

l_ri = J^T_{l_i}(l̂^a, x̂^a) l_i,   Δl̂_ri = J^T_{l_i}(l̂^a, x̂^a) Δl̂_i,   Δx̂_rj = J^T_{x_j}(l̂^a, x̂^a) Δx̂_j,   (15)

where each Jacobian refers to the type of the entity it is applied to. The reduced approximate values are zero, as they are used to define the reduction; e.g. from (5) we conclude x^a_r = J^T_x(x^a) x^a = 0. We collect the Jacobians in two block diagonal matrices J_l = {J_{l_i}(l̂^a, x̂^a)} and J_x = {J_{x_j}(l̂^a, x̂^a)} in order to arrive at the reduced observations l_r = J^T_l l, the corrections for the reduced observations Δl̂_r = J^T_l Δl̂ and parameters Δx̂_r = J^T_x Δx̂.
Second we need to reduce the covariance matrices Σ_{l_i l_i}. This requires some care: As a covariance matrix is the mean squared deviation from the mean, we need to refer to the best estimate of the mean when using it. In our context the best estimate for the mean at the current iteration is the approximate value l̂^a_i. Therefore we need to apply two steps: (1) transfer the given covariance matrix, referring to l_i, such that it refers to l̂^a_i, and (2) reduce the covariance matrix to the minimal representation l_ri. As an example, let the observations be 2D lines with spherically normalized homogeneous vectors l_i. Then the reduction is achieved by Σ^a_{l_ri l_ri} = J^T_x(l̂^a_i) R(l_i, l̂^a_i) Σ_{l_i l_i} R^T(l_i, l̂^a_i) J_x(l̂^a_i), namely by first applying a minimal rotation R(l_i, l̂^a_i) from l_i to l̂^a_i (see [16, eq. (2.183)]), and second reducing the covariance matrix following (6). Observe that we use the same Jacobian J_x as for points, exploiting the duality of 2D points and 2D lines. The superscript a in Σ^a_{l_ri l_ri} indicates that the covariance depends on the approximate values. The reduced constraints now read as
g(l̂^a, x̂^a) + A_r Δx̂_r + B^T_r Δl̂_r = 0   with   A_r = A J_x,   B^T_r = B^T J_l.   (16)

Now we need to minimize the weighted sum of the squared reduced residuals v̂_r = l̂^a_r − l_r + Δl̂_r = −l_r + Δl̂_r, thus we need to minimize Ω(Δl̂_r) = (−l_r + Δl̂_r)^T Σ^{a −1}_{l_r l_r} (−l_r + Δl̂_r) subject to the reduced constraints in (16).
The parameters of the linearized model are obtained from (cf. [16, Tab. 2.3])

Σ_{x̂_r x̂_r} = (A^T_r (B^T_r Σ^a_{l_r l_r} B_r)^{-1} A_r)^{-1}   (17)
Δx̂_r = Σ_{x̂_r x̂_r} A^T_r (B^T_r Σ^a_{l_r l_r} B_r)^{-1} w_g   (18)
Δl̂_r = Σ^a_{l_r l_r} B_r (B^T_r Σ^a_{l_r l_r} B_r)^{-1} (w_g − A_r Δx̂_r) − v̂^a_r   (19)

using w_g = −g(l̂^a, x̂^a) + B^T_r v̂^a_r. It contains the theoretical covariance matrix Σ_{x̂_r x̂_r} in (17), at the same time being the Cramer-Rao bound. The weighted sum of residuals Ω is χ²-distributed with G − U degrees of freedom in case the observations fulfill the constraints and the observations are normally distributed with covariance matrix Σ_ll, and can be used to test the validity of the model.
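The following is a schematic numpy version of a single iteration of the reduced model (16)-(19); the reduced Jacobians, covariance and residuals are assumed to be assembled elsewhere, and the variable names belong to this sketch only.

```python
import numpy as np

def reduced_estimation_step(A_r, B_r, Sigma_lrlr, g0, v_r_a):
    """One iteration of the reduced model (16)-(19).

    A_r        : G x U   reduced Jacobian w.r.t. the parameters
    B_r        : Nr x G  reduced Jacobian w.r.t. the observations (B_r^T enters (16))
    Sigma_lrlr : Nr x Nr regular covariance of the reduced observations
    g0         : G-vector of constraint values g(l^a, x^a)
    v_r_a      : Nr-vector of current reduced residuals
    """
    W = np.linalg.inv(B_r.T @ Sigma_lrlr @ B_r)                # (B_r^T Sigma B_r)^{-1}
    w_g = -g0 + B_r.T @ v_r_a
    Sigma_xrxr = np.linalg.inv(A_r.T @ W @ A_r)                # (17), Cramer-Rao bound
    dx_r = Sigma_xrxr @ A_r.T @ W @ w_g                        # (18)
    dl_r = Sigma_lrlr @ B_r @ W @ (w_g - A_r @ dx_r) - v_r_a   # (19)
    return dx_r, dl_r, Sigma_xrxr

# shape check with random, well-conditioned toy matrices (G=8 constraints, U=4, Nr=12)
rng = np.random.default_rng(0)
A_r, B_r = rng.normal(size=(8, 4)), rng.normal(size=(12, 8))
dx, dl, Sxx = reduced_estimation_step(A_r, B_r, np.eye(12), rng.normal(size=8), np.zeros(12))
print(dx.shape, dl.shape, Sxx.shape)   # (4,) (12,) (4, 4)
```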
The corrections Δx̂_r and Δl̂_r are used to update the approximate values for the parameters and the fitted observations using the corresponding non-linear transformations, e.g. for an unknown 3D line one uses (10).
Example: Mahalanobis distance of two 3D lines. As an example for such an estimation we want to derive the mean of two 3D lines L_i(L_i, Σ_{L_i L_i}), i = 1, 2, and derive the Mahalanobis distance d(L_1, L_2) of the 3D lines. The model reads g(L̂_1, L̂_2) = L̂_2 − L̂_1 = 0, which exploits the fact that the line coordinates are normalized. First we need to choose an appropriate approximate line L^a, which in case the two lines are not too different can be one of the two. Then we reduce the two lines using the Jacobian J_L(L^a), being the same for both lines L_i, and obtain L_ri = J^T_L(L^a) L_i and the reduced covariance matrices Σ^a_{L_ri L_ri} = T_i Σ_{L_i L_i} T^T_i with T_i = J^T_L(L^a) R(L_i, L^a) using the minimal rotation from L_i to L^a. As there are no parameters to be estimated, the solution becomes simple. With the Jacobian B^T = [−I_6, I_6] and using Δx̂_r = 0 in (19), the reduced residuals are v̂_ri = ±Σ^a_{L_ri L_ri} (Σ^a_{L_r1 L_r1} + Σ^a_{L_r2 L_r2})^{-1} (−(L_r2 − L_r1)). The weighted sum Ω of the residuals is the Mahalanobis distance and therefore can be determined from

d²(L_1, L_2) = (L_r2 − L_r1)^T (Σ^a_{L_r1 L_r1} + Σ^a_{L_r2 L_r2})^{-1} (L_r2 − L_r1),   (20)
as the reduced covariance matrices in general have full rank. The squared distance d² is χ²_4 distributed, which can be used for testing. In case one does not have the covariance matrices of the 3D lines L_i but only two points X_i(X_i) and Y_i(Y_i) of a 3D line segment, and a good guess for their uncertainty, say σ² I_3, one easily can derive the covariance matrix from L_i = [Y_i − X_i; X_i × Y_i] by variance propagation. The equations directly transfer to the Mahalanobis distance of two 2D line segments. Thus, no heuristic is required to determine the distance of two line segments.
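A sketch of (20) for two 3D lines whose covariances are obtained from their end points by variance propagation, as described above. For simplicity the approximate line is taken as L_1 and the minimal rotation R(L_i, L^a) is omitted, which is adequate only for nearby lines; all names are illustrative.

```python
import numpy as np

def skew(x):
    return np.array([[0.0, -x[2], x[1]],
                     [x[2], 0.0, -x[0]],
                     [-x[1], x[0], 0.0]])

def line_tangent_basis(L):
    """6x4 orthonormal basis J_L(L) = null([L, dual(L)]^T), cf. (7)."""
    dual = np.concatenate([L[3:], L[:3]])
    _, _, Vt = np.linalg.svd(np.vstack([L, dual]))
    return Vt[2:].T

def line_from_points(Xp, Yp, sigma):
    """Normalized L = [Y-X; X x Y] with covariance from endpoint noise sigma^2 I_3."""
    L = np.concatenate([Yp - Xp, np.cross(Xp, Yp)])
    J_X = np.vstack([-np.eye(3), -skew(Yp)])      # dL/dX, since X x Y = -S(Y) X
    J_Y = np.vstack([np.eye(3), skew(Xp)])        # dL/dY, since X x Y =  S(X) Y
    Sigma = sigma**2 * (J_X @ J_X.T + J_Y @ J_Y.T)
    n = np.linalg.norm(L)
    Ln = L / n
    Jn = (np.eye(6) - np.outer(Ln, Ln)) / n       # spherical normalization
    return Ln, Jn @ Sigma @ Jn.T

def mahalanobis_3d_lines(L1, S1, L2, S2):
    """d^2(L1, L2) as in (20); chi^2 with 4 degrees of freedom under H0."""
    J = line_tangent_basis(L1)                    # L^a = L1, rotation transfer omitted
    d_r = J.T @ (L2 - L1)                         # L_r2 - L_r1
    S_r = J.T @ (S1 + S2) @ J                     # regular 4x4 covariance
    return float(d_r @ np.linalg.solve(S_r, d_r))

sigma = 0.01
L1, S1 = line_from_points(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), sigma)
L2, S2 = line_from_points(np.array([0.0, 0.005, 0.0]), np.array([1.0, 0.0, 0.01]), sigma)
print(mahalanobis_3d_lines(L1, S1, L2, S2))       # compare against chi^2_4 quantiles
```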
2 Examples
We want to demonstrate the setup with three interrelated problems: (1) demonstrating the superiority of the rigorous estimation compared to a classical one when estimating a vanishing point, (2) fitting a straight line through a set of 3D
points and (3) inferring a 3D line from image line segments observed in several images, as these tasks regularly appear in the 3D reconstruction of man-made scenes and are often solved suboptimally, see e.g. [12].
Estimating a vanishing point. Examples for vanishing point estimation from line segments using this methodology are found in [9]. Here we want to demonstrate that using a simple optimization criterion can lead to less interpretable results. We compare two models for the random perturbations of line segments: α) a model which determines the segment by fitting a line through the edge pixels, which are assumed to be determinable with a standard deviation of σ_p, and β) the model of Liebowitz [15, Fig. 3.16], who assumes the endpoints to have a zero-mean isotropic Gaussian distribution, and that the endpoint measurements are independent and have the same variance σ_e². Fig. 2 demonstrates the difference of the two models. Model α, using line fitting, results in a significant decrease of the uncertainty for longer line segments, the difference in standard deviation going with the square root of the length of the line segment. For a simulated case
Fig. 2. Error bounds (1 σ) for line segments. Dashed: following the model of Liebowitz [15], with standard deviation σ_e = 0.15 [pel] at the end points. Solid: from fitting a line through the edge pixels with standard deviation σ_p = 0.3 [pel], for segments with 10, 40 and 160 pixels. Deviations are 20 times overemphasized. The uncertainties differ by a factor of 2 on average in this case.
with 10 lines between 14 and 85 pixels, the uncertainty models on average differ by (85/14)^{1/4} ≈ 1.6 in standard deviation. Based on 200 repetitions, the empirical scatter of the vanishing point of method β of Liebowitz is approx. 20% larger in standard deviation than for method α using the line fitting accuracy as error model. This is a small gain. However, when statistically testing whether the line segments belong to the vanishing points, the decision depends on the assumed stochastic model for the line segments: compared to short segments, the test will be much more sensitive for long segments in case model α is used than when model β is used, which appears to be reasonable.
Fitting a 3D line through points. We now give an example proving the validity of the approach, again using simulated data in order to have full access to the uncertainty of the given data. Given are I uncertain 3D points X_i(X_i, Σ_{X_i X_i}), i = 1, ..., I, whose true values are supposed to sit on a 3D line L. The two constraints for the incidence of point X_i and the line L can be written as

g_i(X̃_i, L̃) = A_1(L̃) X̃_i = A_2(X̃_i) L̃ = 0   (21)
with the Jacobians A_1(L) and A_2(X_i) depending on the homogeneous coordinates of the points and the line. The incidence constraint of a point and a line can be expressed with the Plücker matrix of the dual line: Γ(L̄) X = 0. As from these four incidence constraints only two are linearly independent, we select those two where the entries in Γ(L̄) have maximal absolute value, leading to the two constraints g_i(X_i, L) = A_1(L) X_i for each point with the 2 × 4 matrix A_1(L). The Jacobian A_2(X_i) = ∂g_i/∂L then is a 2 × 6 matrix, cf. [16, sect. 2.3.5.3]. Reduction of these constraints leads to g_i(X_i, L) = A_{1r}(L) X_i = A_{2r}(X_i) L = 0 with the reduced Jacobians A_{1r}(L) and A_{2r}(X_i) having size 2 × 3 and 2 × 4, leading to a 4 × 4 normal equation matrix, the inverse of (17). Observe that the estimation does not need to incorporate the Plücker constraint, as this is taken into account after the estimation of L_r by the back projection (10).
We want to perform two tests of the setup: (1) whether the estimated variance factor actually is F-distributed, and (2) whether the theoretical covariance matrix Σ_{L̂ L̂} corresponds to the empirical one. For this we define true line parameters L̃, generate I true points X̃_i on this line, choose individual covariance matrices Σ_{X_i X_i} and perturb the points according to a normal distribution with zero mean and these covariance matrices. We generate M samples by repeating the perturbation M times. We determine initial estimates for the lines using an algebraic minimization, based on the constraints (21), which are linear in L, and perform the iterative ML-estimation. The iteration is terminated in case the corrections to the parameters are smaller than 0.1 times their standard deviation. We first perform a test where the line passes through the unit cube and contains I = 100 points, thus G = 200 and U = 4. The standard deviations of the points vary between 0.0002 and 0.03, thus range up to 3% relative accuracy referring to the distance to their centroid, which is comparably large. As each weighted sum of squared residuals Ω_m is χ²-distributed with G − U = R = 196 degrees of freedom, the sum Ω = Σ_m Ω_m of the independent samples Ω_m is χ²_{MR} distributed; thus the average variance factor σ̄²_0 = (1/M) Σ_m σ̂²_{0m} is F_{MR,∞} distributed. The histogram of the M = 1000 samples σ̂²_{0m} is shown in Fig. 3, left.
Second, we compare the sample covariance matrix Σ̂_{L̂ L̂} with Σ_{L̂ L̂}, the Cramer-Rao bound and a lower bound for the true covariance matrix, using the test for the hypothesis H_0: Σ̂_{L̂ L̂} = Σ_{L̂ L̂} from [14, eq. (287.5)]. As both covariance matrices are singular, we on one side reduce the theoretical covariance matrix to obtain Σ_{L̂_r L̂_r} and on the other side reduce the estimated line parameters, which we use to determine the empirical covariance matrix Σ̂_{L_r L_r} from the set {L̂_rm} of the M estimated 3D lines. The test statistic indicates that the hypothesis of the two covariance matrices being equal was not rejected.
Estimating 3D lines from image line segments. The following example demonstrates the practical use of the proposed method, namely determining 3D lines from image line segments. Fig. 3 shows three images taken with a Canon 450D. The focal length was determined using vanishing points, the principal point was assumed to be the image centre, and the images were not corrected for lens distortion. The images then have been mutually oriented using a bundle adjustment
Fig. 3. Left: Histogram of estimated variance factors σ̂²_{0m} determined from M = 1000 samples of a 3D line fitted through 100 points. The 99% confidence interval for the Fisher distributed random variable σ̄²_0 is [0.80, 1.25]. Right: three images with 12 corresponding straight line segments used for the reconstruction of the 3D lines, forming three groups [1..4], [5..8], [9..12] for the three main directions.
program. Straight line segments were automatically detected and a small subset of 12, visible in all three images and pointing in the three principal directions, were manually brought into correspondence. From the straight lines l_ij, i = 1, ..., 12, j = 1, 2, 3, and the projection matrices P_j we determine the projection planes A_ij = P^T_j l_ij. For determining the ML-estimates of the 12 lines L_i we need the covariance matrices of the projection planes. They are determined by variance propagation based on the covariance matrices of the image lines l_ij and the covariance matrices of the projection matrices. As we do not have the cross-covariance matrices between any two of the projection matrices, we only use the uncertainty Σ_{Z_j Z_j} of the three projection centres Z_j. The covariance matrices of the straight line segments are derived from the uncertainty given by the feature extraction as in the first example. The covariance matrix of the projection planes then is determined from Σ_{A_ij A_ij} = P^T_j Σ_{l_ij l_ij} P_j + (I_4 ⊗ l_ij^T) Σ_{p_j p_j} (I_4 ⊗ l_ij). Now we observe that determining the intersecting line of three planes is dual to determining the best fitting line through three 3D points. Thus the procedure for fitting a 3D line through a set of 3D points can be used directly to determine the ML-estimate of the best fitting line through three planes, except for the fact that the result of the estimation yields the dual line. First, the square roots of the estimated variance factors σ̂²_0 = Ω/(G − U) range between 0.03 and 3.2. As the number of degrees of freedom, G − U = 2I − 4 = 2 · 3 − 4 = 2, is very low in this case, such a spread is to be expected. The mean value of the variance factors is 1.1, which confirms that the model fits the data. As a second result we analyse the angles between the directions of the 12 lines. As they are clustered into three groups corresponding to the main directions of the building, we should find values close to 0° within a group and values close to 90° between lines of different groups. The results are collected in Fig. 4. The angles between lines in the same group scatter between 0° and 14.5°, the angles between lines of different orientation deviate from 90° by between 0° and 15°. The standard deviations of the angles scatter between 0.4° and 8.3°, which is why none of the deviations from 0° or 90° are significant. The statistical analysis obviously makes the visual impression objective.
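A small sketch of the variance propagation for the projection planes A_ij = P_j^T l_ij used above; the column-major vectorization p_j = vec(P_j) and the resulting Jacobian I_4 ⊗ l_ij^T are assumptions of this sketch, not statements from the paper.

```python
import numpy as np

def projection_plane_with_cov(P, l, Sigma_ll, Sigma_pp):
    """Projection plane A = P^T l and its covariance by variance propagation.

    P        : 3x4 projection matrix
    l        : homogeneous 2D line (3-vector)
    Sigma_ll : 3x3 covariance of l
    Sigma_pp : 12x12 covariance of p = vec(P); column-major vec assumed, so dA/dp = I_4 kron l^T
    """
    A = P.T @ l
    J_l = P.T                                         # dA/dl
    J_p = np.kron(np.eye(4), l.reshape(1, 3))         # 4x12
    Sigma_AA = J_l @ Sigma_ll @ J_l.T + J_p @ Sigma_pp @ J_p.T
    return A, Sigma_AA

# toy numbers
rng = np.random.default_rng(5)
P = np.hstack([np.eye(3), -rng.normal(size=(3, 1))])
l = np.array([0.1, -0.9, 5.0])
A, S = projection_plane_with_cov(P, l, 1e-4 * np.eye(3), 1e-6 * np.eye(12))
print(A.shape, S.shape)                               # (4,) (4, 4)
```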
 #  l_min [pel]    1     2     3     4     5      6      7      8      9     10     11     12
 1     173         −    2.6°  2.7°  3.0°  88.6°  89.0°  88.7°  76.5°  86.7°  87.0°  86.6°  85.2°
 2     155        0.7    −    0.7°  1.6°  89.9°  89.7°  87.3°  75.1°  89.0°  89.2°  88.9°  87.4°
 3      72        0.7   0.0    −    0.9°  89.4°  89.8°  87.9°  75.6°  89.4°  89.6°  89.3°  87.8°
 4      62        1.2   0.3   0.1    −    88.5°  88.9°  88.7°  76.4°  89.8°  90.0°  89.7°  88.2°
 5     232        0.6   0.0   0.1   0.2    −     0.4°   2.8°  15.3°  89.6°  89.3°  89.7°  89.1°
 6     153        0.3   0.1   0.0   0.1   0.1     −     2.4°  14.9°  89.5°  89.2°  89.6°  89.1°
 7      91        0.5   0.8   0.4   0.2   0.8    0.6     −    12.5°  89.0°  88.7°  89.1°  88.6°
 8     113        1.0   1.1   1.1   0.9   1.1    1.1    0.8     −    87.0°  86.7°  87.2°  87.0°
 9     190        1.6   0.4   0.3   0.1   0.2    0.3    1.0    1.3     −     0.4°   0.2°   1.6°
10      82        1.4   0.3   0.2   0.0   0.4    0.4    1.2    1.4    0.5     −     0.5°   1.8°
11     103        1.6   0.5   0.4   0.2   0.2    0.2    0.9    1.2    0.3    0.6     −     1.6°
12     225        2.4   1.1   1.1   1.0   0.5    0.5    1.3    1.5    3.2    3.6    4.0     −
Fig. 4. Result of determining the 12 lines from Fig. 3. Left column: minimal length l_min of the three line segments involved; upper right triangle: angles between the lines; lower left triangle: test statistic for the deviation from 0° or 90°.
3 Conclusions and Outlook
We developed a rigorous estimation scheme for all types of entities in projective spaces, especially points, lines and planes in 2D and 3D, together with the corresponding transformations. The estimation requires only the minimum number of parameters for each entity and thus avoids the redundancy of the homogeneous representations. Therefore no additional constraints are required to enforce the normalization of the entities, or to enforce additional constraints such as the Plücker constraint for 3D lines. In addition we not only obtain a minimal representation for the uncertainty of the geometric elements, but also simple means to determine the Mahalanobis distance between two elements, which may be used for testing or for grouping. In both cases the estimation requires the covariance matrices of the observed entities to be invertible, which is made possible by the proposed reduced representation. We demonstrated the superiority of the rigorous setup compared to a suboptimal classical method for determining vanishing points, and proved the rigour of the method with the estimation of 3D lines from points or planes. The convergence properties when using the proposed reduced representation do not change, as the solution steps are algebraically equivalent. The main advantage of the proposed concept is the ability to handle elements at or close to infinity without losing numerical stability, and that it is not necessary to introduce additional constraints to enforce the geometric entities to lie on their manifold. The concept certainly can be extended to higher level algebras, such as the geometric or the conformal algebra (see [10]), where the motivation to use minimal representations is even higher than in our context.
Acknowledgement. I acknowledge the valuable recommendations of the reviewers.
References
1. Åström, K.: Using Combinations of Points, Lines and Conics to Estimate Structure and Motion. In: Proc. of Scandinavian Conference on Image Analysis (1998)
2. Bartoli, A., Sturm, P.: Nonlinear estimation of the fundamental matrix with minimal parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 426–432 (2004)
3. Bartoli, A., Sturm, P.: Structure-from-motion using lines: Representation, triangulation and bundle adjustment. Computer Vision and Image Understanding 100 (2005)
4. Begelfor, E., Werman, M.: How to put probabilities on homographies. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1666–1670 (2005)
5. Bohling, G.C., Davis, J.C., Olea, R.A., Harff, J.: Singularity and Nonnormality in the Classification of Compositional Data. Int. Assoc. for Math. Geology 30(1), 5–20 (1998)
6. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: Proc. Computer Vision and Pattern Recognition, pp. 8–15 (1998)
7. Collins, R.: Model Acquisition Using Stochastic Projective Geometry. PhD thesis, Department of Computer Science, University of Massachusetts (1993); also published as UMass Computer Science Technical Report TR95-70
8. Criminisi, A.: Accurate Visual Metrology from Single and Multiple Uncalibrated Images. Distinguished Dissertations. Springer, Heidelberg (2001)
9. Förstner, W.: Optimal Vanishing Point Detection and Rotation Estimation of Single Images of a Legoland Scene. In: Int. Archives for Photogrammetry and Remote Sensing, vol. XXXVIII/3A, Paris (2010)
10. Gebken, C.: Conformal Geometric Algebra in Stochastic Optimization. PhD thesis, University of Kiel, Institute of Computer Science (2009)
11. Heuel, S.: Uncertain Projective Geometry: Statistical Reasoning For Polyhedral Object Reconstruction. LNCS. Springer, New York (2004)
12. Jain, A., Kurz, C., Thormählen, T., Seidel, H.P.: Exploiting Global Connectivity Constraints for Reconstruction of 3D Line Segments from Images. In: Conference on Computer Vision and Pattern Recognition (2010)
13. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science, Amsterdam (1996)
14. Koch, K.R.: Parameter Estimation and Hypothesis Testing in Linear Models. Springer, Heidelberg (1999)
15. Liebowitz, D.: Camera Calibration and Reconstruction of Geometry from Images. PhD thesis, University of Oxford, Dept. Engineering Science (2001)
16. McGlone, C.J., Mikhail, E.M., Bethel, J.S.: Manual of Photogrammetry. Am. Soc. of Photogrammetry and Remote Sensing (2004)
17. Meidow, J., Beder, C., Förstner, W.: Reasoning with uncertain points, straight lines, and straight line segments in 2D. International Journal of Photogrammetry and Remote Sensing 64, 125–139 (2009)
18. Pottmann, H., Wallner, J.: Computational Line Geometry. Springer, Heidelberg (2010)
19. Rosenhahn, B., Granert, O., Sommer, G.: Monocular pose estimation of kinematic chains. In: Dorst, L., Doran, C., Lasenby, J. (eds.) Applications of Geometric Algebra in Computer Science and Engineering, Proc. AGACSE 2001, Cambridge, UK, pp. 373–375. Birkhäuser, Boston (2002)
20. Sturm, P., Gargallo, P.: Conic fitting using the geometric distance. In: Proceedings of the Asian Conference on Computer Vision, Tokyo, Japan, vol. 2, pp. 784–795. Springer, Heidelberg (2007)
Personalized 3D-Aided 2D Facial Landmark Localization Zhihong Zeng, Tianhong Fang, Shishir K. Shah, and Ioannis A. Kakadiaris Computational Biomedicine Lab Department of Computer Science University of Houston 4800 Calhoun Rd, Houston, TX 77004
Abstract. Facial landmark detection in images obtained under varying acquisition conditions is a challenging problem. In this paper, we present a personalized landmark localization method that leverages information available from 2D/3D gallery data. To realize a robust correspondence between gallery and probe key points, we present several innovative solutions, including: (i) a hierarchical DAISY descriptor that encodes larger contextual information, (ii) a Data-Driven Sample Consensus (DDSAC) algorithm that leverages the image information to reduce the number of required iterations for robust transform estimation, and (iii) a 2D/3D gallery pre-processing step to build personalized landmark metadata (i.e., local descriptors and a 3D landmark model). We validate our approach on the Multi-PIE and UHDB14 databases, and by comparing our results with those obtained using two existing methods.
1 Introduction
Landmark detection and localization are active research topics in the field of computer vision due to a variety of potential applications (e.g., face recognition, facial expression analysis, and human computer interaction). Numerous researchers have proposed a variety of approaches to build general facial landmark detectors. Most methods follow a learning-based approach where the variations in each landmark's appearance as well as the geometric relationship between landmarks are characterized based on a large amount of training data. These methods are typically generalized to solve the problem of landmark detection for any input image and do not necessarily use any information from the gallery images.
In contrast to the existing methods, we propose a personalized landmark localization framework. Our approach provides a framework wherein the problem of training across large data sets is alleviated, and instead focuses on the efficient encoding of landmarks in the gallery data and their counterparts in the probe image. Our framework includes two stages. The first stage allows for off-line processing of the gallery data to generate an appearance model for each landmark and a 3D landmark model that encodes the relationship between the landmarks. The second stage is performed on-line on the input probe image to identify potential
Fig. 1. Diagram of our personalized landmark localization pipeline. The upper part represents the gallery data processing and the bottom part depicts the landmark localization step in a probe image.
landmarks and establish a match between the models derived at the first stage using a constrained optimization scheme. We propose several computational methods to improve the accuracy and efficiency of landmark detection and localization. Specifically, we extend the existing DAISY descriptor [1–3] and propose a hierarchical formulation that incorporates contextual information at multiple scales. Motivated by robust sampling approaches, we propose a Data-Driven SAmple Consensus (DDSAC) method for estimating the optimal landmark configuration. Finally, we introduce a hierarchical landmark searching scheme for efficient convergence of the localization process. We test our method on the CMU Multi-PIE [4] and the UHDB14 [5] databases. The results obtained are compared to the latest version of a method based on the Active Shape Model [6, 7] and a state-of-the-art classifier-based facial feature point detector [8, 9]. The experimental results demonstrate the advantages of our proposed solution.
2 Related Work
The problem of facial landmark detection and localization has attracted much attention from researchers in the field of computer vision. Most of the existing efforts toward this task follow the spirit of the Point Distribution Model (PDM) [10]. The implementation of a PDM framework includes two main subtasks: local search for PDM landmarks in the neighborhood of their current estimates, and optimization of the PDM configuration such that the global geometric compatibility is jointly maximized. For the first subtask, most of the existing efforts are based on an exhaustive independent local search for each landmark using feature detectors. A common theme is to build a probability distribution for the landmark locations in order to reduce sensitivity to local minima. Both parametric [11, 12] and non-parametric representations [13] of the landmark distributions have been proposed. For the
second subtask, most of the studies are based on a 2D geometric model with similarity or affine transformation constraints (e.g., [6, 11, 14–16]). Few studies (e.g., [17, 18]) have investigated 3D generative models and weak-perspective projection constraints. For the optimization process, most existing approaches use gradient descent methods (e.g., [6, 10]). A few researchers have also used MCMC [19, 20] and RANSAC [17] that theoretically are able to find the global optimal configuration, but in practice these methods are hampered by the curse of dimensionality. Furthermore, classifier-based (e.g., [8, 21]) and regression-based (e.g., [15, 22]) approaches have also been used. Across all methods, a large training database has been the key to enhancing the power of detecting landmarks and modeling their variations. For example, Liu et al. [19] use 280 distributions modeled by GMM to learn the landmark variations in both position and appearance. Similarly, Liang et al. [21] use 4,002 images for training classifiers and 120,000 image patches to learn directional classifiers for each facial component. Our objective is to explore the possibility of leveraging the gallery data to reduce the cost of training. The study closest to our work is the recent investigation by Li et al. [23], where they use both the training data and a gallery image to build a person-specific appearance landmark model to improve the accuracy of landmark localization. However, there are several differences between our work and their work: (i) our method employs only one gallery image whereas Li et al. [23] also use additional training data and (ii) our method uses the 3D geometric relationship of landmarks to constrain the search.
3 Method
3.1 Framework
Our personalized landmark localization framework (URxD-L) includes two stages (Fig. 1). In the gallery processing step, personalized landmark metadata (i.e., landmark descriptor templates and a 3D landmark model) are computed. When only 2D images are available in the gallery, we use a statistical landmark model and a 2D-3D fitting method to obtain a 3D personalized landmark model. Next, the landmark descriptor templates are computed on the projected landmark positions. If 3D data is also available in the gallery, we can obtain a personalized 3D landmark model directly from the annotated landmarks on the 3D data. We also use an Annotated Face Model (AFM) [24] to align the 3D face data and generate multiple images from a variety of viewpoints, which are then used to compute multi-view landmark descriptors. Landmark localization in a probe image is achieved through correspondence between the landmark descriptors in the gallery image and those in the probe image, constrained by the geometric relationship derived for the specific individual. This process includes efficient descriptor matching and robust estimation.
3.2 Matching Problem Formulation
In our framework, landmark localization is posed as an energy minimization problem. The energy function E(X) depends on X = {xj , j = 1, · · · , n}, which
denotes the configuration to be estimated, whereas x_j denotes the coordinates of the j-th landmark. Specifically, E(X) is defined as a weighted sum of two energy terms:

E(X) = λ_a E^a(X) + λ_g E^g(X),   (1)

where λ_a and λ_g are non-negative scalar weights (λ_a + λ_g = 1), and the energy terms E^a(X) and E^g(X) denote the associated appearance and geometric terms, respectively. The energy term E^a(X) is a sum of unary terms that favors correspondence between similar appearance descriptors across two images:

E^a(X) = Σ_{k∈K} ϕ_k,   (2)
where ϕ_k is the L2 distance of the descriptor pair between the two images, and K is the set of all correspondences between the gallery and probe images. The energy term E^g(X) is a measure of geometric compatibility of correspondences between the gallery image and the probe image, and it favors the correspondences that are compatible with the geometric relationship of facial landmarks (the shape of the landmarks) under a predefined transformation. In this paper, we use weak-perspective projection of the 3D landmark model to form the geometric constraints. Thus, E^g(X) penalizes any deviation from these constraints and is given by:

E^g(X) = ||X − U||,   (3)

where U is the target configuration and X is the estimated configuration. We define E^g as the L2 distance between the projected landmarks from the 3D landmark template and their estimated positions in the probe image.
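A minimal sketch of evaluating the combined energy (1)-(3), with descriptors and configurations as plain arrays; the function and variable names are illustrative and not part of the method's implementation.

```python
import numpy as np

def total_energy(descr_gallery, descr_probe, X_est, U_proj, lam_a=0.5):
    """E = lam_a * E_a + lam_g * E_g with lam_a + lam_g = 1, cf. (1)-(3).

    descr_gallery, descr_probe : n x D arrays of corresponding landmark descriptors
    X_est  : n x 2 estimated landmark positions in the probe image
    U_proj : n x 2 landmarks projected from the 3D model (target configuration)
    """
    E_a = np.sum(np.linalg.norm(descr_gallery - descr_probe, axis=1))   # (2)
    E_g = np.linalg.norm(X_est - U_proj)                                # (3)
    return lam_a * E_a + (1.0 - lam_a) * E_g

# toy example
rng = np.random.default_rng(0)
g, p = rng.normal(size=(45, 200)), rng.normal(size=(45, 200))
X, U = rng.normal(size=(45, 2)), rng.normal(size=(45, 2))
print(total_energy(g, p, X, U, lam_a=0.7))
```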
3.3 Personalized 3D Facial Shape Model
The objective of the gallery processing is to obtain a personalized 3D landmark model representing the geometry of the facial landmarks (of the gallery data), which can be detected automatically or through manual annotations. If the gallery includes 3D data, then this geometric relationship is constructed directly from the 3D landmark positions, obtained either manually or automatically [25]. In the case that the gallery contains only 2D images, the geometry is recovered through the use of a statistical 3D landmark model and a 2D-3D fitting process. 3D Statistical Landmark Model: A generic 3D statistical shape model is built based on the BU-3DFE database [26]. The BU-3DFE is a 3D facial expression database collected at State University of New York at Binghamton. We select only the datasets with neutral expressions, one per each of the 100 subjects. Considering the ambiguity of the face contour landmarks, we discard those and keep the 45 landmarks from different facial regions. By aligning the selected 3D landmarks and applying Principal Component Analysis, the statistical shape model is formulated as follows:
S = S_0 + Σ_{j=1}^{L} α_j S_j,   (4)
where S_0 is the 3D mean landmark model, S_j is the j-th principal component, and L denotes the number of principal components retained in the model.
2D-3D landmark fitting process: A regular 3D-to-2D transformation can be approximated with a translation, a rotation and a weak-perspective projection:

P = [ [f, 0, 0; 0, f, 0] R_γ R_ζ R_φ | [t_x; t_y] ],   (5)

where f is the focal length, t_x, t_y are the 2D translations, and R_γ, R_ζ, R_φ denote the image plane rotation, elevation rotation, and azimuth rotation, respectively. The configuration X is given by:

X = P (S_0 + Σ_{j=1}^{L} α_j S_j).   (6)
Based on the correspondences between the 3D landmark model and the 2D gallery image with known landmark positions U, we can estimate the parameters of the transformation matrix P and the shape coefficients α = (α_j, j = 1, ..., L) according to the maximum-likelihood solution as follows:

{P̂, α̂} = argmin_{P,α} ||P (S_0 + Σ_{j=1}^{L} α_j S_j) − U||,   P̂ = U (S_0)^+,   α̂ = A^+ (U − P̂ S_0),   (7)
where A is the matrix constructed from the products P S_j. The details are presented by Romdhani et al. [17]. When detecting landmarks in a probe image, the shape coefficients α are considered to be person-specific and are fixed, while only the transformation P needs to be estimated. The deformation of the personalized 3D face model due to expression variations is not considered in this work.
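A small numpy sketch of the sequential closed-form solution (7), under the assumptions that U is stored as a 2 × n matrix and the shape model in homogeneous 4 × n form; it only illustrates the two steps P̂ = U S_0^+ and α̂ = A^+(U − P̂ S_0) and is not the authors' implementation.

```python
import numpy as np

def fit_weak_perspective(U, S0, S_basis):
    """Estimate P and the shape coefficients alpha as in (7).

    U       : 2 x n observed 2D landmark positions
    S0      : 4 x n mean 3D landmarks in homogeneous coordinates (last row of ones)
    S_basis : list of L arrays, each 4 x n (principal components, last row of zeros)
    """
    P_hat = U @ np.linalg.pinv(S0)                                       # P = U S0^+
    A = np.column_stack([(P_hat @ Sj).ravel(order="F") for Sj in S_basis])
    alpha_hat = np.linalg.pinv(A) @ (U - P_hat @ S0).ravel(order="F")
    return P_hat, alpha_hat

# toy example: 45 landmarks, 5 basis shapes, small true coefficients
rng = np.random.default_rng(1)
n, L = 45, 5
S0 = np.vstack([rng.normal(size=(3, n)), np.ones((1, n))])
S_basis = [np.vstack([rng.normal(size=(3, n)), np.zeros((1, n))]) for _ in range(L)]
alpha_true = 0.1 * rng.normal(size=L)
P_true = np.hstack([rng.normal(size=(2, 3)), rng.normal(size=(2, 1))])
U = P_true @ (S0 + sum(a * S for a, S in zip(alpha_true, S_basis)))
P_hat, alpha_hat = fit_weak_perspective(U, S0, S_basis)
S_fit = S0 + sum(a * S for a, S in zip(alpha_hat, S_basis))
print(np.linalg.norm(P_hat @ S_fit - U), "<=", np.linalg.norm(P_hat @ S0 - U))
```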
3.4 Hierarchical DAISY Representation of a Facial Landmark
A local feature descriptor is used to establish a match between landmarks in the gallery and probe images. Inspired by the DAISY descriptor [1–3] and the ancestry context [27], we propose a hierarchical DAISY descriptor that augments the discriminability of DAISY by including larger context information via a multi-resolution approach (Fig. 2). The contextual cues of facial landmarks are naturally encoded in the face image with this hierarchical representation that relates different components of a face (i.e., nose, eyes and mouth). We construct the hierarchical DAISY descriptor of a facial landmark by progressively enlarging the window of analysis, which is effectively achieved by decreasing the image resolution. Given a face image and a landmark location, we
Fig. 2. The hierarchical DAISY descriptor for the left inner eye corner. (a-c) Images of different resolutions with the corresponding DAISY descriptors.
define the ancestral support regions of a landmark as the set of regions that are obtained by sequentially decreasing the image resolution, covering not only local context information but also global context information. Keeping the patch size (in pixel units) of the DAISY descriptor constant, we compute the DAISY descriptors in these ancestral support regions. Then, these descriptor vectors are concatenated to form a hierarchical DAISY descriptor.
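A sketch of the multi-resolution concatenation behind the hierarchical descriptor. The DAISY computation itself is replaced by a simple gradient-orientation histogram stand-in (local_descriptor), so that any actual DAISY implementation could be substituted; the downsampling and parameter choices are illustrative only.

```python
import numpy as np

def local_descriptor(image, x, y, patch=15):
    """Stand-in for a DAISY descriptor at (x, y): a normalized gradient-orientation
    histogram over a fixed-size patch."""
    h = patch // 2
    p = image[max(y - h, 0): y + h + 1, max(x - h, 0): x + h + 1].astype(float)
    gy, gx = np.gradient(p)
    ang = np.arctan2(gy, gx)
    hist, _ = np.histogram(ang, bins=8, range=(-np.pi, np.pi), weights=np.hypot(gx, gy))
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def hierarchical_descriptor(image, x, y, levels=3):
    """Concatenate descriptors computed at successively halved resolutions,
    keeping the patch size fixed so the support region grows with each level."""
    descs, img, xi, yi = [], image, x, y
    for _ in range(levels):
        descs.append(local_descriptor(img, xi, yi))
        img = img[::2, ::2]          # crude downsampling for the sketch
        xi, yi = xi // 2, yi // 2
    return np.concatenate(descs)

face = np.random.default_rng(2).random((256, 256))
print(hierarchical_descriptor(face, 120, 96).shape)     # (3 * 8,) = (24,)
```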
3.5 Data-Driven SAmple Consensus (DDSAC)
We propose a Data-Driven SAmple Consensus (DDSAC) that is inspired by the principle of RANSAC (RANdom SAmple Consensus) and importance sampling. Although RANSAC has been successfully used for many computer vision problems, several issues still remain open [28]. Examples include the convergence to the correct estimated transformation with fewer iterations, and tuning the parameter that distinguishes inliers from outliers (this parameter is usually set empirically). With these open issues in mind, we propose a Data-Driven SAmple Consensus (DDSAC) approach, which follows the same hypothesize-and-verify framework but improves the traditional RANSAC by using (i) data-driven sampling and (ii) a least median error criterion for selecting the best estimate. The image observations are used as a proposal sampling probability to choose a subset of landmarks, in order to use fewer iterations to obtain the correct landmark configuration. The proposal probability is designed so that it is more likely to choose a landmark subset that is more compatible with the geometric shape of the landmarks as well as their appearance. Given a probe image I, we first search for the best descriptor matching candidates of the landmarks {y j = (uj , v j ), j = 1, · · · , n} where uj and v j are the position and the descriptor of the candidate of the j th landmark, respectively. If we construct a template tuple by B j = (xj , Γ j ), where Γ j is the descriptor template and xj is the initial position (or estimated position from the previous iteration), the proposal probability of choosing j th landmark P (B j ; y j ) can be formulated as: P (B j ; y j ) = P (xj , Γ j ; uj , v j ) = βp(xj |uj ) + (1 − β)q(Γ j |v j ),
(8)
where β is the weight of the probability derived from the geometry-driven process p(x^j|u^j), and (1 − β) is the weight for the probability based on appearance, q(Γ^j|v^j). The term p(x^j|u^j) in Eq. 8 is given by:

p(x^j|u^j) = N(u^j; x^j, σ) / Σ_{r=1,...,n} N(u^r; x^r, σ),   (9)
where N(u^j; x^j, σ) is the probability that the j-th landmark point is located at u^j given the landmark initial position x^j. In this work, we model N(u^j; x^j, σ) using a Gaussian distribution with mean the initial position x^j. In order to reduce the sensitivity to initialization, we use robust estimation to compute x^j as follows. First, a large search area is defined to obtain the best descriptor match of the landmarks. Second, we compute the distances between these match locations and the initial landmark positions. Finally, the initial configuration X is translated based on the match that yields the median of these distances.

Algorithm 1. Data-Driven SAmple Consensus (DDSAC)
Input: A tentative set of corresponding points from descriptor template matching and a geometric constraint (i.e., personalized 3D landmark model and a weak-perspective transformation)
1. Select elements with the proposal probability defined by Eq. 8
2. Combine the selected points and delete the duplicates to obtain a subset of samples
3. Estimate the transformation parameters (i.e., generate a hypothesis)
4. Project the template shape and compute the median of errors between the projected locations and the locations of the landmark candidates
5. Repeat Steps 1-4 for m iterations
6. Select the estimated configuration with the least median error, and sort the landmark candidates by the distances between their positions and this configuration
7. Re-estimate the configuration based on the first k landmark candidates in the list sorted in ascending order
Output: The estimated transformation and matching results
The second term in Eq. 8, q(Γ^j|v^j), defines the uncertainty of the best descriptor match position for the j-th landmark based on the descriptor similarity. It can be formulated as follows:

q(Γ^j|v^j) = P(v^j; Γ^j, Ω^j) / Σ_{r=1,...,n} P(v^r; Γ^r, Ω^r),   (10)
where P(v^j; Γ^j, Ω^j) is the probability of the j-th landmark point being at position u^j given the landmark descriptor template Γ^j. A Gaussian distribution is also used here to define P(v^j; Γ^j, Ω^j), with the mean being the descriptor template from the gallery image. The pipeline for estimating the landmark matches and the shape transformation is presented in Alg. 1.
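A schematic version of Alg. 1: data-driven sampling with a least-median-error criterion. For compactness the weak-perspective estimation of Sect. 3.3 is replaced by a 2D affine fit, and all names and parameter values belong to this sketch, not to the paper.

```python
import numpy as np

def ddsac(candidates, model_pts, proposal_probs, n_iters=50, sample_size=6, rng=None):
    """Hypothesize-and-verify with data-driven sampling and a least-median criterion.

    candidates     : n x 2 best descriptor-match positions u^j in the probe image
    model_pts      : n x 2 corresponding model landmarks (e.g., projected 3D model)
    proposal_probs : length-n proposal probabilities P(B^j; y^j), cf. (8)
    """
    rng = rng or np.random.default_rng()
    n = len(candidates)
    best_T, best_med = None, np.inf
    M = np.hstack([model_pts, np.ones((n, 1))])              # homogeneous model points
    for _ in range(n_iters):
        idx = np.unique(rng.choice(n, size=sample_size, p=proposal_probs))
        if len(idx) < 3:                                     # an affine fit needs >= 3 points
            continue
        T, *_ = np.linalg.lstsq(M[idx], candidates[idx], rcond=None)   # 3x2 affine transform
        errors = np.linalg.norm(M @ T - candidates, axis=1)
        med = np.median(errors)
        if med < best_med:
            best_med, best_T = med, T
    return best_T, best_med

# toy run: 45 landmarks, 20% gross outliers
rng = np.random.default_rng(3)
model = rng.uniform(0, 100, size=(45, 2))
T_true = np.array([[1.1, 0.05], [-0.03, 0.95], [5.0, -3.0]])
cand = np.hstack([model, np.ones((45, 1))]) @ T_true + rng.normal(0, 0.5, (45, 2))
cand[:9] += rng.uniform(20, 40, size=(9, 2))                 # outliers
probs = np.full(45, 1 / 45)                                  # uniform proposal for the toy
T_hat, med = ddsac(cand, model, probs, rng=rng)
print(np.round(T_hat - T_true, 2), round(med, 2))
```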
Algorithm 2. 2D gallery processing
Input: A 2D face image associated with a subject ID and the annotated landmarks
1. Build a personalized 3D landmark model (geometric relationship of landmarks) by using landmark correspondences between the 2D image and the 3D statistical landmark model (Eq. 7)
2. Compute the personalized descriptors of the annotated landmarks from the 2D image
Output: Personalized metadata (landmark descriptors and a 3D landmark model)
Algorithm 3. 3D gallery processing
Input: 3D textured data associated with a subject ID and the annotated landmarks
1. Align the 3D face data with the Annotated Face Model (AFM) [24]
2. Orthographically project the 3D texture surface and landmark positions to predefined viewing angles
3. Compute the descriptors at the projected locations of the landmarks on the projections
Output: Personalized metadata (multi-view landmark descriptors and a 3D landmark model)
In our experiments, we empirically set m = 50, β = 0.8, σ = 0.1 W (W being the width of the face image) and Ω^j = 10. Figure 5(b) depicts the effect of selecting k, or the ratio k/n, on the results. Based on the sampled subset of the tentative correspondences, we are able to estimate the parameters of the weak-perspective matrix P with the maximum-likelihood solution described in Section 3.3.
3.6 Gallery/Probe Processing
The algorithms for 2D gallery, 3D gallery, and probe processing are presented in Alg. 2, Alg. 3, and Alg. 4, respectively. For 2D gallery image processing (frontal face image without 3D data), we run the ASM algorithm [6, 7] to automatically obtain 68 landmarks in order to reduce the annotation burden. We selected among them 45 landmarks from eyebrows, eyes, nose and mouth that have correspondences in our 3D statistical landmark model. Then, we manually correct the erroneous annotations from ASM to obtain accurate ground truth. For 3D gallery face data, we manually annotate each of the landmarks on the 3D surface. Given a probe image, we use the hierarchical optimization method to localize the landmarks from coarse level to fine level as presented in Alg. 4.
Algorithm 4. Hierarchical landmark localization on a probe image
Input: A 2D probe face image and the corresponding gallery metadata
1. Scale the gallery landmarks to match the size of the face region
2. Align the geometric center of the landmarks with that of the face image
3. Optimize from coarse to fine resolution:
   a. Generate candidates by the descriptor template matching
   b. Estimate the translation by the median of distances between candidates and the initialization
   c. Run DDSAC to obtain matching results and estimate the transformation
Output: Estimated landmark locations
4 Experiments
We evaluated our method on the Multi-PIE [4] and the UHDB14 [5] databases. Both databases include facial images with pose and illumination variations. The Multi-PIE dataset includes only 2D images while the UHDB14 includes both 3D data and the associated texture images for all subjects. We compare the results of our method to those obtained by the latest ASM-based approach [6, 7] and a classifier-based facial landmark detector [8, 9].
4.1 Multi-PIE
The Multi-PIE database collected at CMU contains facial images from 337 subjects in four acquisition sessions over a span of six months, and under a large variety of poses, illumination and expressions. In our experiments, we chose the data from 129 subjects who attended all four sessions. We used the frontal neutral face images in Session 1 under the Lighting 16 (top frontal flash) condition as the gallery, and the neutral face images of the same subjects with different poses and lighting conditions in Session 2 as the probes. The probe data were divided into cohorts defined in Table 1, each with 129 images from 129 subjects. We obtained 45 landmarks in gallery images by using the method described in Section 3.6. In addition, eight landmarks on the probe images were manually annotated as the ground truth for evaluation. The original images were cropped to 300 × 300 pixels, ensuring that all landmarks are enclosed in the bounding box. Examples of gallery and probe images are depicted in Fig. 3.

Table 1. Probe subset definitions of the Multi-PIE database

                                     Pose 051 (0°)   Pose 130 (30°)
lighting  16 (top frontal flash)          C1              C2
          14 (top right flash)            C3              C4
          00 (no flash)                   C5              C6
Fig. 3. Examples of gallery and probe images from Multi-PIE. (a) Gallery image from Session 1 with Light 16 and Pose 051 (b-g) Probe images from Session 2 with same ID as the gallery. Example images with Light 16 and Pose 051 (b), Light 16 and Pose 130 (c), Light 14 and Pose 051 (d), Light 14 and Pose 130 (e), Light 00 and Pose 051 (f), Light 00 and Pose 130 (g).
Fig. 4. Gallery image processing and landmark localization on a probe image. (a) Gallery image with annotations (yellow circles) and fitting results (green crosses). (b) Personalized 3D landmark model. (c-d) Matching results [initialization (black crosses), best descriptor matches (red circles), geometrical fitting (green crosses)] on two levels of resolutions, respectively.
The matching process is depicted in Fig. 4. First, the landmark descriptors are computed on the gallery image and the 3D personalized landmark model is constructed from fitting the 3D statistical landmark model, as illustrated in Figs. 4a and 4b. Then, from coarse to fine resolution, the landmarks are localized by descriptor matching and geometrical fitting, as illustrated in Figs. 4c and 4d. The experimental results are shown in Figs. 5 and 6. All of the results are evaluated in terms of the mean of the normalized errors, which are computed as the ratio of the landmark errors (i.e., distance between the estimated landmark position and the ground truth) and the intra-pupillary distance. The average intra-pupillary distance is about 72 pixels among the frontal face images (cohorts: C1, C3 and C5), and about 62 pixels among images with varying poses (cohorts: C2, C4 and C6). Figure 5(a) depicts the results of our method (URxD-L) using different levels of the DAISY descriptor. It can be observed that the two-level descriptor with larger context region results in an improvement over the one-level descriptor in all probe subsets. Figure 5(b) depicts the effect of ratio (k/n) of landmarks used in DDSAC on the estimation results, where k and n are the number of selected landmarks and the total number of landmarks, respectively. This finding suggests that robust estimation (ratio 0.9-0.4) is better than non-robust estimation (ratio 1.0) in dealing with outliers. Note that the results on probe subsets with frontal
Fig. 5. (a) Normalized error of URxD-L using different levels of DAISY descriptor (b) Effect of ratio (k/n) of landmarks on the estimation results (c) Means of normalized errors of different landmarks under different poses and lighting conditions.
Fig. 6. Comparison among ASM-based [6, 7], Vukadinovic & Pantic detector [8, 9] and URxD-L on Multi-PIE
Note that the results on probe subsets with frontal pose (same as the gallery) have "flatter" curves with lower error rates than the other probe subsets. This can be attributed to the fact that frontal face images have fewer outliers than the non-frontal face images. Figure 5(c) illustrates the normalized localization errors of 8 landmarks (right/left eye outer/inner corners, right/left nostril corners, and right/left mouth corners). We can observe that the eye inner corners are more reliable landmarks in frontal face images. As expected, for face images that are non-frontal (e.g., 30° yaw) the landmarks on the left face region are more reliable than those on the right side, except for the left eye outer corner, which is often occluded by glasses. Figure 6 presents a comparison of landmark localization using three different methods: the latest ASM-based approach [6, 7], a classifier-based landmark detector (Vukadinovic & Pantic) [8, 9], and URxD-L. The results show that our method achieves similar or better results than the ASM-based and Vukadinovic & Pantic detectors. They also suggest that a gallery image and robust descriptors are very helpful in reducing the training burden and achieving good results.
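For concreteness, the error metric used throughout this section can be computed as in the minimal sketch below; the function and variable names are our own, and landmarks are assumed to be given as (x, y) pixel coordinates.

```python
import numpy as np

def normalized_landmark_errors(estimated, ground_truth, right_pupil, left_pupil):
    """Per-landmark error divided by the intra-pupillary distance (as in Section 4.1).

    estimated, ground_truth: (N, 2) arrays of landmark coordinates in pixels.
    right_pupil, left_pupil: (2,) pupil centres of the same image.
    """
    ipd = np.linalg.norm(np.asarray(right_pupil) - np.asarray(left_pupil))
    errors = np.linalg.norm(np.asarray(estimated) - np.asarray(ground_truth), axis=1)
    return errors / ipd

# Mean normalized error for one image (averaged again over a probe cohort):
# mean_err = normalized_landmark_errors(est, gt, rp, lp).mean()
```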
4.2 UHDB14 Database
To evaluate the performance of URxD-L for the case where 3D data are available in the gallery, we collected a set of 3D data with associated 2D texture images using two calibrated 3dMD cameras [29] and a set of 2D images of the same subjects using a Canon XTi DSLR camera. This database, which we name
Fig. 7. (a-d) Examples of gallery and probe cohorts from UHDB14. Gallery images generated from the 3D data (a) -30◦ yaw and (b) 30◦ yaw. Probe images (c) -30◦ yaw and (d) 30◦ yaw. (e) Comparison among ASM-based [6, 7], Vukadinovic & Pantic detector [8, 9] and URxD-L on UHDB14.
UHDB14 [5], includes 11 subjects (3 females and 8 males) with varying yaw angles (−30°, 0°, 30°). We use the textured 3D data as the gallery, and the 2D images with yaw angles −30° and 30° as our probe cohorts. Since the data were obtained with different sensors, the illuminations of the gallery and the probe images are different. Some sample images from the gallery and probe cohorts are shown in Figs. 7(a-d). We manually annotated 45 landmarks on the gallery 3D data around each facial component (eyebrows, eyes, nose and mouth), and 6 landmarks on the probe images as the ground truth for evaluation. The average intra-pupillary distances of the two probe cohorts with different yaw angles are 72 and 54 pixels, respectively. We evaluated our method on the 6 annotated landmarks, and compared the results against those from the ASM-based [6, 7] and Vukadinovic & Pantic [8, 9] detectors in terms of average normalized error. Figure 7(e) presents the comparison in terms of the mean of the normalized landmark errors. Note that our method outperforms both.
5 Conclusions
In this paper, we have proposed a 2D facial landmark localization method, URxD-L, that leverages a 2D/3D gallery image to carry out a person-specific search. Specifically, a framework is proposed in which descriptors of facial landmarks as well as a 3D landmark model that encodes the geometric relationship between landmarks are computed by processing a given 2D or 3D gallery dataset. Online processing of the 2D probe, targeted towards the face authentication problem [30], focuses on establishing correspondences between key points on the probe image and the landmarks on the corresponding gallery image. To realize this framework, we have presented several methods, including a hierarchical DAISY descriptor and a Data-Driven SAmple Consensus algorithm. Our approach was compared with two existing methods and exhibits better performance.
Acknowledgement. This work was supported in part by the US Army Research Laboratory award DWAM80750 and UH Eckhard Pfeiffer Endowment. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of our sponsors.
References

1. Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 815–830 (2010)
2. Winder, S., Brown, M.: Learning local image descriptors. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, pp. 1–8 (2007)
3. Winder, S., Hua, G., Brown, M.: Picking the best DAISY. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, pp. 178–185 (2009)
4. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-PIE. In: Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands (2008)
5. UHDB14, http://cbl.uh.edu/URxD/datasets/
6. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
7. STASM, http://www.milbo.users.sonic.net/stasm/
8. Vukadinovic, D., Pantic, M.: Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In: Proc. IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, Hawaii, USA, pp. 1692–1698 (2005)
9. FFPD, http://www.doc.ic.ac.uk/~maja/
10. Cootes, T., Taylor, C.: Active shape models: Smart snakes. In: Proc. British Machine Vision Conference, Leeds, UK, pp. 266–275 (1992)
11. Gu, L., Kanade, T.: A generative shape regularization model for robust face alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 413–426. Springer, Heidelberg (2008)
12. Wang, Y., Lucey, S., Cohn, J.: Enforcing convexity for improved alignment with constrained local models. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, AK (2008)
13. Saragih, J., Lucey, S., Cohn, J.: Face alignment through subspace constrained mean-shifts. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, pp. 1034–1041 (2009)
14. Cootes, T., Walker, K., Taylor, C.: View-based active appearance models. In: Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 227–232 (2002)
15. Cristinacce, D., Cootes, T.: Boosted regression active shape models. In: Proc. British Machine Vision Conference, University of Warwick, United Kingdom, pp. 880–889 (2007)
16. Liang, L., Wen, F., Xu, Y., Tang, X., Shum, H.: Accurate face alignment using shape constrained Markov network. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, pp. 1313–1319 (2006)
17. Romdhani, S., Vetter, T.: 3D probabilistic feature point model for object detection and recognition. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN (2007)
18. Gu, L., Kanade, T.: 3D alignment of face in a single image. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, pp. 1305–1312 (2006)
19. Liu, C., Shum, H.Y., Zhang, C.: Hierarchical shape modeling for automatic face localization. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 687–703. Springer, Heidelberg (2002)
20. Tu, J., Zhang, Z., Zeng, Z., Huang, T.S.: Face localization via hierarchical condensation with Fisher boosting feature selection. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 719–724 (2004)
21. Liang, L., Xiao, R., Wen, F., Sun, J.: Face alignment via component-based discriminative search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 72–85. Springer, Heidelberg (2008)
22. Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using boosted regression and graph models. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, pp. 2729–2736 (2010)
23. Li, P., Prince, S.: Joint and implicit registration for face recognition. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, pp. 1510–1517 (2009)
24. Kakadiaris, I., Passalis, G., Toderici, G., Murtuza, M., Lu, Y., Karampatziakis, N., Theoharis, T.: Three-dimensional face recognition in the presence of facial expressions: An annotated deformable model approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 640–649 (2007)
25. Perakis, P., Passalis, G., Theoharis, T., Toderici, G., Kakadiaris, I.: Partial matching of interpose 3D facial data for face recognition. In: Proc. 3rd IEEE International Conference on Biometrics: Theory, Applications and Systems, Arlington, VA, pp. 439–446 (2009)
26. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.: A 3D facial expression database for facial behavior research. In: Proc. 7th International Conference on Automatic Face and Gesture Recognition, Southampton, UK, pp. 211–216 (2006)
27. Lim, J., Arbelaez, P., Gu, C., Malik, J.: Context by region ancestry. In: Proc. 12th IEEE International Conference on Computer Vision, Kyoto, Japan, pp. 1978–1985 (2009)
28. Proc. 25 Years of RANSAC (Workshop in conjunction with CVPR), New York, NY (2006)
29. 3dMD, http://www.3dmd.com/
30. Toderici, G., Passalis, G., Zafeiriou, S., Tzimiropoulos, G., Petrou, M., Theoharis, T., Kakadiaris, I.: Bidirectional relighting for 3D-aided 2D face recognition. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 2721–2728 (2010)
A Theoretical and Numerical Study of a Phase Field Higher-Order Active Contour Model of Directed Networks

Aymen El Ghoul, Ian H. Jermyn, and Josiane Zerubia

ARIANA - Joint Team-Project INRIA/CNRS/UNSA, 2004 route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France
Abstract. We address the problem of quasi-automatic extraction of directed networks, which have characteristic geometric features, from images. To include the necessary prior knowledge about these geometric features, we use a phase field higher-order active contour model of directed networks. The model has a large number of unphysical parameters (weights of energy terms), and can favour different geometric structures for different parameter values. To overcome this problem, we perform a stability analysis of a long, straight bar in order to find parameter ranges that favour networks. The resulting constraints necessary to produce stable networks eliminate some parameters, replace others by physical parameters such as network branch width, and place lower and upper bounds on the values of the rest. We validate the theoretical analysis via numerical experiments, and then apply the model to the problem of hydrographic network extraction from multi-spectral VHR satellite images.
1 Introduction
The automatic identification of the region in the image domain corresponding to a specified entity in the scene (‘extraction’) is a difficult problem because, in general, local image properties are insufficient to distinguish the entity from the rest of the image. In order to extract the entity successfully, significant prior knowledge K about the geometry of the region R corresponding to the entity of interest must be provided. This is often in the form of explicit guidance by a user, but if the problem is to be solved automatically, this knowledge must be incorporated into mathematical models. In this paper, we focus on the extraction of ‘directed networks’ (e.g. vascular networks in medical imagery and hydrographic networks in remote sensing imagery). These are entities that occupy network-like regions in the image domain, i.e. regions consisting of a set of branches that meet at junctions. In addition, however, unlike undirected networks, directed networks carry a unidirectional flow. The existence of this flow, which is usually conserved, encourages the network region to have certain characteristic geometric properties: branches tend not to end; different branches may have very different widths; width changes slowly along each branch; at junctions,
total incoming width and total outgoing width tend to be similar, but otherwise width can change dramatically. Creating models that incorporate these geometric properties is non-trivial. First, the topology of a network region is essentially arbitrary: it may have any number of connected components each of which may have any number of loops. Shape modelling methods that rely on reference regions or otherwise bounded zones in region space, e.g. [1–7], cannot capture this topological freedom. In [8], Rochery et al. introduced the ‘higher-order active contour’ (HOAC) framework, which was specifically designed to incorporate non-trivial geometric information about the region of interest without constraining the topology. In [9], a phase field formulation of HOACs was introduced. This formulation possesses several advantages over the contour formulation. Both formulations were used to model undirected networks in which all branches have approximately the same width. While these models were applied successfully to the extraction of road networks from remote sensing images, they fail on directed networks in which branches have widely differing widths, and where information about the geometry of junctions can be crucial for successful extraction. Naive attempts to allow a broader range of widths simply results in branches whose width changes rapidly: branch rigidity is lost. In [10], El Ghoul et al. introduced a phase field HOAC model of directed networks. This extends the phase field model of [8] by introducing a vector field which describes the ‘flow’ through network branches. The main idea of the model is that the vector field should be nearly divergence-free, and of nearly constant magnitude within the network. Flow conservation then favours network regions in which the width changes slowly along branches, in which branches do not end, and in which total incoming width equals total outgoing width at junctions. Because flow conservation preserves width in each branch, the constraint on branch width can be lifted without sacrificing rigidity, thereby allowing different branches to have very different widths. The model was tested on a synthetic image. The work in [10] suffers from a serious drawback, however. The model is complex and possesses a number of parameters that as yet must be fixed by trial and error. Worse, depending on the parameter values, geometric structures other than networks may be favoured. There is thus no way to know whether the model is actually favouring network structures for a given set of parameter values. The model was applied to network extraction from remote sensing images in an application paper complementary to this one [11], but that paper focused on testing various likelihood terms and on the concrete application. The model was not analysed theoretically and no solution to the parameter problem was described. In this paper, then, we focus on a theoretical analysis of the model. In particular, we perform a stability analysis of a long, straight bar, which is supposed to be an approximation to a network branch. We compute the energy of the bar and constrain the first order variation of the bar energy to be zero and the second order variation to be positive definite, so that the bar be an energy
minimum. This results in constraints on the parameters of the model that allow us to express some parameters in terms of others, to replace some unphysical parameters by physical parameters e.g. the bar width, and to place upper and lower bounds on the remaining free parameter values. The results of the theoretical study are then tested in numerical experiments, some using the network model on its own to validate the stability analysis, and others using the network model in conjunction with a data likelihood to extract networks from very high resolution (VHR) multi-spectral satellite images. In Section 2, we recall the phase field formulation of the undirected and directed network models. In Section 3, we introduce the core contribution of this paper: the theoretical stability analysis of a bar under the phase field HOAC model for directed networks. In Section 4, we show the results of geometric and extraction experiments. We conclude in Section 5.
2 Network Models
In this section, we recall the undirected and directed network phase field HOAC models introduced in [9] and [10] respectively.
2.1 Undirected Networks
A phase field is a real-valued function on the image domain: φ : Ω ⊂ R² → R. It describes a region by the map ζz(φ) = {x ∈ Ω : φ(x) > z}, where z is a given threshold. The basic phase field energy is [9]

E_0^s(\phi) = \int_\Omega d^2x \left[ \frac{D}{2}\,\partial\phi\cdot\partial\phi + \lambda\left(\frac{\phi^4}{4} - \frac{\phi^2}{2}\right) + \alpha\left(\phi - \frac{\phi^3}{3}\right) \right]    (1)

For a given region R, Eq. (1) is minimized subject to ζz(φ) = R. As a result, the minimizing function φR assumes the value 1 inside, and −1 outside R thanks to the ultralocal terms. To guarantee two stable phases at −1 and 1, the inequality λ > |α| must be satisfied. We choose α > 0 so that the energy at −1 is less than that at 1. The first term guarantees that φR be smooth, producing a narrow interface around the boundary ∂R interpolating between −1 and +1. In order to incorporate geometric properties, a nonlocal term is added to give a total energy EPs = E0s + ENL, where [9]

E_{NL}(\phi) = -\frac{\beta}{2} \int_{\Omega^2} d^2x\, d^2x' \; \partial\phi(x)\cdot\partial\phi(x')\; \Psi\!\left(\frac{|x - x'|}{d}\right)    (2)

where d is the interaction range. This term creates long-range interactions between points of ∂R (because ∂φR is zero elsewhere) using an interaction function, Ψ, which decreases as a function of the distance between the points. In [9], it was shown that the phase field model is approximately equivalent to the HOAC model, the parameters of one model being expressed as a function of those of the other. For the undirected network model, the parameter ranges can be found via a stability analysis of the HOAC model [12] and subsequently converted to phase field model parameters [9].
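As an illustration only, the local part of Eq. (1) is straightforward to evaluate on a discretized phase field; the sketch below uses a simple finite-difference gradient and a grid spacing h of our own choosing, not the scheme of [9].

```python
import numpy as np

def basic_phase_field_energy(phi, D, lam, alpha, h=1.0):
    """Discretized Eq. (1): sum over pixels of
    D/2 |grad phi|^2 + lam (phi^4/4 - phi^2/2) + alpha (phi - phi^3/3)."""
    gy, gx = np.gradient(phi, h)                      # finite-difference gradient
    gradient_term = 0.5 * D * (gx ** 2 + gy ** 2)
    ultralocal = lam * (phi ** 4 / 4 - phi ** 2 / 2) + alpha * (phi - phi ** 3 / 3)
    return float(np.sum(gradient_term + ultralocal)) * h * h
```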
2.2 Directed Networks
In [10], a directed network phase field HOAC model was introduced by extending the undirected network model described in Section 2.1. The key idea is to incorporate geometric information by the use, in addition to the scalar phase field φ, of a vector phase field, v : Ω → R², which describes the flow through network branches. The total prior energy EP(φ, v) is the sum of a local term E0(φ, v) and the nonlocal term ENL(φ) given by Eq. (2). E0 is [10]

E_0(\phi, v) = \int_\Omega d^2x \left[ \frac{D}{2}\,\partial\phi\cdot\partial\phi + \frac{D_v}{2}(\partial\cdot v)^2 + \frac{L_v}{2}\,\partial v : \partial v + W(\phi, v) \right]    (3)

where W is a fourth order polynomial in φ and |v|, constrained to be differentiable:

W(\phi, v) = \frac{|v|^4}{4} + \left(\lambda_{22}\frac{\phi^2}{2} + \lambda_{21}\phi + \lambda_{20}\right)\frac{|v|^2}{2} + \lambda_{04}\frac{\phi^4}{4} + \lambda_{03}\frac{\phi^3}{3} + \lambda_{02}\frac{\phi^2}{2} + \lambda_{01}\phi    (4)
Similarly to the undirected network model described in Section 2.1, where −1 and 1 are the two stable phases, the form of the the ultralocal term W (φ, v) means that the directed network model has two stable configurations, (φ, |v|) = (−1, 0) and (1, 1), which describe the exterior and interior of R respectively. The first term in Eq. (3) guarantees the smoothness of φ. The second term penalizes the divergence of v. This represents a soft version of flow conservation, but the parameter multiplying this term will be large so that in general the divergence will be small. The divergence term is not sufficient to ensure smoothness of v: a small smoothing term (∂v : ∂v = m,n (∂m v n )2 , where m, n ∈ {1, 2} label the two Euclidean coordinates) is added to encourage this. The divergence term, coupled with the change of magnitude of v from 0 to 1 across ∂R, strongly encourages the vector field v to be parallel to ∂R. The divergence and smoothness terms then propagate this parallelism to the interior of the network, meaning that the flow runs along the network branches. Flow conservation and the fact that |v| ∼ = 1 in the network interior then tend to prolong branches, to favour slow changes of width along branches, and to favour junctions for which total incoming width is approximately equal to total outgoing width. Because the vector field encourages width stability along branches, the interaction function used in [9], which severely constrains the range of stable branch widths, can be replaced by one that allows more freedom. We choose the modified Bessel function of the second kind of order 0, K0 [10]. Different branches can then have very different widths.
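To make the two vector-field terms of Eq. (3) concrete, a minimal sketch of their discretized evaluation follows; the finite differences, the grid spacing h, and the component ordering (v1 along x, v2 along y on a row-major grid) are assumptions of ours, not the scheme used in [10].

```python
import numpy as np

def vector_field_terms(v1, v2, Dv, Lv, h=1.0):
    """Divergence penalty Dv/2 (div v)^2 plus smoothing penalty Lv/2 (dv : dv),
    summed over the grid. v1, v2 are the x- and y-components of v."""
    d_v1_y, d_v1_x = np.gradient(v1, h)
    d_v2_y, d_v2_x = np.gradient(v2, h)
    divergence = d_v1_x + d_v2_y
    smoothness = d_v1_x**2 + d_v1_y**2 + d_v2_x**2 + d_v2_y**2   # dv : dv
    return float(np.sum(0.5 * Dv * divergence**2 + 0.5 * Lv * smoothness)) * h * h
```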
3 Theoretical Stability Analysis
There are various types of stability that need to be considered. First, we require that the two phases of the system, (φ, |v|) = (−1, 0) and (1, 1), be stable to
infinitesimal perturbations, to prevent the exterior or interior of R from ‘decaying’. For homogeneous perturbations, this means they should be local minima of W, which produces constraints reducing the number of free parameters of W from 7 to 4 (these are chosen to be (λ04, λ03, λ22, λ21)). Stability to non-zero frequency perturbations then places lower and upper bounds on the values of the remaining parameters. Second, we require that the exterior (φ, |v|) = (−1, 0), as well as being locally stable, also be the global minimum of the energy. This is ensured if the quadratic form defined by the derivative terms is positive definite. This generates a further inequality on one of the parameters. We do not describe these two calculations here due to lack of space, but the constraints are used in selecting parameters for the experiments. Third, we require that network configurations be stable. A network region consists of a set of branches, meeting at junctions. To analyse a general network configuration is very difficult, but if we assume that the branches are long and straight enough, then their stability can be analysed by considering the idealized limit of a long, straight bar. The validity of this idealization will be tested using numerical experiments. Ideally, the analysis should proceed by first finding the energy minimizing φRBar and vRBar for the bar region, and then expanding around these values. In practice, there are two obstacles. First, φRBar and vRBar cannot be found exactly. Second, it is not possible to diagonalize exactly the second derivative operator, and thus it is hard to impose positive definiteness for stability. Instead, we take simple ansatzes for φRBar and vRBar and study their stability in a low-dimensional subspace of function space. We define a four-parameter family of ansatzes for φRBar and vRBar, and analyse the stability of this family. Again, the validity of these ansatzes will be tested using numerical experiments. The first thing we need to do, then, is to calculate the energy of this idealized bar configuration.
3.1 Energy of the Bar
Two phase field variables are involved: the scalar field φ and the vector field v. The ansatz is as follows: the scalar field φ varies linearly from −1 to φm across a region interface of width w, otherwise being −1 outside and φm inside the bar, which has width w0; the vector field v is parallel to the bar boundary, with magnitude that varies linearly from 0 to vm across the region interface, otherwise being 0 outside and vm inside the bar. The bar is thus described by the four physical parameters w0, w, φm and vm. To compute the total energy per unit length of the bar, we substitute the bar ansatz into Eq. (2) and (3). The result is

e_P(\hat{w}_0, \hat{w}, \phi_m, v_m) = \hat{w}_0\,\nu(\phi_m, v_m) + \hat{w}\,\mu(\phi_m, v_m) - \beta(\phi_m + 1)^2 G_{00}(\hat{w}_0, \hat{w}) + \frac{\hat{D}(\phi_m + 1)^2 + \hat{L}_v v_m^2}{\hat{w}}    (5)

where ŵ = w/d, ŵ0 = w0/d, D̂ = D/d² and L̂v = Lv/d² are dimensionless parameters. The functions μ, ν and G00 are given in the Appendix. The parameters μ and ν play the same roles as λ and α in the undirected network model given
by Eq. (1). The energy gap between the foreground and the background is equal to 2α in the undirected network model and ν in the directed network model, and ν must be strictly positive to favour pixels belonging to the background. The parameter μ controls the contribution of the region interface of width w: it has an effect similar to the parameter λ.
3.2 Stability of the Bar
The energy per unit length of a network branch, eP, is given by Eq. (5). A network branch is stable in the four-parameter family of ansatzes if it minimizes eP with respect to variations of ŵ0, ŵ, φm and vm. This is equivalent to setting the first partial derivatives of eP equal to zero and requiring its Hessian matrix to be positive definite. The desired value of (φm, vm) is (1, 1) to describe the interior of the region R. Setting the first partial derivatives of eP, evaluated at (ŵ0, ŵ, 1, 1), equal to zero, and after some mathematical manipulation, one finds the following constraints:

\rho^\star - G(\hat{w}_0, \hat{w}) = 0    (6)

\beta = \frac{\nu^\star}{4 G_{10}(\hat{w}_0, \hat{w})}    (7)

\hat{D} = \frac{\hat{w}}{2}\left( \frac{\nu^\star G_{00}(\hat{w}_0, \hat{w})}{2 G_{10}(\hat{w}_0, \hat{w})} - \frac{\hat{w}\,\mu_\phi^\star}{2} \right)    (8)

\hat{L}_v = -\frac{\hat{w}^2 \mu_v^\star}{2}    (9)

where G(ŵ0, ŵ) = [G00(ŵ0, ŵ)/ŵ + G11(ŵ0, ŵ)]/G10(ŵ0, ŵ), ρ⋆ = [μ⋆ + μφ⋆ + μv⋆/2]/ν⋆, ν⋆ = ν(1, 1), μ⋆ = μ(1, 1), μφ⋆ = μφ(1, 1) and μv⋆ = μv(1, 1); μφ = ∂μ/∂φm, μv = ∂μ/∂vm, G10 = ∂G00/∂ŵ0 and G11 = ∂G00/∂ŵ. The starred parameters depend only on the 4 parameters of W: πλ = (λ22, λ21, λ04, λ03). The conditions D̂ > 0 and L̂v > 0 generate lower and upper bounds on ŵ0, ŵ and πλ.

Equation (6) shows that, for fixed πλ, i.e. when ρ⋆ is determined, the set of solutions in the plane (ŵ0, ŵ) is the intersection of the surface representing the function G and the plane located at ρ⋆. The result is then a set of curves in the plane (ŵ0, ŵ), where each corresponds to a value of ρ⋆. The top-left of Fig. 1 shows examples of solutions of Eq. (6) for some values of ρ⋆. The solutions must satisfy the condition ŵ < ŵ0, otherwise the bar ansatz fails. Equation (7) shows that bar stability depends mainly on the scaled parameter β̂ = β/ν⋆. The top-right of Fig. 1 shows a plot of β̂ against the scaled bar width ŵ0 and the scaled interface width ŵ. We have plotted the surface as two half-surfaces: one is lighter (left-hand half-surface, smaller ŵ0) than the other (right-hand half-surface, larger ŵ0). The valley between the half-surfaces corresponds to the minimum value of β̂ for each value of ŵ. The graph shows that: for each value β̂ < β̂min = inf(1/(4G10(ŵ0, ŵ))) = 0.0879, there are no possible values of (ŵ0, ŵ) which satisfy the constraints and so the bar energy does not have extrema.
Fig. 1. Top-left: examples of solutions of Eq. (6) for some values of ρ⋆; curves are labelled by the values of ρ⋆. Top-right: behaviour of β̂; the light and dark surfaces show the locations of maxima and minima respectively. Bottom: bar energy eP against the bar parameters ŵ, ŵ0, φm and vm. Parameter values were chosen so that there is an energy minimum at πB⋆ = (1.36, 0.67, 1, 1).
For each value β̂ > β̂min and for some chosen value ŵ, there are two possible values of ŵ0 which satisfy the constraints: the smaller width (left-hand half-surface) corresponds to an energy maximum and the larger width (right-hand half-surface) corresponds to an energy minimum. The second order variation of the bar energy is described by the 4 × 4 Hessian matrix H, which we do not detail due to lack of space. In order for the bar to be an energy minimum, H must be positive definite at (ŵ0, ŵ, φm, vm) = (ŵ0, ŵ, 1, 1). This generates upper and lower bounds on the parameter values. This positivity condition is tested numerically because explicit expressions for the eigenvalues cannot be found. The lower part of Fig. 1 shows the bar energy, eP, plotted against the physical parameters of the bar πB = (ŵ0, ŵ, φm, vm). The desired energy minimum was chosen at πB⋆ = (1.36, 0.67, 1, 1).¹ We choose a parameter setting, πλ, and then
The parameter values were: (λ04 , λ03 , λ22 , λ21 , D, β, d, Lv , Dv ) = (0.05, 0.025, 0.013, −0.6, 0.0007, 0.003, 1, 0.208, 0).
compute D, β and Lv using the parameter constraints given above. The bar energy per unit length eP indeed has a minimum at the desired πB⋆.
3.3 Parameter Settings in Practice
There are nine free parameters in the model: (λ04, λ03, λ22, λ21, D, β, d, Lv, Dv). Based on the stability analysis in Section 3, two of these parameters are eliminated, while another two can be replaced by the ‘physical’ parameters ŵ0 and ŵ: (λ04, λ03, λ22, λ21, Dv, ŵ0, ŵ). The interface width w should be taken as small as possible compatible with the discretization; in practice, we take 2 < w < 4 [9]. The desired bar width w0 is an application-determined physical parameter. We then fix the parameter values as follows: choose the 4 free parameter values (λ04, λ03, λ22, λ21) of the ultralocal term W; compute ν⋆, μ⋆, μφ⋆, μv⋆ and then ρ⋆, and solve Eq. (6) to give the values of the scaled widths ŵ0 and ŵ; compute β, D̂ and L̂v using the parameter constraints given by Eq. (7), (8) and (9), and then compute D = d²D̂ and Lv = d²L̂v, where d = w0/ŵ0; choose the free parameter Dv; check numerically the stability of the phases (−1, 0) and (1, 1) as described briefly at the beginning of Section 3; check numerically the positive definiteness of the Hessian matrix H.
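The recipe above can be scripted; the sketch below is only an illustration of the flow of Eqs. (6)-(9), under the assumption that the starred quantities and the geometric functions G, G00, G10 (which require the numerical integration of Eq. (13)) are supplied by the caller as callables, and that a bracketing interval for the root of Eq. (6) is known. All names are ours.

```python
from scipy.optimize import brentq

def fix_parameters(rho_star, nu_star, mu_phi_star, mu_v_star,
                   G, G00, G10, w_hat, w0_bracket, w0_desired):
    """Solve Eq. (6) for the scaled bar width, then apply Eqs. (7)-(9)."""
    # Eq. (6): find w0_hat such that rho* - G(w0_hat, w_hat) = 0 inside the bracket.
    w0_hat = brentq(lambda w0: rho_star - G(w0, w_hat), *w0_bracket)
    beta = nu_star / (4.0 * G10(w0_hat, w_hat))                     # Eq. (7)
    D_hat = 0.5 * w_hat * (nu_star * G00(w0_hat, w_hat)
                           / (2.0 * G10(w0_hat, w_hat))
                           - 0.5 * w_hat * mu_phi_star)             # Eq. (8)
    Lv_hat = -0.5 * w_hat ** 2 * mu_v_star                          # Eq. (9)
    d = w0_desired / w0_hat                                         # physical scale
    return {"w0_hat": w0_hat, "beta": beta, "D": d**2 * D_hat,
            "Lv": d**2 * Lv_hat, "d": d}
```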
4 Experiments
In Section 4.1, we present numerical experiments using the new model EP that confirm the results of the theoretical analysis. In Section 4.2, we add a data likelihood and use the overall model to extract hydrographic networks from real images.
4.1 Geometric Evolutions
To confirm the results of the stability analysis, we first show that bars of different widths evolve under gradient descent towards bars of the stable width predicted by theory. Fig. 2 shows three such evolutions.² In all evolutions, we fixed Dv = 0, because the divergence term cannot destabilize the bar when the vector field is initialized as a constant everywhere in the image domain. The width of the initial straight bar is 10. The first row shows that the bar evolves until it disappears because β̂ < β̂min, so that the bar energy does not have a minimum for ŵ0 = 0. On the other hand, the second and third rows show that straight bars evolve toward straight bars with the desired stable widths w0 = 3 and 6, respectively, when β̂ > β̂min. As a second test of the theoretical analysis, we present experiments that show that starting from a random configuration of φ and v, the region evolves under
From top to bottom, parameter values were: (λ04 , λ03 , λ22 , λ21 , D, β, d, Lv , Dv ) = (0.1, −0.064, 0.232, −0.6, 0.065, 0.0027, 3.66, 0.5524, 0), (0.112, 0.019, −0.051, −0.6, 0.055, 0.011, 2.78, 0.115, 0) and (0.1, −0.064, 0.232, −0.6, 0.065, 0.0077, 3.66, 0.5524, 0).
Fig. 2. Geometric evolutions of bars using the directed network model EP with the interaction function K0 . Time runs from left to right. First column: initial configurations, which consist of a straight bar of width 10. The initial φ is −1 in the background and φs in the foreground, and the initial vector field is (1, 0). First row: when βˆ < βˆmin , the initial bar vanishes; second and third rows: when βˆ > βˆmin , the bars evolve toward bars which have the desired stable widths 3 and 6 respectively. The regions are obtained by thresholding the function φ at 0.
Fig. 3. Geometric evolutions starting from random configurations. Time runs from left to right. Parameter values were chosen as a function of the desired stable width, which was 5 for the first row and 8 for the second row. Both initial configurations evolve towards network regions in which the branches have widths equal to the predicted stable width, except near junctions, where branch width changes significantly to accommodate flow conservation. The regions are obtained by thresholding the function φ at 0.
gradient descent to a network of the predicted width. Fig. 3 shows two such evolutions.³ At each point, the pair (φ, |v|) was initialized randomly to be either (−1, 0) or (1, 1). The orientation of v was chosen uniformly on the circle. The evolution in the first row has a predicted stable width w0 = 5, while that in the second row has a predicted stable width w0 = 8. In all cases, the initial configuration evolves to a stable network region with branch widths equal to the
From top to bottom, the parameter values were: (λ04 , λ03 , λ22 , λ21 , D, β, d, Lv , Dv ) = (0.25, 0.0625, 0.0932, −0.8, 0.111, 0.0061, 3.65, 0.0412, 50) and (0.4, −0.1217, 0.7246, −1, 0.4566, 0.0083, 10.8, 0.4334, 100).
predicted value, except near junctions, where branch width changes significantly to accommodate flow conservation.
4.2 Results on Real Images
In this section, we use the new model to extract hydrographic networks from multispectral VHR Quickbird images. The multi-spectral channels are red, green, blue and infra-red. Fig. 4 shows three such images in which parts of the hydrographic network to be extracted are obscured in some way. To apply the model, a likelihood energy linking the region R to the data I is needed in addition to the prior term EP. The total energy to minimize is then E(φ, v) = θEP(φ, v) + EI(φ), where the parameter θ > 0 balances the two energy terms. The likelihood energy EI(φ) is taken to be a multivariate mixture of two Gaussians (MMG) to take into account the heterogeneity in the appearance of the network produced by occlusions. It takes the form

E_I(\phi) = -\frac{1}{2} \int_\Omega dx \left\{ \ln \sum_{i=1}^{2} p_i |2\pi\Sigma_i|^{-1/2} e^{-\frac{1}{2}(I(x)-\mu_i)^t \Sigma_i^{-1}(I(x)-\mu_i)} - \ln \sum_{i=1}^{2} \bar{p}_i |2\pi\bar{\Sigma}_i|^{-1/2} e^{-\frac{1}{2}(I(x)-\bar{\mu}_i)^t \bar{\Sigma}_i^{-1}(I(x)-\bar{\mu}_i)} \right\} \phi(x)    (10)
¯1 , Σ ¯ 2 ) are learnt from where the parameters (p1 , p2 , p¯1 , p¯2 , μ1 , μ2 , μ ¯1 , μ ¯ 2 , Σ1 , Σ2 , Σ the original images using the Expectation-Maximization algorithm; labels 1 and 2 refer to the two Gaussian components and the unbarred and barred parameters refer to the interior and exterior of the region R respectively; t indicates transpose. Fig. 4 shows the results. The second row shows the maximum likelihood estimate of the network (i.e. using the MMG without EP ). The third and fourth rows show the MAP estimates obtained using the undirected network model,4 EPs , and the directed network model,5 EP , respectively. The undirected network model favours network structures in which all branches have the same width. Consequently, network branches which have widths significantly different from the average are not extracted. The result on the third image shows clearly the false negative in the central part of the network branch, where the width is about half the average width. Similarly, the result on the second image shows a false positive at the bottom of the central loop of the network, where the true branch width is small. The result on the first image again shows a small false negative piece in the network junction at the bottom. 4 5
⁴ The parameter values were, from left to right: (λ, α, D, β, d) = (18.74, 0.0775, 11.25, 0.0134, 34.1), (4.88, 0.3327, 3, 0.0575, 4.54) and (19.88, 0.6654, 12, 0.1151, 9.1).
⁵ The parameter values were, from left to right: (λ04, λ03, λ22, λ21, D, β, d, Lv, Dv, θ) = (0.412, −0.0008, 0.0022, −0.6, 0.257, 0.0083, 8.33, 0.275, 50, 25), (0.2650, −0.1659, 0.5023, −0.8, 0.1926, 0.0387, 2.027, 0.2023, 100, 16.66) and (0.4525, −0.2629, 0.6611, −0.8, 0.3585, 0.0289, 5.12, 0.4224, 10, 22.22).
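For reference, the MMG likelihood energy of Eq. (10) is easy to evaluate once the mixture parameters have been fitted; the sketch below uses scipy's multivariate normal densities, omits the EM fitting step, and uses parameter names of our own choosing.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mmg_log_likelihood(I, weights, means, covs):
    """log sum_i p_i N(I(x); mu_i, Sigma_i) at every pixel.
    I: (H, W, B) multispectral image; weights: (2,); means: (2, B); covs: (2, B, B)."""
    pixels = I.reshape(-1, I.shape[-1])
    dens = sum(w * multivariate_normal(mean=m, cov=c).pdf(pixels)
               for w, m, c in zip(weights, means, covs))
    return np.log(dens).reshape(I.shape[:2])

def likelihood_energy(I, phi, fg_params, bg_params):
    """E_I(phi) of Eq. (10): -1/2 sum_x [log p_fg(I(x)) - log p_bg(I(x))] phi(x)."""
    diff = mmg_log_likelihood(I, *fg_params) - mmg_log_likelihood(I, *bg_params)
    return -0.5 * float(np.sum(diff * phi))
```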
Fig. 4. From top to bottom: multispectral Quickbird images; ML segmentation; MAP segmentation using the undirected model EPs ; MAP segmentation using the directed model EP . Regions are obtained by thresholding φ at 0. (Original images DigitalGlobe, CNES processing, images acquired via ORFEO Accompaniment Program).
Table 1. Quantitative evaluations of the experiments on the three images given in Fig. 4. T, F, P, N, UNM and DNM correspond to true, false, positive, negative, undirected network model and directed network model respectively.

Image   Model   Completeness = TP/(TP+FN)   Correctness = TP/(TP+FP)   Quality = TP/(TP+FP+FN)
1       UNM     0.8439                      0.9168                     0.7739
1       DNM     0.8202                      0.9489                     0.7924
2       UNM     0.9094                      0.6999                     0.6043
2       DNM     0.8484                      0.7856                     0.6889
3       UNM     0.5421                      0.6411                     0.4158
3       DNM     0.7702                      0.6251                     0.6513
The directed network model remedies these problems, as shown by the results in the fourth row. The vector field was initialized to be vertical for the first and third images, and horizontal for the second image. φ and |v| are initialized at the saddle point of the ultralocal term W , except for the third image where |v| = 1. The third image shows many gaps in the hydrographic network due mainly to the presence of trees. These gaps cannot be closed using the undirected network model. The directed network model can close these gaps because flow conservation tends to prolong network branches. See Table 1 for quantitative evaluations.
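The three scores reported in Table 1 follow directly from pixel-wise true/false positive/negative counts; a minimal sketch, assuming binary masks for the extracted and reference networks:

```python
import numpy as np

def network_scores(predicted, reference):
    """Completeness, correctness and quality from binary masks (as in Table 1)."""
    predicted = predicted.astype(bool)
    reference = reference.astype(bool)
    tp = np.sum(predicted & reference)
    fp = np.sum(predicted & ~reference)
    fn = np.sum(~predicted & reference)
    return {"completeness": tp / (tp + fn),
            "correctness": tp / (tp + fp),
            "quality": tp / (tp + fp + fn)}
```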
5 Conclusions
In this paper, we have conducted a theoretical study of a phase field HOAC model of directed networks in order to ascertain parameter ranges for which stable networks exist. This was done via a stability analysis of a long, straight bar that enabled some model parameters to be fixed in terms of the rest, others to be replaced by physically meaningful parameters, and lower and upper bounds to be placed on the remainder. We validated the theoretical analysis via numerical experiments. We then added a likelihood energy and tested the model on the problem of hydrographic network extraction from multi-spectral VHR satellite images, showing that the directed network model outperforms the undirected network model. Acknowledgement. The authors thank the French Space Agency (CNES) for the satellite images, and CNES and the PACA Region for partial financial support.
References

1. Chen, Y., Tagare, H., Thiruvenkadam, S., Huang, F., Wilson, D., Gopinath, K., Briggs, R., Geiser, E.: Using prior shapes in geometric active contours in a variational framework. IJCV 50, 315–328 (2002)
2. Cremers, D., Kohlberger, T., Schnörr, C.: Shape statistics in kernel space for variational image segmentation. Pattern Recognition 36, 1929–1943 (2003)
3. Paragios, N., Rousson, M.: Shape priors for level set representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 78–92. Springer, Heidelberg (2002)
4. Srivastava, A., Joshi, S., Mio, W., Liu, X.: Statistical shape analysis: Clustering, learning, and testing. IEEE Trans. PAMI 27, 590–602 (2003)
5. Riklin Raviv, T., Kiryati, N., Sochen, N.: Prior-based segmentation and shape registration in the presence of perspective distortion. IJCV 72, 309–328 (2007)
6. Taron, M., Paragios, N., Jolly, M.P.: Registration with uncertainties and statistical modeling of shapes with variable metric kernels. IEEE Trans. PAMI 31, 99–113 (2009)
7. Vaillant, M., Miller, M.I., Younes, A.L., Trouvé, A.: Statistics on diffeomorphisms via tangent space representations. NeuroImage 23, 161–169 (2004)
8. Rochery, M., Jermyn, I., Zerubia, J.: Higher order active contours. IJCV 69, 27–42 (2006)
9. Rochery, M., Jermyn, I.H., Zerubia, J.: Phase field models and higher-order active contours. In: ICCV, Beijing, China (2005)
10. El Ghoul, A., Jermyn, I.H., Zerubia, J.: A phase field higher-order active contour model of directed networks. In: NORDIA, in conjunction with ICCV, Kyoto, Japan (2009)
11. El Ghoul, A., Jermyn, I.H., Zerubia, J.: Segmentation of networks from VHR remote sensing images using a directed phase field HOAC model. In: ISPRS / PCV, Paris, France (2010)
12. El Ghoul, A., Jermyn, I.H., Zerubia, J.: Phase diagram of a long bar under a higher-order active contour energy: application to hydrographic network extraction from VHR satellite images. In: ICPR, Tampa, Florida (2008)
Appendix

The functions μ, ν and G00 are

\mu = -\frac{3 v_m^4}{20} + \frac{v_m^2}{2}\left[ \frac{\lambda_{22}}{10}(\phi_m+1)(-3\phi_m+2) + \frac{\lambda_{21}}{6}(-3\phi_m+1) + \frac{1}{3} \right] + \frac{\lambda_{04}}{60}(\phi_m+1)^2(-9\phi_m^2+12\phi_m+1) + \frac{\lambda_{03}}{6}(\phi_m+1)^2(-\phi_m+1) + \frac{1}{24}(\lambda_{22}+\lambda_{21})(\phi_m+1)^2    (11)

\nu = \frac{v_m^4}{4} + \frac{v_m^2}{2}\left[ \frac{\lambda_{22}}{2}(\phi_m^2-1) + \lambda_{21}(\phi_m-1) - 1 \right] + \frac{\lambda_{04}}{4}(\phi_m^2-1)^2 + \frac{\lambda_{03}}{3}(\phi_m+1)^2(\phi_m-2) - \frac{1}{8}(\lambda_{22}+\lambda_{21})(\phi_m+1)^2    (12)

G_{00} = \frac{2}{\hat{w}^2} \int_0^{+\infty}\! dz \int_0^{\hat{w}}\! dx_2 \int_{-x_2}^{\hat{w}-x_2}\! dt \left[ \Psi\!\left(\sqrt{z^2+t^2}\right) - \Psi\!\left(\sqrt{z^2+(\hat{w}_0+t)^2}\right) \right]    (13)
Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition

Yan Zhu1, Xu Zhao1, Yun Fu2, and Yuncai Liu1

1 Shanghai Jiao Tong University, Shanghai 200240, China
2 Department of CSE, SUNY at Buffalo, NY 14260, USA
Abstract. By extracting local spatial-temporal features from videos, many recently proposed approaches for action recognition achieve promising performance. The Bag-of-Words (BoW) model is commonly used in these approaches to obtain video-level representations. However, the BoW model roughly assigns each feature vector to its closest visual word, therefore inevitably causing nontrivial quantization errors and impairing further improvements on classification rates. To obtain a more accurate and discriminative representation, in this paper, we propose an approach for action recognition by encoding local 3D spatial-temporal gradient features within the sparse coding framework. In so doing, each local spatial-temporal feature is transformed to a linear combination of a few “atoms” in a trained dictionary. In addition, we also investigate the construction of the dictionary under the guidance of transfer learning. We collect a large set of diverse video clips of sport games and movies, from which a set of universal atoms that compose the dictionary is learned by an online learning strategy. We test our approach on the KTH dataset and the UCF sports dataset. Experimental results demonstrate that our approach outperforms the state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF sports dataset.
1 Introduction
Recognizing human actions is of great importance in various applications such as human-computer interaction, intelligent surveillance and automatic video annotation. However, the diverse inner-class variations of human poses, occlusions, viewpoints and other exterior environments in realistic scenarios make it still a challenging problem for accurate action classification. Recently, approaches based on local spatial-temporal descriptors [1–3] have achieved promising performance. These approaches generally first detect or densely sample a set of spatial-temporal interest points from videos; then describe their spatial-temporal properties or local statistic characteristics within small cuboids centered at these interest points. To obtain global representations from sets of local features, the popular bag-of-words model [1, 4–6] is widely used incorporating with various local spatial-temporal descriptors. In the BoW model, an input video is viewed as an unordered collection of spatial-temporal words, each of which is quantized to its closest visual word existing in a trained
Fig. 1. Framework of our action recognition system. First the input video is transformed to a group of local spatial temporal features through dense sampling. Then the local features are encoded into sparse codes using the pre-trained dictionary. Finally, max pooling operation is applied over the whole sparse code set to obtain the final video representation.
dictionary by measuring a distance metric within the feature space. The video is finally represented as the histogram of the visual words occurrences. However, there is a drawback in its quantization strategy of the BoW model. Simply assigning each feature to its closest visual word can lead to relatively high reconstruction error; therefore the obtained approximation of the original feature is too coarse to be sufficiently discriminative for the following classification task. To address this problem, we propose an action recognition approach within sparse coding framework, which is illustrated in Fig. 1. Firstly we densely extract a set of local spatial-temporal descriptors with varying space and time scales from the input video. The descriptor we adopt is HOG3D [2]. We use sparse coding to encode each local descriptor into its corresponding sparse code according to the pre-trained dictionary. Then maximum pooling procedure is operated over the whole sparse code set of the video. The obtained feature vector, namely the final representation of the video, is the input of the classifier for action recognition. In the classification stage, we use multi-class linear support vector machine. Another issue addressed in this paper, is how to effectively model the distribution of the local spatial-temporal volumes over the feature space. To this end, constructing a well defined dictionary is a critical procedure in sparse coding framework. Generally a dictionary is built from a collection of training video clips with known labels similar to the test videos. However, a large set of labeled video data are required to get a well generalized dictionary. But it is a difficult task to conduct video annotation on such a large video database; and the generalization capability of the dictionary is degraded using homologous training and test data. In this work, motivated by previous works introducing transfer
learning into image classification [7, 8], we construct our dictionary by utilizing large volumes of unlabeled video clips, from which the dictionary can learn universal prototypes to facilitate the classification task. These video clips are widely collected from movies and sport games. Although these unlabeled video sequences may not be closely relevant to the actions to be classified on the semantic level, they can be used to learn a more generalized latent structure of the feature space. Its efficacy is demonstrated by extensive comparative experiments. The remainder of the paper is organized as follows. Section 2 reviews the previous related works in action recognition and sparse coding. Section 3 introduces our proposed action recognition approach and the implementation details. Section 4 describes the experimental setup, results, and discussion. Section 5 briefly concludes this paper and addresses the future work.
2 Related Work
Many existing approaches to action recognition extract video representations from a set of detected interest points [9–11]. Diverse local spatial-temporal descriptors have been introduced to describe the properties of these points. Although compact and computationally efficient, interest point based approaches discard much potentially useful information of the original data, and thus weaken the discriminative power of the representation. Recently, two evaluations confirmed that dense sampling methods can achieve better performance in action recognition tasks [12, 13]. In our work, we adopt the dense sampling method, but we use sparse coding instead of the BoW model to obtain a novel representation of videos. Sparse representation has been widely discussed recently and has achieved exciting progress in various fields including audio classification [14], image inpainting [15], segmentation [16] and object recognition [8, 17, 18]. These successes demonstrate that sparse representation can flexibly adapt to diverse low-level natural signals with desirable properties. Besides, research on the visual cortex has justified the biological plausibility of sparse coding [19]. For tasks of human action classification, the information of interest is inherently sparsely distributed in the spatial-temporal domain. Inspired by the above insights, we introduce sparse representations into the field of action recognition from video. To improve the classification performance, transfer learning has been incorporated into the sparse coding framework to obtain robust representations and learn knowledge from unlabeled data [8, 17, 20]. Raina et al. [8] proposed self-taught learning to construct the sparse coding dictionary using unlabeled irrelevant data. Yang et al. [17] integrated sparse coding with the traditional spatial pyramid matching method for object classification. Liu et al. [20] proposed a topographic subspace model for image classification and retrieval employing inductive transfer learning. Motivated by the above successes, our work explores the application of sparse coding with transfer learning in the video domain. Experimental results will validate that dictionary training with transferable knowledge from unlabeled data can significantly improve the performance of action recognition.
3 Approach

3.1 Local Descriptor and Sampling Strategy
Each video can be viewed as a 3D spatial-temporal volume. In order to capture sufficient discriminative information, we choose to densely sample a set of local features across the space-time domain throughout the input video volume. Each descriptor is computed within a small 3D cuboid centered at a space-time point. Multiple sizes of 3D cuboid are adopted to increase scale and speed invariance. We follow the dense sampling parameter settings as described in [13]. To capture local motion and appearance characteristics, we use the HOG3D descriptor proposed in [2]. The descriptor computation can be summarized as the following steps:

1. Smooth the input video volume to obtain the integral video representation.
2. Divide the sampled 3D cuboid centered at p = (xp, yp, tp) into nx × ny × nt cells and further divide each cell into Sb × Sb × Sb subblocks.
3. For each subblock bi, compute the mean gradient ḡbi = (ḡbi∂x, ḡbi∂y, ḡbi∂t) using the integral video.
4. Quantize each mean gradient ḡbi as qbi using a regular icosahedron.
5. For each cell cj, compute the histogram hcj of the quantized mean gradients over the Sb × Sb × Sb subblocks.
6. Concatenate the nx × ny × nt histograms into one feature vector. After ℓ2 normalization, the obtained vector xp is the HOG3D descriptor for p.

In Section 4, the parameter settings for sampling and descriptor computation will be discussed in detail.
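The dense, multi-scale sampling described above amounts to enumerating cuboid positions on a grid with 50% overlap; a small sketch follows, where the exact scale lists are those given later in Section 4.1 and the HOG3D computation itself is assumed to be delegated to the code of [2].

```python
import itertools

def dense_cuboids(width, height, n_frames, spatial_scales, temporal_scales, overlap=0.5):
    """Yield (x, y, t, s, tau) cuboids of spatial size s x s and temporal length tau,
    placed on an overlapping grid and kept only if they fit inside the video."""
    for s, tau in itertools.product(spatial_scales, temporal_scales):
        if s > width or s > height or tau > n_frames:
            continue                      # discard scales exceeding the video size
        step_xy = max(1, int(s * (1 - overlap)))
        step_t = max(1, int(tau * (1 - overlap)))
        for x in range(0, width - int(s) + 1, step_xy):
            for y in range(0, height - int(s) + 1, step_xy):
                for t in range(0, n_frames - int(tau) + 1, step_t):
                    yield x, y, t, s, tau
```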
3.2 Sparse Coding Scheme for Action Recognition
In our action recognition framework, sparse coding is used to obtain a more discriminative intermediate representation for human actions. Suppose we have obtained a set of local spatial-temporal features X = [x1, ..., xN] ∈ R^{d×N} to represent a video, where each feature is a d-dimensional column vector, and we have a well-trained dictionary D = [d1, ..., dS] ∈ R^{d×S}. The sparse coding method [8, 17, 18] manages to sparsely encode each feature vector in X into a linear combination of a few atoms of dictionary D by optimizing

\hat{Z} = \arg\min_{Z \in \mathbb{R}^{S \times N}} \frac{1}{2}\|X - DZ\|_2^2 + \lambda\|Z\|_1    (1)
where λ is a regularization parameter which determines the sparsity of the representation of each local spatial-temporal feature. The dictionary D is pre-trained to be an overcomplete basis set composed of S atoms, in which each atom is a d-dimensional column vector. Note that S is typically greater than 2d. To avoid numerical instability, each column of D is subject to the constraint ‖dk‖2 ≤ 1. Once the dictionary D is fixed, the optimization over Z alone is convex, and thus can be viewed as an ℓ1-regularized linear least squares problem. The twofold optimization goal ensures the least reconstruction error and the
sparsity of the coefficient set Ẑ simultaneously. We use the LARS-lasso implementation provided by [15] to find the optimal solution. After optimization, we get a set of sparse codes Ẑ = [ẑ1, ..., ẑN], where each column vector ẑi has only a few nonzero elements. It can also be interpreted as each descriptor only responding to a small subset of the dictionary D. To capture the global statistics of the whole video, we use a maximum pooling function [17, 18],

\beta = \xi_{\max}(\hat{Z}),    (2)

to pool over the sparse code set Ẑ, where ξmax returns a vector β ∈ R^S with the k-th element defined as

\beta_k = \max\{|\hat{Z}_{k1}|, |\hat{Z}_{k2}|, ..., |\hat{Z}_{kN}|\}.    (3)
Through maximum pooling, the obtained vector β is viewed as the final video-level feature. The maximum pooling operation has been successfully used in several image classification frameworks to increase spatial translation invariance [17, 18, 21]. Such a mechanism has been proven to be consistent with the properties of the cells in the visual cortex [21]. Motivated by the above insights, we adopt a similar procedure to increase both spatial and temporal translation invariance. Within the sparse code set, only the strongest response for each particular atom is preserved, without considering its spatial and temporal location. Experimental results demonstrate that the maximum pooling procedure can lead to a compact and discriminative final representation of the videos. Note that our previous choice of the dense sampling strategy provides sufficient low-level local features, which guarantees that the maximum pooling procedure is statistically reliable.
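A compact sketch of Eqs. (1)-(3) is given below, using scikit-learn's lasso-based sparse coder in place of the LARS-lasso solver of [15]; the dictionary shape convention follows scikit-learn (atoms as rows), so the matrices are transposed with respect to the text.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def encode_and_pool(X, D, lam):
    """X: (N, d) local HOG3D features; D: (S, d) dictionary with unit-l2-norm rows.
    Returns the video-level feature beta of Eqs. (2)-(3), a vector of length S."""
    Z = sparse_encode(X, D, algorithm="lasso_lars", alpha=lam)   # (N, S) sparse codes, Eq. (1)
    return np.abs(Z).max(axis=0)                                 # max pooling over the video, Eq. (3)
```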
3.3 Dictionary Construction Based on Transfer Learning
Under the sparse coding framework, constructing a dictionary for a specific classification task is essentially to learn a set of overcomplete bases to represent the basic pattern of the specific data distribution within the feature space. Given a large collection of local descriptors Y = [y1, ..., yM], the dictionary learning process can be interpreted as jointly optimizing with respect to the dictionary D and the coefficient set Z = [z1, ..., zM],

\arg\min_{Z \in \mathbb{R}^{S \times M},\, D \in \mathcal{C}} \; \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{2}\|y_i - Dz_i\|_2^2 + \lambda\|z_i\|_1 \right)    (4)
where C is defined as the convex set C ≜ {D ∈ R^{d×S} s.t. ‖dk‖2 ≤ 1, ∀k ∈ {1, ..., S}}. In the dictionary training stage, the optimization is not convex when both the dictionary D and the coefficient set Z are varying. How to solve the above optimization, especially for large training sets, is still an open problem. Recently, Mairal et al. [15] presented an online dictionary learning algorithm, which is much faster than previous methods and proven to be more suitable for large training sets, as required for our action recognition purpose. Due to the desirable properties mentioned above, we use this technique to train our dictionary.
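A hedged sketch of this training stage is shown below, using scikit-learn's MiniBatchDictionaryLearning, which follows the online algorithm of Mairal et al. [15]; the batch size is illustrative and the default λ mirrors the 1.2/√m value quoted in Section 4.1, not a value prescribed by this code.

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

def train_dictionary(Y, n_atoms=4000, lam=1.2 / (960 ** 0.5)):
    """Y: (M, d) HOG3D features sampled from the (unlabeled) training clips.
    Returns a dictionary of shape (n_atoms, d) with unit-norm atoms."""
    learner = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=lam,
                                          batch_size=256,
                                          transform_algorithm="lasso_lars")
    learner.fit(Y)
    return learner.components_
```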
To discover the latent structure of the feature space for actions, we use a large set of unlabeled video clips collected from movies and sport games as the “learning material”, which is different from [22]. In recent years, transfer learning from unlabeled data to facilitate supervised classification tasks has sparked many successful applications in machine learning [8, 20]. Although these unlabeled video clips do not necessarily belong to the same classes as the test data, they are similar in that they all contain human motions sharing universal patterns. This transferable knowledge is helpful for our supervised action classification. In our experiments, we construct two dictionaries in different ways. The first one is trained with patches from the target classification videos and the second one is trained with unlabeled video clips. Experimental results will show that the second dictionary yields a higher recognition rate. This can be explained by the fact that the unlabeled data contain more diverse patterns of human actions, which help the dictionary to thoroughly discover the nature of human action. In contrast, a dictionary constructed merely from the training data can hardly grasp the universal basis because of the insufficient information provided by the training data, which makes the dictionary unable to encode the test data sparsely and accurately. It is interesting to observe that the dictionary constructed from unlabeled data can potentially be utilized in other relevant action classification tasks. The transferable knowledge makes the dictionary more universal and reusable.
3.4 Multi-class Linear SVM
In the classification stage, we use a multi-class support vector machine with linear kernels, as described in [17]. Given a training video set {(βi, yi)}_{i=1}^{n} for an L-class classification task, where βi denotes the feature vector of the i-th video and yi ∈ Y = {1, ..., L} denotes the class label of βi, we adopt the one-against-all method to train L linear SVMs, which learn L linear functions {Wc β | c ∈ Y}. The trained SVMs predict the class label yj for a test video feature βj by solving

y_j = \arg\max_{c \in \mathcal{Y}} W_c \beta_j.    (5)
Note that traditional histogram-based action recognition models usually need specifically designed nonlinear kernels, which lead to time-consuming computations for classifier training. In contrast, a linear-kernel SVM operating on sparse coding statistics achieves satisfactory accuracy at much higher speed.
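A minimal sketch of this classification stage using scikit-learn's LinearSVC (one-vs-rest linear SVMs); the regularization constant and the placeholder data are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# beta_train: (n_videos, S) max-pooled video features; y_train: labels 1..L
beta_train = np.random.randn(160, 4000)
y_train = np.random.randint(1, 7, size=160)

clf = LinearSVC(C=1.0)            # one-vs-rest linear SVMs; C is an assumption
clf.fit(beta_train, y_train)

beta_test = np.random.randn(10, 4000)
scores = clf.decision_function(beta_test)       # values of W_c beta per class
y_pred = clf.classes_[scores.argmax(axis=1)]    # same as clf.predict(beta_test)
```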
4 Experiments
We evaluate our approach on two benchmark human action datasets: the KTH action dataset [5] and the UCF Sports dataset [23]. Some sample frames from the two datasets are shown in Fig. 2. In addition, we evaluate the effects of two different dictionary construction strategies.
Fig. 2. Sample frames from KTH dataset (top row) and UCF sports dataset (middle and bottom rows)
4.1 Parameter Settings
Sampling and descriptor parameters. In the sampling stage, we extract 3D patches of varying sizes from the test video. The minimum size of a 3D patch is 18 pixels × 18 pixels × 10 frames. We employ the sampling setting suggested in [13]; specifically, we use eight spatial scales (18, 18√2, 36, 36√2, 72, 72√2, 144) and two temporal scales (10, 10√2). Patches with all possible combinations of temporal and spatial scales are densely extracted with 50% overlap. Note that we discard those patches whose spatial scales exceed the video resolution, e.g., 144 pixels × 144 pixels for KTH frames of 160 × 120 resolution. We calculate HOG3D features using the executable provided by the authors of [2] with default parameters, namely, number of supporting subblocks S^3 = 27 and number of histogram cells n_x = n_y = 4, n_t = 3; thus each patch corresponds to a 960-dimensional feature vector.

Dictionary training parameters. For dictionary construction, we set the dictionary size to 4000 empirically. In the dictionary learning stage, we extract 400000 HOG3D features from 500 video clips collected from movies and sports games. In selecting dictionary training video clips, we do not impose any semantic constraints on their contents except one criterion: every video clip must contain at least one moving subject. The ℓ1 regularization parameter λ is set to 1.2/√m, as suggested in [15], where m = 960 denotes the dimension of the original signal.
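The following sketch illustrates the dense multi-scale sampling grid described above; only the seven spatial scale values legible in the text are listed, and the integer rounding of the √2 scales is our assumption.

```python
def dense_patch_grid(width, height, n_frames):
    """Enumerate (x, y, t, sx, st) for densely sampled 3D patches with 50%
    overlap, discarding spatial scales that exceed the frame resolution
    (e.g. 144 > 120 on KTH). Rounding of the sqrt(2) scales is our guess."""
    spatial = [18, 25, 36, 51, 72, 102, 144]     # 18, 18*sqrt(2), 36, ...
    temporal = [10, 14]                          # 10 and 10*sqrt(2)
    patches = []
    for sx in spatial:
        if sx > width or sx > height:
            continue
        for st in temporal:
            for t in range(0, n_frames - st + 1, st // 2):    # 50% overlap
                for y in range(0, height - sx + 1, sx // 2):
                    for x in range(0, width - sx + 1, sx // 2):
                        patches.append((x, y, t, sx, st))
    return patches

print(len(dense_patch_grid(160, 120, 200)))      # a KTH-sized clip
```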
4.2 Performance on KTH Dataset
The KTH dataset is a benchmark dataset to evaluate various human action recognition algorithms. It consists of six types of human actions including
walking, jogging, running, boxing, hand waving and hand clapping. Each action is performed several times by 25 subjects under four different environment settings: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. Currently the KTH dataset contains 599 video clips in total. We follow the common experimental setup of [5], randomly dividing all the sequences into a training set (16 subjects) and a test set (9 subjects). We train a multi-class support vector machine using the one-against-all method. The experiment is repeated 100 times and the average accuracy over all classes is reported.

Table 1. Comparisons to previously published results on the KTH dataset
Methods          Average Precision   Experimental Setting
Niebles [4]      81.50%              leave-one-out
Jhuang [24]      91.70%              split
Fathi [25]       90.50%              split
Laptev [1]       91.80%              split
Bregonzio [26]   93.17%              leave-one-out
Kovashka [6]     94.53%              split
Our Method       94.92%              split
Table 1 shows that our proposed method achieves an average accuracy of 94.92%, which outperforms previously published results. Note that some of the above results [4, 26] were obtained under leave-one-out settings. The confusion matrix shown in Fig. 3 demonstrates that all classes are predicted with satisfactory precision except for the pair jogging and running. This is understandable, since these two actions look ambiguous even to human observers.
Fig. 3. Confusion matrix for the KTH dataset; the rows are the true labels and the columns the predicted ones. All reported results are averages over 100 rounds.
4.3 Performance on the UCF Sports Dataset
The UCF Sports dataset consists of 150 video clips belonging to 10 action classes, including diving, golf swinging, kicking, lifting, horse riding, running, skating, swinging (around high bars), and swinging (on the floor or on the pommel). All the video clips are collected from realistic sports broadcasts. Following [13, 23], we enlarge the dataset by horizontally flipping all the original clips. Similar to the KTH setting, we train a multi-class SVM in a one-against-all manner. To compare fairly with previous results, we also employ the leave-one-out protocol; specifically, we test each original video clip in turn while training on all of the remaining clips. The flipped version of the test video clip is excluded from the training set. We report the average accuracy over all classes. Experimental results demonstrate that our method achieves accuracy comparable with state-of-the-art techniques, as shown in Table 2. Fig. 4 shows the confusion matrix for the UCF dataset. Note that the UCF Sports dataset used by the authors listed in Table 2 differs slightly, since some videos have been removed from the original version due to copyright issues.
4.4 Dictionary Construction Analysis
We also explore different methods for dictionary construction on the two datasets and evaluate the corresponding effects on performance.

Table 2. Comparisons to previously published results on the UCF Sports dataset
Methods          Average Precision   Experimental Setting
Rodriguez [23]   69.20%              leave-one-out
Yeffet [27]      79.30%              leave-one-out
Wang [13]        85.60%              leave-one-out
Kovashka [6]     87.27%              leave-one-out
Our Method       84.33%              leave-one-out
Fig. 4. Confusion matrix of UCF dataset
Table 3. Comparison of different dictionary training sources on the KTH dataset
Dictionary Source   Average Sparsity   Standard Deviation   Classification Accuracy
Training Set        0.20%              0.65%                91.94%
Diverse Source      0.09%              0.54%                94.92%
Table 4. Comparison of different dictionary training sources on the UCF Sports dataset
Dictionary Source   Average Sparsity   Standard Deviation   Classification Accuracy
Training Set        0.09%              0.61%                82.09%
Diverse Source      0.06%              0.50%                84.33%
We train two dictionaries from different sources for each dataset. The first is trained on patches collected from the training set of the KTH or UCF Sports dataset, while the other is trained with patches extracted from diverse sources including movies and sports videos. All the parameter settings for training are the same. Results show that the dictionary trained with diverse sources yields higher accuracy than the one trained merely on the training sets. To further investigate the effect of dictionary sources, we calculate the sparsity of the sparse codes transformed from the original features. For each sparse code vector z ∈ R^S, we define the sparsity of z as

sparsity(z) = ||z||_0 / S,   (6)
where the zero-norm ||z||_0 denotes the number of nonzero elements in z. We calculate the average sparsity and the corresponding standard deviation over the whole sparse code set, as shown in Table 3 and Table 4. We use these two statistics to measure how suitable a dictionary is for the given dataset. We find that the diverse-source dictionary attains lower average sparsity, and its corresponding standard deviation is also lower than that of the training-set dictionary. Low sparsity ensures more compact representations, and a low standard deviation over the whole dataset indicates that the diverse-source dictionary is more universal across different local motion patterns. In contrast, the dictionary obtained from the training set yields less sparse codes and is more inclined to fluctuate. Although it can encode certain local features into extremely sparse form in some cases (probably due to overfitting to certain patterns), the overall sparsity of the whole sparse code set is still not comparable. This can be explained by the distribution disparity between training videos and test videos, which prevents the dictionary learned merely from the training set from precisely modeling the target subspace and further impairs the discriminative power of the final video representation. In contrast, the diverse-source dictionary captures common patterns from diversely distributed data. It generates a set of bases with higher robustness, and thus better models a subspace that correlates with the target data distribution. Table 3 and Table 4 help demonstrate this point.
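A short sketch of how the two statistics of Tables 3 and 4 can be computed from a set of sparse codes (variable names are ours, not the authors'):

```python
import numpy as np

def sparsity_stats(Z):
    """Z: (num_codes, S) sparse codes of all local features of a dataset.
    Returns the average sparsity ||z||_0 / S (Eq. 6) and its standard
    deviation, the two statistics reported in Tables 3 and 4."""
    S = Z.shape[1]
    per_code = np.count_nonzero(Z, axis=1) / S
    return per_code.mean(), per_code.std()

Z = np.random.randn(10000, 4000) * (np.random.rand(10000, 4000) < 0.001)
avg, std = sparsity_stats(Z)
print(f"average sparsity {avg:.4%}, std {std:.4%}")
```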
5 Conclusion
In this paper, we have proposed an action recognition approach using sparse coding and local spatial-temporal descriptors. To obtain high-level video representations from local features, we also propose using transfer learning and a maximum pooling procedure rather than the traditional histogram representation of the BoW model. Experimental results demonstrate that sparse coding can provide a more accurate representation with desirable sparsity, which strengthens the discriminative power and eventually helps improve the recognition accuracy. In future work, we would like to investigate the inner structure of the dictionary and the mutual relationships of different atoms, which will be helpful for exploring semantic representations. Supervised dictionary learning algorithms will also be a focus of our future research.

Acknowledgements. This work is supported by the 973 Key Basic Research Program of China (2011CB302203), the NSFC Program (60833009, 60975012) and the SUNY Buffalo Faculty Startup Funding.
References
1. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE CVPR (2008)
2. Kläser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: British Machine Vision Conference (2008)
3. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: ACM Multimedia, pp. 357–360 (2007)
4. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79, 299–318 (2008)
5. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR, pp. 32–36 (2004)
6. Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: IEEE CVPR (2010)
7. Ando, R.K., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6, 1817–1853 (2005)
8. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.: Self-taught learning: transfer learning from unlabeled data. In: International Conference on Machine Learning, pp. 759–766. ACM, New York (2007)
9. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
10. Laptev, I.: On space-time interest points. IJCV 64, 107–123 (2005)
11. Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: IEEE ICCV (2007)
12. Dikmen, M., Lin, D., Del Pozo, A., Cao, L., Fu, Y., Huang, T.S.: A study on sampling strategies in space-time domain for recognition applications. In: Advances in Multimedia Modeling, pp. 465–476 (2010)
13. Wang, H., Ullah, M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: British Machine Vision Conference (2009)
14. Grosse, R., Raina, R., Kwong, H., Ng, A.Y.: Shift-invariant sparse coding for audio classification. In: UAI (2007)
15. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: International Conference on Machine Learning, pp. 689–696. ACM, New York (2009)
16. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Discriminative learned dictionaries for local image analysis. In: IEEE CVPR (2008)
17. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE CVPR (2009)
18. Yang, J., Yu, K., Huang, T.S.: Supervised translation-invariant sparse coding. In: IEEE CVPR (2010)
19. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 3311–3325 (1997)
20. Liu, Y., Cheng, J., Xu, C., Lu, H.: Building topographic subspace model with transfer learning for sparse representation. Neurocomputing 73, 1662–1668 (2010)
21. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE T-PAMI 29, 411–426 (2007)
22. Taylor, G., Bregler, C.: Learning local spatio-temporal features for activity recognition. In: Snowbird Learning Workshop (2010)
23. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE CVPR (2008)
24. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: IEEE ICCV (2007)
25. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: IEEE CVPR (2008)
26. Bregonzio, M., Gong, S., Xiang, T.: Recognising action as clouds of space-time interest points. In: IEEE CVPR (2009)
27. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: IEEE ICCV (2009)
Multi-illumination Face Recognition from a Single Training Image per Person with Sparse Representation Die Hu, Li Song, and Cheng Zhi Institute of Image Communication and Information Processing, Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China {butterflybubble,song li,zhicheng}@sjtu.edu.cn
Abstract. In real-world face recognition systems, traditional face recognition algorithms often fail in the case of insufficient training samples. Recently, face recognition algorithms based on sparse representation have achieved promising results even in the presence of corruption or occlusion. However, a large, over-complete, and elaborately designed discriminative training set is still required to form the sparse representation, which seems impractical in the single training image per person problem. In this paper, we extend Sparse Representation Classification (SRC) to the one sample per person problem. We address this problem under variant lighting conditions by introducing relighting methods to generate virtual faces. In this way, a diverse and complete training set can be composed, which makes SRC more general. Moreover, we verify the recognition under different lighting environments by a cross-database comparison.
1 Introduction
In the past decades, the one sample per person problem has attracted significant attention in Face Recognition Technology (FRT) due to wide applications such as law enforcement scenarios and passport identification. It is defined as: given a stored database of faces with only one image per person, the goal is to identify a person from the database later in time under any different and unpredictable pose, lighting, etc., from an individual image [1]. However, when only one sample per person is available, many difficulties, e.g., unreliable parameter estimation, arise when traditional recognition methods are directly put into use. Recently, many research efforts have focused on applying sparse representation to computer vision tasks, since sparse signal representation has proven to be an extremely powerful tool for acquiring, representing, and compressing high-dimensional signals [2]. In FRT, Wright et al. [3] cast face recognition as the problem of finding the sparsest representation of the test image over the training set, which has demonstrated promising results even in the presence of corruption and occlusion. However, they have to provide a large over-complete and elaborately designed discriminative training set to form the sparse
representation. It is beyond doubt that this constraint poses a great challenge for the one sample per person problem. Details will be discussed in Section 3. In this paper, we propose an improved SRC method for the one sample per person recognition problem under different lighting environments. Different from traditional face recognition methods that focus on illumination compensation, we make full use of diverse and representative illumination information to generate novel realistic faces. Given a single input image, we can synthesize its image space, so that the test image forms an efficient sparse representation when projected onto the constructed space. By doing so, we not only satisfy the over-completeness constraint of the one sample per person problem, but also guarantee the discrimination capability of the training set by selecting varying illumination conditions. We conduct extensive experiments on public databases and, in particular, verify recognition under different lighting environments by a cross-database comparison, in which the classical illumination record does not come from the same database as the images to be recognized.
2 Preliminaries

2.1 SRC
As mentioned above, in FRT, [3] has cast the face recognition problem as finding the sparsest representation of the test image over the whole training set. The well-aligned n_i training images of individual i taken under varying illumination are stacked as columns of a matrix A_i = [a_{i,1}, a_{i,2}, ..., a_{i,n_i}] ∈ R^{m×n_i}, each normalized to unit ℓ2 norm. Then we can define a dictionary as the concatenation of all the N object classes, A = [A_1, A_2, ..., A_N] ∈ R^{m×n}. Here, n is the total number of training images and n = Σ_{i=1}^{N} n_i. Since images of the same face under varying illumination lie near a special low-dimensional subspace [4], a given test image y ∈ R^m can be represented by a sparse linear combination of the training set, y = Ax, where x is a coefficient vector whose entries are all zero except for those associated with the i-th object. However, in consideration of the gross error brought by partial corruption or occlusion, the formulation becomes y = Ax + e. Since the desired solution (x, e) is not only sparse but the sparsest solution of the system, it can be recovered by the following equation:

(x, e) = arg min ||x||_0 + ||e||_0   s.t.   y = Ax + e   (1)
Here, the ℓ0 norm counts the number of nonzero entries in a vector. However, the ℓ0 problem above is NP-hard. Based on the theoretical equivalence of ℓ0 and ℓ1 minimization, if the solution (x, e) sought is sparse enough, the authors instead seek the convex relaxation

min ||x||_1 + ||e||_1   s.t.   y = Ax + e   (2)
The above problem can be solved efficiently by linear programming methods in polynomial time. Subsequently, by analyzing to what extent the coefficients concentrate on one subject, we can judge which class the test image belongs to and whether it belongs to any subject in the training database.
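A minimal sketch of this SRC pipeline, using cvxpy for the ℓ1 program of Eq. (2) together with the usual class-wise residual rule; the solver choice, helper names and placeholder conventions are our assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def src_classify(A, labels, y):
    """SRC sketch: A is (m, n) with unit-norm training images as columns,
    labels gives the class of each column, y is the test image. Solves the
    convex relaxation of Eq. (2) and returns the class whose coefficients
    give the smallest reconstruction residual (a common SRC decision rule)."""
    m, n = A.shape
    x, e = cp.Variable(n), cp.Variable(m)
    cp.Problem(cp.Minimize(cp.norm1(x) + cp.norm1(e)),
               [A @ x + e == y]).solve()
    x_hat = np.asarray(x.value).ravel()
    best, best_r = None, np.inf
    for c in np.unique(labels):
        delta = np.where(labels == c, x_hat, 0.0)    # keep class-c entries only
        r = np.linalg.norm(y - A @ delta)
        if r < best_r:
            best, best_r = c, r
    return best
```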
2.2 Quotient Image
The Quotient Image (QI) [5] was first introduced by Riklin-Raviv and Shashua; it uses a color ratio to define an illumination-invariant signature image and enables rendering of the image space under varying illumination. The Quotient Image algorithm is formulated on the Lambertian model

I(x, y) = ρ(x, y) n(x, y)^T s,   (3)
where ρ(x, y) is the albedo (surface texture) of the face, n(x, y) is the surface normal of the object, and s is the point light source direction, which can be arbitrary. The quotient image Q_y of an object y against an object a is then defined by

Q_y(u, v) = ρ_y(u, v) / ρ_a(u, v) = [ρ_y(u, v) n(u, v)^T s_y] / [ρ_a(u, v) n(u, v)^T s_y].   (4)
Let s_1, s_2, s_3 be three linearly independent basis vectors, so that any point light source direction can be reconstructed as s_y = Σ_j x_j s_j. Hence

Q_y(u, v) = [ρ_y(u, v) n(u, v)^T s_y] / [ρ_a(u, v) n(u, v)^T Σ_j x_j s_j] = I_y / Σ_{j=1}^{3} I_j x_j.   (5)
Here, I_y is the image under illumination source direction s_y. If the illumination set {I_j}_{j=1}^{3} in (5) is devised in advance, the quotient image Q_y can be computed given a reference image of object y. Then, by taking the product of Q_y with different choices of I_j, the image space of the reference object y can be synthesized from a set of three linearly independent illumination images of other objects. Through the above method, given a single input image of an object and a database of images with varying illumination of other objects of the same general class, we can re-render the input image to simulate new illumination conditions using QI.
3 SRC for One Training Image per Person

3.1 Motivation for Synthesizing Virtual Samples
To extend the SRC method to the one sample per person problem, the first difficulty to handle is that we must provide over-complete training atoms. One way is to synthesize virtual samples from a single image. Furthermore, to give a better performance of SRC, the dictionary must be discriminative. We test the algorithm of [3] under the following conditions. The Extended Yale B, one of the renowned face databases, consists of 2,414 frontal faces of 38 individuals [4]. We choose the 30th to the 59th images of each individual in Extended Yale B for training, and the 1st to the 29th for testing, then down-sample all the images to 12 × 10 at a rate of 1/16, which is the same feature dimension as in [3]. The recognition rate is about 82.3%, which is much lower than the results of randomly selecting half of the images for training, reported in [3] to be between 92.1% and 95.6%. This is because of the uniform distribution of
Fig. 1. Left: Six groups of training sets in our experiment, shown in rows. In the top row (1), the first 10 consecutive images are selected. In row (2) we choose every other image, i.e., the 1st, 3rd, 5th images and so on. In row (6), we select one in every six consecutive images. Right: The recognition results corresponding to the left.
the training samples. Since all the images in this training set are collected under 64 different illuminations, the lighting condition varies gradually from one image to the next. To discuss how to design the training set for robustness, we conduct experiments as follows. For each individual, we select 10 images for training, with the others left for testing. We divide our training set into six groups. For the first group, the top 10 consecutive images are selected as the training dictionary. For the second, we select every other image. For the third, we select one in every three consecutive pictures, and so on. As shown in Figure 1(a), the six groups of images are displayed in rows, and the recognition results are given in Figure 1(b) respectively. It can be deduced that the more diverse the illumination conditions are, the better the recognition results. In our method, we make full use of the illumination environment information of an illumination record instead of traditional compensation approaches. We can generate complete and diverse training atoms from a single input image of an individual by devising a small but representative illumination record. Therefore, synthesizing virtual samples can not only satisfy the sparsity constraint but also form a discriminative dictionary for the one sample per person problem.
3.2 Our Approach of Recognition with One Training Image per Person
Suppose that only one image y_{s_i} (the reference image) is available for each individual i; first, we must generate a sufficient and robust training set for each object. Based on QI, given a database of varying illumination, we can generate new images of another subject under the same conditions. Let B_j be a matrix whose columns are the pictures of object j with albedo function ρ_j. The illumination set {B_j}_{j=1}^{K}, which contains only K (a small number of) other objects, can be elaborately designed manually to ensure its discrimination capability. From (5), if we know the correct coefficients x_j, we can get the quotient image Q_{y_i}. It is proved in [5] that the energy function

f(x̂) = (1/2) Σ_{j=1}^{K} |B_j x̂ − α_j y_{s_i}|^2   (6)
has a global minimum x̂ = x if the albedo ρ_y of an object y is rationally spanned by the illumination set. Recovering x from (6) is then only a least-squares problem. The global minimum x_0 of the energy function f(x̂) is

x_0 = Σ_{j=1}^{K} α_j v_j,   (7)
where

v_j = (Σ_{r=1}^{K} B_r^T B_r)^{-1} B_j^T y_{s_i},   (8)
and the coefficients α_j are determined up to a uniform scale as the solution of the following equations:

α_j y_{s_i}^T y_{s_i} − (Σ_{r=1}^{K} α_r v_r)^T B_j^T y_{s_i} = 0,   s.t.   Σ_j α_j = K,   (9)
for j = 1, ..., K. Then the quotient image Q_{y_i} (defined in (4)) is computed by

Q_{y_i} = y_{s_i} / (B x),   (10)

where

B = (1/K) Σ_{i=1}^{K} B_i   (11)
is the average of the illumination set. Then we can determine which class a given test sample y_t belongs to through SRC. For z ∈ R^n, δ_i(z) ∈ R^n is a new vector whose entries are zero except for those associated with the i-th class. One can approximate the given test sample by ŷ_i = A δ_i(ẑ). We then classify y_t based on these approximations by assigning it to the object class that minimizes the residual between y_t and ŷ_i:

r_i(y_t) = ||y_t − A δ_i(ẑ)||_2.   (12)
The whole procedure of our method is depicted in the flowchart of Figure 2, and Algorithm 1 summarizes the complete recognition details for the one sample per person problem.
Fig. 2. The whole recognition procedure

Algorithm 1. Single Face Recognition via Sparse Representation
1: Input: {B_j}_{j=1}^{K}, the illumination set, where each matrix contains n_i images (as its columns); y_{s_i}, the reference image of each individual i (a vector of size m); y_t, the test image. Align all the images in the R^m space.
2: Step: Solve equation (9). Compute x in (7) and generate the quotient image Q_{y_i} by (10) and (11). Synthesize the training space: for l = 1, ..., n_i and i = 1, ..., N, A_{i,l} = Q_{y_i} ⊗ B_l, where A_{i,l} stands for the l-th image of the i-th person, B_l is the l-th column of matrix B, and ⊗ denotes pixel-by-pixel multiplication.
3: Step: Stack all the generated matrices in A = [A_1, A_2, ..., A_N] ∈ R^{m×n}, and normalize the columns of A to unit ℓ2 norm. Solve the ℓ1 minimization problem

ẑ = arg min_z ||z||_1   s.t.   ||Az − y_t||_2 ≤ ε,   (13)

where ε is an optional error tolerance. Compute the residuals (12) for i = 1, ..., N.
4: Output: identity(y_t) = arg min_i r_i(y_t).
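The following NumPy sketch illustrates Step 2 of Algorithm 1 under our reading of Eqs. (7)–(11); in particular, Eq. (9) is solved here as a homogeneous system up to scale via SVD, which is an assumption of this illustration, not the authors' implementation.

```python
import numpy as np

def synthesize_training_images(B_list, y_ref, eps=1e-8):
    """B_list: K matrices of shape (m, n_i) whose columns are images of one
    bootstrap object under n_i illuminations; y_ref: (m,) reference image of
    the new individual. Returns n_i relit virtual images of that individual."""
    K = len(B_list)
    G = sum(B.T @ B for B in B_list)                       # sum_r B_r^T B_r
    V = [np.linalg.solve(G, B.T @ y_ref) for B in B_list]  # v_j of Eq. (8)

    # Eq. (9): C @ alpha = 0 with C[j, r] = delta_jr * ||y||^2 - v_r^T B_j^T y;
    # alpha is determined up to scale, fixed by sum(alpha) = K.
    yy = y_ref @ y_ref
    C = np.empty((K, K))
    for j, Bj in enumerate(B_list):
        w = Bj.T @ y_ref
        for r in range(K):
            C[j, r] = (yy if r == j else 0.0) - V[r] @ w
    alpha = np.linalg.svd(C)[2][-1]                        # (near-)null vector
    alpha *= K / (alpha.sum() + eps)                       # enforce sum = K

    x = sum(a * v for a, v in zip(alpha, V))               # Eq. (7)
    B_mean = sum(B_list) / K                               # Eq. (11)
    Q = y_ref / (B_mean @ x + eps)                         # Eq. (10), pixel-wise
    return Q[:, None] * B_mean                             # Q (x) B_l per column
```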
4 Experiments
In this section, we conduct a wide range of experiments on publicly available face recognition databases. We first verify our algorithm on the CMU-PIE database [6] and discuss its robustness in real-world applications. We then extend our method to color images. Finally, we put forward a cross-database synthesizing comparison.
4.1 Gray Scale Images Verification
First we test our algorithm on the CMU-PIE database. It consists of 2,924 frontal-face images of 68 individuals. The cropped and normalized 112 × 92 face images were captured under various laboratory-controlled lighting conditions. There are two folders with different illuminations, of which the folder with the
ambient lights off is called the PIE-illumination package, while the folder with the ambient lights on is named the PIE-light package. Each person appears under 43 different illumination conditions, and PIE-illumination contains 21 lighting environments. We randomly choose images of 10 individuals in PIE-light as the illumination set. Since there are 22 distinct illumination conditions, we synthesize images under the same lighting conditions given a reference image of each of the remaining persons. As shown in Figure 3, 22 different images are generated. Here we select the first image of each face as the unique reference image to obtain the quotient image and re-render the others, while the remaining images are all used for testing. In the recognition stage, the feature space dimension was set to 154. Since the precise choice of feature space is no longer critical [3], we simply down-sample the images at a ratio of 1/8. Our method achieves a recognition rate of 95.07%, which is a good result for the single training image problem.
Fig. 3. Virtual samples
One may argue that the reference image was deliberately well chosen, since the first image of each object was captured under a favorable illumination condition. However, in real-world tasks, especially law enforcement scenarios, we often have a record of each suspect with a legible photograph; therefore this choice of deliberately designed input image looks reasonable. We then verify the algorithm with different input images, as depicted in Figure 4. Here the leave-one-out scheme [7] is used; that is, each image acts as the template in turn and the others serve as the test set. As shown in the figure, the recognition result is influenced by the template we choose, and the recognition rate fluctuates between 82.35% and 95.07%. The templates that give the most and least satisfactory performance are presented below. In Figure 5, we can see that the images in columns (b) and (c), which perform worst at a rate around 82.35%, are images with shadows. This is due to the limitation of the Lambertian model, which does not take shadows into account. It can be expected that other algorithms, such as the 3D morphable model, would improve the performance.
4.2 Recognition with Color Images
The next experiment concerns color images. Given a color image represented by RGB channels, we can directly convert it to a gray scale image.
Fig. 4. Recognition Rate on CMU-PIE
Fig. 5. Templates. Column (a) has the most favorable illumination, which achieved a recognition result of 95.07%. Columns (b) and (c) perform worst, at a rate of 82.34%.
The gray scale image conserves all the lightness information, because the lightness of the gray is directly proportional to the brightness levels of the primaries. We can assume that the varying illumination only affects the gray-value distribution without affecting the hue and saturation components. At the same time, the value channel of the HSV representation, which describes the luminance (black-and-white) information, can be obtained from the RGB color space. As shown in Figure 6, Row (a) contains four color images with different illumination directions, Row (b) shows the images transformed from RGB into HSV space, and Row (c) presents the values of the V channel. This reveals that the V channel of HSV space preserves the luminance information.
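Since the V channel of HSV is simply the per-pixel maximum of R, G and B, the extraction step shown in Row (c) can be sketched as follows (placeholder image; our own illustration):

```python
import numpy as np

def value_channel(rgb):
    """rgb: (h, w, 3) color image. The V channel of the HSV representation is
    the per-pixel maximum of R, G and B, carrying the luminance information;
    hue and saturation are left untouched."""
    return rgb.astype(np.float32).max(axis=2)

img = (np.random.rand(112, 92, 3) * 255).astype(np.uint8)   # placeholder image
V = value_channel(img)
```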
Fig. 6. The V channel in HSV color space. The columns show four illumination conditions. Row (a) contains four color images in RGB space, Row (b) shows the images transformed from RGB into HSV space, and Row (c) presents the values of the V channel.
To illustrate how this works in recognition, we use the AR database. It is made up of over 4,000 frontal images of 126 individuals. For each person, 26 pictures were taken in two separate sessions named Session 1 and Session 2, of which 8 images are faces with only illumination change [8]. In our trial, we use the 4 illumination conditions of 10 arbitrary individuals in Session 1 as the illumination set. Then we randomly select one of the 8 images of each of the other individuals for training, and process them as described in Figure 7. Subsequently, we recognize the remaining faces among the 8 images of each individual after down-sampling all images to 154 features. The method achieves a recognition rate between 78.75% and 81.48%, even though we have only 4 training lighting conditions per person.
4.3 A Cross-Database Comparison
So far we have experimented with objects and their images from the same databases. Even though the objects chosen for training and testing are outside the illumination set, we still have the advantage that they are taken by the same camera in the same laboratory-controlled environment. Here we especially test our algorithm under cross-dictionary conditions.
Fig. 7. Color image synthesis procedure

Table 1. Cross-database performance
Ill. set          No. of Ills.   Reference image for training     Rec. rate
Extended Yale B   30             20th face of PIE-illumination    76.68%
Extended Yale B   20             20th face of PIE-illumination    72.41%
Extended Yale B   30             1st face of PIE-light            98.67%
Extended Yale B   20             1st face of PIE-light            98.18%
Specifically, in our experiment, a cross-database test means that the classical illumination record in the set does not come from the same database as the images to be recognized. In real-world scenarios, we often possess the illumination set in advance, but the images for training and identification usually come from a totally different collection environment. As shown in Table 1, in an attempt to generate an incoherent dictionary [9] in the language of signal representation, we select every other image of 10 random persons in Extended Yale B as the illumination set. Then all the faces in PIE-illumination are taken for testing. The 20th face of each person is the reference image used to generate training atoms, and the others are test cases. First, we pre-align all the images in both databases to 112 × 92 pixels, and the down-sampling rate is again 1/8; that is, the feature space we use is 154-dimensional. As a result, the recognition rate is 76.68%. We then decrease the number of illumination conditions to 20, and the rate is around 72.41%. Finally, images in PIE-light are used for testing instead of PIE-illumination following the same procedure, and the corresponding performance is recorded in the table. Several observations can be drawn from the table. First, our method can be used for the one sample per person problem in real-world scenarios where the images in the illumination set are not collected in the same environment as the images to be classified. In addition, illumination diversity of the training set improves the performance. Last but not least, the results suggest some tricks for choosing the reference image, since images in PIE-light clearly perform much better than those in PIE-illumination. This may be related to the ambient light, which makes the illumination distribution more uniform. In practice, however, we can control the environment condition of the reference image and apply some illumination compensation to it.
5 Conclusions
In this paper, we present a new method for one sample per person recognition based on sparse representation, which makes use of illumination diversity information instead of compensation pre-processing. This method not only satisfies the over-completeness constraint of the one training image per person problem, but also makes the training set discriminative. Our experiments on public databases achieve good performance, and a cross-database comparison is put forward. In the future, we will try to improve the virtual sample generation method with other techniques by taking pose variation into consideration. At the same time, details of dictionary design will be explored further as well.

Acknowledgement. This work was supported in part by NSFC (No. 60702044, No. 60632040, No. 60625103), MIIT of China (No. 2010ZX03004-003) and the 973 Program of China (No. 2010CB731401, No. 2010CB731406).
References
1. Tan, X.: Face recognition from a single image per person: A survey. Pattern Recognition 39, 1725–1745 (2006)
2. Wright, J.: Sparse representations for computer vision and pattern recognition. Proceedings of the IEEE (2009)
3. Wright, J.: Robust face recognition via sparse representation. IEEE Trans. (2008)
4. Georghiades, A.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. on Pattern Analysis and Machine Intelligence 23, 643–660 (2001)
5. Shashua, A.: The quotient image: Class-based re-rendering and recognition with varying illuminations. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 129–139 (2001)
6. Sim, T.: The CMU Pose, Illumination, and Expression (PIE) Database. In: The 5th International Conference on Automatic Face and Gesture Recognition (2002)
7. Wang, H.: Face recognition under varying lighting condition using self quotient image. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 819–824 (2004)
8. Martinez, A.: The AR face database. CVC Tech. Report (1998)
9. Donoho, D.: For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. Pure and Applied Math. 59, 797–829 (2006)
Human Detection in Video over Large Viewpoint Changes
Genquan Duan¹, Haizhou Ai¹, and Shihong Lao²
¹ Computer Science & Technology Department, Tsinghua University, Beijing, China
[email protected]
² Core Technology Center, Omron Corporation, Kyoto, Japan
[email protected]
Abstract. In this paper, we aim to detect humans in video over large viewpoint changes, which is very challenging due to the diversity of human appearance and motion across a wide spread of viewpoints compared with the common frontal viewpoint. We propose 1) a new feature called the Intra-frame and Inter-frame Comparison Feature to combine both appearance and motion information, 2) an Enhanced Multiple Clusters Boost algorithm to co-cluster the samples of various viewpoints and discriminative features automatically, and 3) a Multiple Video Sampling strategy to make the approach robust to human motion and frame rate changes. Due to the large number of samples and features, we propose a two-stage tree structure detector, using only appearance in the 1st stage and both appearance and motion in the 2nd stage. Our approach is evaluated on some challenging real-world scenes, the PETS2007 dataset, the ETHZ dataset and our own collected videos, which demonstrate the effectiveness and efficiency of our approach.
1 Introduction
Human detection and tracking have been intensively researched in computer vision in recent years due to their wide range of potential applications in real-world tasks such as driver-assistance systems and visual surveillance systems, in which real-time and high-accuracy performance is required. In this paper, our problem is to detect humans in video over large viewpoint changes, as shown in Fig. 1. It is very challenging due to the following issues: 1) a large variation in human appearance over wide viewpoint changes is caused by different poses, positions and views; 2) lighting changes and background clutter as well as occlusions make human detection much harder; 3) human motion is highly diverse, since any direction of movement is possible and the shape and size of a human in video may change as he moves; 4) video frame rates are often different; 5) the camera is sometimes moving. The basic idea for human detection in video is to combine both appearance and motion information. There are mainly three difficult problems to be explored in this detection task: 1) How to combine both appearance and motion information to generate discriminative features? 2) How to train a detector to cover such
Fig. 1. Human detection in video over large viewpoint changes. Samples of three typical viewpoints and corresponding scenes are given.
a large variation in human appearance and motion over wide viewpoint changes? 3) How to deal with changes in video frame rate or abrupt motion when using motion features over several consecutive frames? Viola and Jones [1] first made use of appearance and motion information in object detection, where they trained AdaBoosted classifiers with Haar features on two consecutive frames. Later, Jones and Snow [2] extended this work by proposing appearance filters, difference filters and shifted difference filters on 10 consecutive frames and using several predefined categories of samples. The approaches in [1][2] can solve the 1st problem, but still face the challenge of the 3rd problem. The approach in [2] can handle the 2nd problem to some extent, but since even a human sometimes cannot tell which predefined category a moving object belongs to, its application is limited, while the approach in [1] trains detectors by mixing all positives together. Dalal et al. [3] combined HOG descriptors and some motion-based descriptors to detect humans with possibly moving cameras and backgrounds. Wojek et al. [4] proposed to combine multiple and complementary feature types and incorporate motion information for human detection, which coped well with moving cameras and cluttered backgrounds and achieved promising results on humans with a common frontal viewpoint. In this paper, our aim is to design a novel feature that takes advantage of both appearance and motion information, and to propose an efficient learning algorithm that learns a practical detector with a rational structure even when the samples are tremendously diverse, handling the difficulties mentioned above in one framework. The rest of this paper is organized as follows. Related work is introduced in Sec. 2. The proposed feature (I²CF), the co-clustering algorithm (EMC-Boost) and the sampling strategy (MVS) are given in Sec. 3, Sec. 4 and Sec. 5 respectively, and they are integrated to handle human detection in video in Sec. 6. Experiments and conclusions are given in Sec. 7 and the last section respectively.
2 Related Work
In the literature, human detection in video can be divided roughly into four categories. 1) Detection in static images, as in [5][6][7]. APCF [5], HOG [6] and Edgelet [7] are defined on appearance only. APCF compares colors or gradient orientations of two squares in an image, which can describe the invariance of color and gradient of an object to some extent. HOG computes the oriented gradient distribution in a
rectangular image window. An edgelet is a short segment of a line or curve, predefined based on prior knowledge. 2) Detection over videos, as in [1][2]; both were already mentioned in the previous section. 3) Object tracking, as in [8][9]; some methods need manual initialization as in [8], and some work with the aid of detection as in [9]. 4) Detecting events or human behaviors. 3D volumetric features [10], which can be seen as 3D Haar-like features, are designed for event detection, and ST-patches [11] are used for detecting behaviors. Inspired by those works, we propose Intra-frame and Inter-frame Comparison Features (I²CFs) to combine appearance and motion information. Due to the large variation in human appearance over wide viewpoint changes, it is impossible to train a usable detector by taking the sample space as a whole. The solution is divide and conquer: cluster the sample space into subspaces during training. A subspace can then be dealt with as one class, so the difficulty lies mainly in clustering the sample space. An efficient way is to cluster the sample space automatically, as in [12][13][14]. The Clustered Boosting Tree (CBT) [14] splits the sample space automatically using the already learned discriminative features during the training process for pedestrian detection. Mixture of Experts (MoE) [12] jointly learns multiple classifiers and data partitions. It emphasizes local experts and is suitable when the input data can be naturally divided into homogeneous subsets, which is impossible even for a fixed viewpoint of a human, as shown in Fig. 1. MC-Boost [13] co-clusters images and visual features by simultaneously learning image clusters and boosting classifiers. A risk map, defined on the pixel-level distance between samples, is also used in [13] to reduce the search space of the weak classifiers. To solve our problem, we propose an Enhanced Multiple Clusters Boost (EMC-Boost) algorithm to co-cluster the sample space and discriminative features automatically, which combines the benefits of Cascade [15], CBT [14] and MC-Boost [13]. The selection of EMC-Boost instead of MC-Boost is discussed in Sec. 7. Our contributions are summarized in four aspects: 1) Intra-frame and Inter-frame Comparison Features (I²CFs) are proposed to combine appearance and motion information for human detection in video over large viewpoint changes; 2) an Enhanced Multiple Clusters Boost (EMC-Boost) algorithm is proposed to co-cluster the sample space and discriminative features automatically; 3) a Multiple Video Sampling (MVS) strategy is used to make our approach robust to human motion and video frame rate changes; 4) a two-stage tree structure detector is presented to fully mine the discriminative features of the appearance and motion information. The experiments in challenging real-world scenes show that our approach is robust to human motion and frame rate changes.
3 Intra-frame and Inter-frame Comparison Features

3.1 Granular Space
Our proposed discriminative feature is defined in granular space [16]. A granule is a square window patch in a grey image, represented as a triplet g(x, y, s), where (x, y) is the position and s is the scale. For instance, g(x, y, s)
indicates that the size of this granule is 2^s × 2^s and its top-left corner is at position (x, y) of the image. In an image I, it can be calculated as

g(x, y, s) = (1 / (2^s × 2^s)) Σ_{j=0}^{2^s−1} Σ_{k=0}^{2^s−1} I(x + k, y + j).   (1)
s is set to 0, 1, 2 or 3 in this paper, and the four typical granule scales are shown in Fig. 2 (a). In order to calculate the distance between two granules, the granular space G is mapped into a 3D space, where each element g ∈ G is mapped to a point γ, g(x, y, s) → γ(x + 2^s, y + 2^s, 2^s). The distance between two granules in G is defined to be the Euclidean distance between the two corresponding points, d(g_1, g_2) = d(γ_1, γ_2), where g_1, g_2 ∈ G and γ_1, γ_2 correspond to g_1, g_2 respectively:

d(γ_1(x_1, y_1, z_1), γ_2(x_2, y_2, z_2)) = sqrt((x_1 − x_2)^2 + (y_1 − y_2)^2 + (z_1 − z_2)^2).   (2)
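In practice the granule values of Eq. (1) are block means, so they can be read off an integral image in constant time; the sketch below is our own illustration and assumes the (x, y) = (column, row) convention.

```python
import numpy as np

def integral_image(I):
    """Zero-padded integral image, so any granule mean of Eq. (1) is O(1)."""
    return np.pad(np.cumsum(np.cumsum(I.astype(np.float64), 0), 1),
                  ((1, 0), (1, 0)))

def granule(ii, x, y, s):
    w = 1 << s                                     # block side 2^s
    total = ii[y + w, x + w] - ii[y, x + w] - ii[y + w, x] + ii[y, x]
    return total / (w * w)

I = np.random.randint(0, 256, (58, 58))
ii = integral_image(I)
assert np.isclose(granule(ii, 2, 4, 1), I[4:6, 2:4].mean())    # g(2, 4, 1)
```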
3.2 Intra-frame and Inter-frame Comparison Features (I²CFs)
Similar to the approach in [1], we consider two frames each time, the previous one and the latter one, from which two pairs of granules are extracted to fully capture the appearance and motion features of an object. An I²CF can be represented as a five-tuple c = (mode, g_1^i, g_1^j, g_2^i, g_2^j), which is also called a cell following [5]. The mode is Appearance mode, Difference mode or Consistent mode, and g_1^i, g_1^j, g_2^i and g_2^j are four granules. The first pair of granules, g_1^i and g_1^j, are from the previous frame to describe the appearance of an object. The second pair of granules, g_2^i and g_2^j, come from either the previous or the latter frame to describe either appearance or motion information. When the second pair comes from the previous frame, which means that both pairs are from the previous frame, the feature is an Intra-frame Comparison Feature (Intra-frame CF); when the second pair comes from the latter frame, the feature becomes an Inter-frame Comparison Feature (Inter-frame CF). Both kinds of comparison features are combined into the Intra-frame and Inter-frame Comparison Feature (I²CF).

Appearance mode (A-mode). The Pairing Comparison of Color (PCC) feature is proved to be simple, fast and efficient in [5]. As PCC can describe the invariance of color to some extent, we extend this idea to 3D space. A-mode compares two pairs of granules simultaneously:

f_A(g_1^i, g_1^j, g_2^i, g_2^j) = g_1^i ≥ g_1^j && g_2^i ≥ g_2^j.   (3)

The PCC feature is a special case of A-mode: f_A(g_1^i, g_1^j, g_2^i, g_2^j) = g_1^i ≥ g_1^j when g_1^i == g_2^i and g_1^j == g_2^j.

Difference mode (D-mode). D-mode compares the absolute differences of two pairs of granules, defined as:

f_D(g_1^i, g_1^j, g_2^i, g_2^j) = |g_1^i − g_2^i| ≥ |g_1^j − g_2^j|.   (4)
Fig. 2. Our proposed I²CF. (a) Granular space with four scales (s = 0, 1, 2, 3) of granules, from [5]. (b) Two granules g1 and g2 connected by a solid line form one pair of granules, as used in APCF [5]. (c) Two pairs of granules are used in each cell of an I²CF. The solid line between g1 and g2 (or g3 and g4) means that g1 and g2 (or g3 and g4) come from the same frame. The dashed line connecting g1 and g3 (or g2 and g4) means that the locations of g1 and g3 (or g2 and g4) are related. This relation of locations is shown in (d); for example, g3 is in the neighborhood of g1. This constraint reduces the feature pool a lot but still preserves the discriminative weak features.
The motion filters in [1][2] calculate the difference between one region and a shifted one obtained by moving it up, left, right or down by 1 or 2 pixels in the second frame. There are three main differences between D-mode and those methods: 1) the restriction on the locations of these regions is defined spatially and is much looser; 2) D-mode considers two pairs of regions each time; 3) the only operation of D-mode is a comparison after subtractions.

Consistent mode (C-mode). C-mode compares the sums of two pairs of granules to take advantage of consistent information in the appearance of one frame or successive frames, defined as:

f_C(g_1^i, g_1^j, g_2^i, g_2^j) = (g_1^i + g_2^i) ≥ (g_1^j + g_2^j).   (5)
C-mode is much simpler and can be computed quickly compared with 3D volumetric features [10] and spatial-temporal patches [11]. An I²CF of length n is represented as {c_0, c_1, ..., c_{n−1}}, and its feature value is defined as the binary concatenation of the corresponding cell functions in reverse order, f_{I²CF} = [b_{n−1} b_{n−2} ... b_1 b_0], where b_k = f(mode, g_1^i, g_1^j, g_2^i, g_2^j) for 0 ≤ k < n, and

f(mode, g_1^i, g_1^j, g_2^i, g_2^j) =
  f_A(g_1^i, g_1^j, g_2^i, g_2^j), mode = A,
  f_D(g_1^i, g_1^j, g_2^i, g_2^j), mode = D,   (6)
  f_C(g_1^i, g_1^j, g_2^i, g_2^j), mode = C.
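A sketch of how one I²CF value could be evaluated from its cells; the `granule` accessor and the calling convention for intra- vs. inter-frame CFs are our assumptions.

```python
def cell_response(frame1, frame2, cell, granule):
    """One cell c = (mode, g1i, g1j, g2i, g2j) of an I2CF. The first pair is
    evaluated on frame1 (the previous frame); frame2 holds the second pair,
    i.e. frame1 again for an intra-frame CF or the latter frame for an
    inter-frame CF. granule(frame, (x, y, s)) returns the granule mean."""
    mode, g1i, g1j, g2i, g2j = cell
    a, b = granule(frame1, g1i), granule(frame1, g1j)
    c, d = granule(frame2, g2i), granule(frame2, g2j)
    if mode == 'A':
        return int(a >= b and c >= d)         # Eq. (3)
    if mode == 'D':
        return int(abs(a - c) >= abs(b - d))  # Eq. (4)
    return int(a + c >= b + d)                # Eq. (5), C-mode

def i2cf_value(frame1, frame2, cells, granule):
    """Binary concatenation f = [b_{n-1} ... b_1 b_0] of the cell responses."""
    return sum(cell_response(frame1, frame2, c, granule) << k
               for k, c in enumerate(cells))
```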
3.3 Heuristic Learning I²CFs
Feature reduction. For 58 × 58 samples, there are Σ_{s=0}^{3} (58 − 2^s + 1) × (58 − 2^s + 1) = 12239 granules in total, and the feature pool contains 3 × 12239^4 ≈ 6.7 × 10^16 weak features without any restrictions, which makes the training time and
memory requirements impractical. Using the distance between two granules defined in Sec. 3.1, two effective constraints are introduced into I²CF: 1) motivated by [5], the first pair of granules in an I²CF is constrained by d(g_1^i, g_1^j) ≤ T_1; 2) considering the consistency within one frame or between two nearby video frames, we constrain the second pair of granules in an I²CF to lie in the neighborhood of the first pair, as shown in Fig. 2 (d):

d(g_1^i, g_2^i) ≤ T_2,   d(g_1^j, g_2^j) ≤ T_2.   (7)

We set T_1 = 8 and T_2 = 4 in our experiments.

Table 1. Learning algorithm of I²CF
Input: Sample set S = {(x_i, y_i) | 1 ≤ i ≤ m} where y_i = ±1.
Initialize: Cell space (CS) with all possible cells and an empty I²CF.
Output: The learned I²CF.
Loop:
– Learn the first pair of granules as in [5]. Denote the best f pairs as a set F.
– Construct a new set CS': in each cell of CS', the first pair of granules comes from F, the second pair of granules is generated according to Eq. 7, and its mode is A-mode, D-mode or C-mode. Calculate the Z value of the I²CF obtained by adding each cell in CS'.
– Select the cell with the lowest Z value, denoted c*. Add c* to the I²CF.
– Refine the I²CF by replacing one or two granules in it without changing the mode.
Heuristic learning. Learning starts with an empty I²CF. At each step, we select the most discriminative cell and add it to the I²CF. The discriminability of a weak feature is measured by the Z value, which reflects the classification power of the weak classifier as in [17]:

Z = 2 Σ_j sqrt(W_+^j W_−^j),   (8)

where W_+^j is the weight of positive samples that fall into the j-th bin and W_−^j is that of the negatives. The smaller the Z value, the more discriminative the weak feature. The learning algorithm of I²CF is summarized in Table 1 (see [5][16] for more details).
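The Z value of Eq. (8) can be computed from the weighted histograms of feature responses, e.g. (our own sketch):

```python
import numpy as np

def z_value(pos_vals, pos_w, neg_vals, neg_w, n_bins):
    """Z = 2 * sum_j sqrt(W+_j * W-_j) of Eq. (8). pos_vals/neg_vals are the
    integer feature values of the positive/negative samples and pos_w/neg_w
    their boosting weights; the smaller Z, the more discriminative the feature."""
    wp = np.bincount(pos_vals, weights=pos_w, minlength=n_bins)
    wn = np.bincount(neg_vals, weights=neg_w, minlength=n_bins)
    return 2.0 * np.sqrt(wp * wn).sum()
```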
4 EMC-Boost
We propose EMC-Boost to co-cluster the sample space and discriminative features automatically. A perceptual clustering problem is shown in Fig. 3 (a)-(c). EMC-Boost consists of three components: the Cascade Component (CC), the Mixed Component (MC) and the Separated Component (SC). These three components are combined to form EMC-Boost. In fact, SC is similar to MC-Boost [13], which is why our boosting algorithm is named EMC-Boost. In the following, we first formulate the three components explicitly, then describe their learning algorithms, and summarize EMC-Boost at the end of this section.
Fig. 3. A perceptual clustering problem in (a)-(c) and a general EMC-Boost in (d) where CC, MC and SC are three components of EMC-Boost
4.1 Three Components CC/MC/SC
CC deals with a standard 2-class classification problem that can be solved by any boosting algorithm. MC and SC deal with K clusters. We formulate the detectors of MC or SC as K strong classifiers, each of which is a linear combination of weak learners, H_k(x) = Σ_t α_{kt} h_{kt}(x), k = 1, ..., K, with a threshold θ_k (default 0). Note that the K classifiers H_k(x), k = 1, ..., K, are the same in MC except for the K different thresholds θ_k, i.e., H_1(x) = H_2(x) = ... = H_K(x), but they are totally different in SC. We present MC and SC uniformly below. The score y_{ik} of the i-th sample belonging to the k-th cluster is computed as y_{ik} = H_k(x_i) − θ_k. Therefore, the probability of x_i belonging to the k-th cluster is P_{ik}(x_i) = 1 / (1 + e^{−y_{ik}}). To aggregate the scores of one sample over the K classifiers, we use a Noisy-OR formulation as in [18][13]:

P_i(x) = 1 − Π_{k=1}^{K} (1 − P_{ik}(x_i)).   (9)
The cost function is defined as J = Π_i P_i^{t_i} (1 − P_i)^{1−t_i}, where t_i ∈ {0, 1} is the label of the i-th sample; maximizing it is equivalent to maximizing the log-likelihood

log J = Σ_i [ t_i log P_i + (1 − t_i) log(1 − P_i) ].   (10)
4.2 Learning Algorithms of CC/MC/SC
The learning algorithm of CC is directly Real AdaBoost [17]. The learning algorithms of MC and SC are different from that of CC. At the t-th round of boosting, MC and SC learn weak classifiers to maximize Σ_{k} Σ_{i} w_{ki} h_{kt}(x_i) and Σ_{i} w_{ki} h_{kt}(x_i) respectively. Initially, the sample weights are: 1) for positives, w_{ki} = 1 if x_i ∈ k and w_{ki} = 0 otherwise, where i denotes the i-th sample and k denotes the k-th cluster or classifier; 2) for all negatives we set w_{ki} = 1/K. Following the AnyBoost method [19], we set the sample weights as the derivative of the cost function w.r.t. the classifier score. The weight of the k-th classifier over the i-th sample is updated by

w_{ki} = ∂log J / ∂y_{ki} = ((t_i − P_i) / P_i) P_{ki}(x_i).   (11)
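A compact sketch of the Noisy-OR aggregation (Eq. 9) and the weight update (Eq. 11); array shapes and names are our assumptions.

```python
import numpy as np

def update_weights(scores, thresholds, labels):
    """Eqs. (9)-(11): scores[i, k] = H_k(x_i), thresholds[k] = theta_k,
    labels[i] = t_i in {0, 1}. Returns the Noisy-OR probability P_i of every
    sample and the new weight matrix w[k, i] for the next boosting round."""
    y = scores - thresholds                      # y_ik = H_k(x_i) - theta_k
    P_ik = 1.0 / (1.0 + np.exp(-y))              # per-cluster probability
    P_i = 1.0 - np.prod(1.0 - P_ik, axis=1)      # Noisy-OR, Eq. (9)
    w = ((labels - P_i) / np.clip(P_i, 1e-12, None))[:, None] * P_ik  # Eq. (11)
    return P_i, w.T                              # w[k, i]
```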
We sum up the training algorithms of MC and SC in Table 2.

Table 2. Learning algorithms of MC and SC
Input: Sample set S = {(x_i, y_i) | 1 ≤ i ≤ m} where y_i = ±1; detection rate r in each layer.
Output: H_k(x) = Σ_t α_{kt} h_{kt}(x), k = 1, ..., K.
Loop: For t = 1, ..., T
(MC)
– Find weak classifiers h_t (h_{kt} = h_t, k = 1, ..., K) that maximize Σ_{k=1}^{K} Σ_i w_{ki} h_{kt}(x_i).
– Find the weak-learner weights α_{kt} (k = 1, ..., K) that maximize Γ(H + α_{kt} h_{kt}).
– Update weights by Eq. 11.
(SC) For k = 1, ..., K
– Find weak classifiers h_{kt} that maximize Σ_i w_{ki} h_{kt}(x_i).
– Find the weak-learner weights α_{kt} that maximize Γ(H + α_{kt} h_{kt}).
– Update weights by Eq. 11.
Update thresholds θ_k (k = 1, ..., K) to satisfy detection rate r.
4.3 A General EMC-Boost
The three components of EMC-Boost have different properties. CC takes all samples as one cluster, while MC and SC consider the whole sample space to consist of multiple clusters. CC tends to distinguish positives from negatives, SC tends to cluster the sample space, and MC can do both at the same time, though not as accurately as CC in classifying positives from negatives or as SC in clustering. Compared with SC, one particular advantage of MC is sharing weak features among all clusters. We combine these three components into a general EMC-Boost, as shown in Fig. 3 (d), which contains five steps (note that Step 2 is similar to CBT [14]):

Step 1. CC learns a classifier for all samples considered as one category.
Step 2. The K-means algorithm clusters the sample space with the learned weak features.
Step 3. MC clusters the sample space coarsely.
Step 4. SC clusters the sample space further.
Step 5. CC learns a classifier for each cluster center.

5 Multiple Video Sampling
In order to deal with changes in video frame rate or the abrupt motion problem, we introduce a Multiple Video Sampling (MVS) strategy as illustrated in Fig. 4 (a) and (b). Considering five consecutive frames in (a), a positive sample is made up of two frames, where one is the first frame and the other is from the next four frames as shown in (b). In other words, one annotation corresponds to five consecutive frames and generates 4 positives. Some more positives are shown in Fig. 4 (c). Suppose that the original frame rate is R and the used positives consist of the 1st and the rth frames (r > 1); then the possible frame rate covered by the MVS strategy is R/(r − 1). If these positives are extracted from 30 fps videos, the trained detector is able to deal with 30 fps (30/1), 15 fps (30/2), 10 fps (30/3) and 7.5 fps (30/4) videos, where r is 2, 3, 4 and 5 respectively.
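A minimal sketch of the MVS pairing described above (ours; frame indexing and names are assumptions):

```python
def mvs_positives(frames, annotation_index, window=5):
    """Multiple Video Sampling: pair the annotated frame with each of the
    next window-1 frames, so one annotation yields window-1 positives.

    frames: list of frames; annotation_index: index of the annotated frame.
    With 30 fps training video, the pair (1st, r-th) covers an effective
    frame rate of 30 / (r - 1).
    """
    first = frames[annotation_index]
    positives = []
    for r in range(2, window + 1):                 # r = 2, ..., 5
        second = frames[annotation_index + r - 1]
        positives.append((first, second))
    return positives
```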
Fig. 4. The MVS strategy and some positive samples

Fig. 5. Two-Stage Tree Structure in (a) and an example in (b). The 1st stage uses appearance information and the 2nd stage uses appearance and motion; the number in each box gives the percentage of samples belonging to that branch.
6 Overview of Our Approach
We adopt EMC-Boost, selecting I²CF as weak features, to learn a strong classifier for multiple-viewpoint human detection, in which positive samples are obtained through the MVS strategy. Due to the large number of samples and features, it is difficult to learn a detector directly by a general EMC-Boost. We modify the detector structure slightly and propose a new structure containing two stages, called the two-stage tree structure, as shown in Fig. 5 (a) with an example in (b): in the 1st stage, it only uses appearance information for learning and clustering; in the 2nd stage, it uses both appearance and motion information for clustering first, and then for learning classifiers for all clusters.
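The routing implied by the two-stage tree structure could be wired up as in the following sketch (ours; the stage classifiers, learned with EMC-Boost over I²CF features in the paper, are abstracted as callables):

```python
class TwoStageTreeDetector:
    """Sketch of the two-stage tree structure (Fig. 5a): the 1st stage
    clusters and classifies with appearance features only, the 2nd stage
    with appearance and motion features. Only the routing is shown here.
    """

    def __init__(self, stage1, stage2_per_branch):
        # stage1: callable(appearance) -> (accept, branch_id)
        # stage2_per_branch: dict branch_id -> callable(appearance, motion) -> bool
        self.stage1 = stage1
        self.stage2_per_branch = stage2_per_branch

    def detect(self, appearance, motion):
        accept, branch = self.stage1(appearance)
        if not accept:
            return False
        return self.stage2_per_branch[branch](appearance, motion)
```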
7 Experiments
We carry out experiments to evaluate our approach by False Positives Per Image (FPPI) on several challenging real-world datasets: ETHZ, PETS2007 and our own collected dataset. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it a successful detection. Only one detection per annotation is counted as correct. For simplicity, the three typical viewpoints mentioned in Fig. 1 are referred to as Horizontal Viewpoint (HV), Slant Viewpoint (SV) and Vertical Viewpoint (VV), in turn from left to right.

Datasets. The datasets used in the experiments are the ETHZ dataset [20], the PETS2007 dataset [21] and our own collected dataset. The ETHZ dataset provides four video sequences, Seq.#0∼Seq.#3 (640×480 pixels at 15 fps). This dataset,
whose viewpoint is near HV, is recorded using a pair of cameras, and we only use the images provided by the left camera. The PETS2007 dataset contains 9 sequences, S00∼S08 (720×576 pixels at 30 fps); each sequence has 4 fixed cameras and we choose the 3rd camera, whose viewpoint is near SV. There are 3 scenarios in PETS2007, with increasing scene complexity: loitering (S01 and S02), attended luggage removal (S03, S04, S05 and S06) and unattended luggage (S07 and S08). In the experiments, we use S01, S03, S05 and S06. In addition, we have collected several sequences with hand-held DV cameras: 2 sequences near HV (853×480 pixels at 30 fps), 2 sequences near SV (1280×720 pixels at 30 fps) and 8 sequences near VV (2 sequences of 1280×720 pixels at 30 fps and the others of 704×576 pixels at 30 fps).

Training and testing datasets. S01, S03, S05 and S06 of PETS2007 and our own dataset are manually labeled every five frames for training and testing, while the ETHZ dataset already provides the ground truth. The training datasets contain Seq.#0 of ETHZ, S01, S03 and S06 of PETS2007, and 2 sequences (near HV), 2 sequences (near SV) and 6 sequences (near VV) of ours. The testing datasets contain Seq.#1, Seq.#2 and Seq.#3 of ETHZ, S05 of PETS2007, and 2 sequences of ours (near VV). Note that the ground truths of the intermediate unlabeled frames in the testing datasets are obtained through interpolation. The properties of the testing datasets may affect all detectors, such as camera motion state (fixed or moving) and illumination conditions (slight or significant light changes). Details about the testing datasets are summarized in Table 3.

Training detectors. We have labeled 11768 different humans in total and obtain 47072 positives after MVS, where the numbers of positives near HV, SV and VV are 18976, 19248 and 8848 respectively. The size of positives is normalized to 58 × 58. Some positives are shown in Fig. 4. We train a detector based on EMC-Boost, selecting I²CF as features.

Implementation details. We cluster the sample space into 2 clusters in the 1st stage and cluster the two subspaces into 2 and 3 clusters separately in the 2nd stage, as illustrated in Fig. 5 (b). When do we start and stop MC/SC? When the false positive rate is less than $10^{-2}$ ($10^{-4}$) in the 1st (2nd) stage, we start MC, and we start SC after learning by MC. Before describing when to stop MC or SC, we first define transferred samples.

Table 3. Some details about the testing datasets

Description     Seq. 1     Seq. 2     S05        Seq.#1     Seq.#2     Seq.#3
Source          ours       ours       PETS2007   ETHZ       ETHZ       ETHZ
Camera          Fixed      Fixed      Fixed      Moving     Moving     Moving
Light changes   Slightly   Slightly   Slightly   Slightly   Slightly   Significantly
Frame rate      30 fps     30 fps     30 fps     15 fps     15 fps     15 fps
Size            704×576    704×576    720×576    640×480    640×480    640×480
Frames          2420       1781       4500       999        450        354
Annotations     591        1927       17067      5193       2359       1828
Fig. 6. Evaluation of our approach and some results: Recall vs. FPPI curves on Test Sequence 1 (ours, 2420 frames, 591 annotations), Test Sequence 2 (ours, 1718 frames, 1927 annotations), S05 (PETS2007, 4500 frames, 17067 annotations) and Seq.#1-#3 of ETHZ, comparing Intra-frame CF + Boost, Intra-frame CF + EMC-Boost and I²CF + EMC-Boost, with Ess et al., Schwartz et al. and Wojek et al. (HOG, IMHwd and HIKSVM) additionally shown on the ETHZ sequences.
A sample is called transferred if it belongs to another cluster after the current round of boosting. We stop MC (SC) when the number of transferred samples is less than 10% (2%) of the total number of samples.

Evaluation. To compare with our approach (denoted as I²CF + EMC-Boost), two other detectors are trained: one adopts Intra-frame CF learned by a general Boost algorithm like [5][15] (denoted as Intra-frame CF + Boost) and the other adopts Intra-frame CF learned by EMC-Boost (denoted as Intra-frame CF + EMC-Boost). Note that due to the large number of Inter-frame CFs, the large number of positives and memory limits, it is impractical to learn a detector of Inter-frame CF by Boost or EMC-Boost. We compare our approach with the Intra-frame CF + Boost and Intra-frame CF + EMC-Boost approaches on the PETS2007 dataset and our own collected videos, and also with [4][20][22] on the ETHZ dataset. We give the ROC curves and some results in Fig. 6. In general, our proposed approach, which integrates appearance and motion information, is superior to the Intra-frame CF + Boost and Intra-frame CF + EMC-Boost approaches, which only use appearance information. From another viewpoint, this experiment also indicates that incorporating motion information improves detection significantly, as in [4]. Compared to [20][22], our approach is better. Furthermore, we have not used any additional cues like depth maps, ground-plane estimation or occlusion reasoning, which are used in [20]. "HOG, IMHwd and HIKSVM", proposed in [4], combines the HOG feature [18] and the Internal Motion Histogram wavelet difference (IMHwd) descriptor [3] using a histogram intersection kernel SVM (HIKSVM), and achieves better results than the other approaches evaluated in [4]. Our approach is not as good as, but comparable to, "HOG, IMHwd and HIKSVM", and we argue that our approach is much simpler and faster. Currently, our approach takes about 0.29 s on ETHZ, 0.61 s on PETS2007 and 0.55 s on our dataset to process one frame on average.
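Under the evaluation criterion of Section 7 (a detection is correct if its overlap with a ground-truth box exceeds 50% of their union, and only one detection per annotation counts), a per-frame matching could look like the following sketch (ours, not the authors' evaluation code). Recall is then the total number of true positives divided by the number of annotations, and FPPI is the total number of false positives divided by the number of frames.

```python
def evaluate_frame(detections, annotations, iou_thresh=0.5):
    """Match detections to ground-truth boxes (x1, y1, x2, y2) with the
    intersection-over-union criterion; returns (#true pos, #false pos)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    matched = set()
    tp = 0
    for det in detections:
        best_j, best_iou = -1, iou_thresh
        for j, gt in enumerate(annotations):
            if j not in matched and iou(det, gt) > best_iou:
                best_j, best_iou = j, iou(det, gt)
        if best_j >= 0:          # one detection per annotation counts
            matched.add(best_j)
            tp += 1
    return tp, len(detections) - tp
```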
Fig. 7. The robustness evaluation of our approach to video frame rate changes, with Recall vs. FPPI curves on Test Sequences 1 and 2 of ours, S05 of PETS2007 and Seq.#1-#3 of ETHZ. 1 represents the original frame rate, 1/2 represents that the testing dataset is downsampled to 1/2 of the original frame rate, and similarly for 1/3, 1/4 and 1/5.
In order to evaluate the robustness of our approach to video frame rate changes, we downsample the videos to 1/2, 1/3, 1/4 and 1/5 of their original frame rate and compare with the original frame rate. Our approach is evaluated on the testing datasets with these frame rate changes, and the ROC curves are shown in Fig. 7. The results on the three sequences of the ETHZ dataset and S05 of PETS2007 are similar, but the results on our two collected sequences differ a lot. The main reasons are that: 1) as frame rates get lower, human motion changes more abruptly; 2) human motion in near-VV videos changes more drastically than that in near-HV or near-SV ones. Our two collected sequences are near VV, so human motion in them changes even more drastically than in the other testing datasets. Generally speaking, our approach is robust to video frame rate changes to a certain extent.

Discussions. We now discuss the choice of EMC-Boost instead of MC-Boost or a general Boost algorithm. MC-Boost runs several classifiers together at any time, which may be good for classification problems, but not for detection problems. Considering the shared features at the beginning of clustering and the good clusters after clustering, we propose more suitable learning algorithms, MC and SC, for detection problems. Furthermore, the risk map is an essential part of MC-Boost, in which the risk of one sample is related to predefined neighbors in the same class and in the opposite class, but a proper neighborhood definition itself might be a tough question. To preserve the merits of MC-Boost and avoid its shortcomings, we argue that EMC-Boost is more suitable for a detection problem. A general Boost algorithm can work well when the sample space has fewer variations, while EMC-Boost is designed to cluster the space into several subspaces, which can make the learning process faster. But after clustering, a sample is considered correct if it belongs to any subspace, and thus it may also bring in more negatives. This may be the reason that Intra-frame CF + EMC-Boost
is inferior to Intra-frame CF + Boost in Fig. 6. Mainly considering the clustering ability, we choose EMC-Boost rather than a general Boost algorithm. In fact, it is impractical to learn a detector of I²CF + Boost without clustering because of the large number of weak features and positives.
8 Conclusion
In this paper, we propose Intra-frame and Inter-frame Comparison Features (I²CFs), the Enhanced Multiple Clusters Boost algorithm (EMC-Boost), the Multiple Video Sampling (MVS) strategy and a two-stage tree structure detector to detect humans in video over large viewpoint changes. I²CFs combine appearance and motion information automatically. EMC-Boost can cluster a sample space quickly and efficiently. The MVS strategy makes our approach robust to frame rate and human motion. The final detector is organized as a two-stage tree structure to fully mine the discriminative features of the appearance and motion information. The evaluations on challenging datasets show the effectiveness of our approach.

There is some future work to be done. The large feature pool causes many difficulties during training; one direction is to design a more efficient learning algorithm. The MVS strategy makes the approach more robust to frame rate, but it cannot handle arbitrary frame rates once a detector is learned. To achieve better results, another direction is to integrate object detection in video with object detection in static images and object tracking. An interesting question for EMC-Boost is what kind of clusters it can obtain. Take humans for example: different poses, views, viewpoints or illumination make humans look different, and perceptual co-clusters of these samples differ under different criteria. The relation between the discriminative features and the samples is critical to the results, so another direction is to study the relations among features, objects and EMC-Boost. Our approach can also be applied to other object detection tasks, multiple-object detection or object categorization as well.

Acknowledgement. This work is supported by the National Science Foundation of China under grant No. 61075026, and it is also supported by a grant from Omron Corporation.
References

1. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: IEEE International Conference on Computer Vision (ICCV) (2003)
2. Jones, M., Snow, D.: Pedestrian detection using boosted features over many frames. In: International Conference on Pattern Recognition (ICPR), Motion, Tracking, Video Analysis (2008)
3. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
4. Wojek, C., Walk, S., Schiele, B.: Multi-cue onboard pedestrian detection. In: CVPR (2009)
5. Duan, G., Huang, C., Ai, H., Lao, S.: Boosting associated pairing comparison features for pedestrian detection. In: 9th Workshop on Visual Surveillance (2009)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
7. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In: ICCV (2005)
8. Yang, M., Yuan, J., Wu, Y.: Spatial selection for attentional visual tracking. In: CVPR (2007)
9. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: CVPR (2008)
10. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV (2005)
11. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: CVPR (2005)
12. Jordan, M., Jacobs, R.: Hierarchical mixture of experts and the EM algorithm. Neural Computation 6, 181–214 (1994)
13. Kim, T.K., Cipolla, R.: MCBoost: Multiple classifier boosting for perceptual co-clustering of images and visual features. In: Advances in Neural Information Processing Systems (NIPS) (2008)
14. Wu, B., Nevatia, R.: Cluster boosted tree classifier for multi-view, multi-pose object detection. In: ICCV (2007)
15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001)
16. Huang, C., Ai, H., Li, Y., Lao, S.: Learning sparse features in granular space for multi-view face detection. In: IEEE International Conference on Automatic Face and Gesture Recognition (2006)
17. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 297–336 (1999)
18. Viola, P., Platt, J., Zhang, C.: Multiple instance boosting for object detection. In: NIPS (2005)
19. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent. In: Proc. Advances in Neural Information Processing Systems (2000)
20. Ess, A., Leibe, B., Gool, L.V.: Depth and appearance for mobile scene analysis. In: ICCV (2007)
21. PETS2007, http://www.cvg.rdg.ac.uk/PETS2007/
22. Schwartz, W.R., Kembhavi, A., Harwood, D., Davis, L.S.: Human detection using partial least squares analysis. In: ICCV (2009)
Adaptive Parameter Selection for Image Segmentation Based on Similarity Estimation of Multiple Segmenters

Lucas Franek and Xiaoyi Jiang

Department of Mathematics and Computer Science, University of Münster, Germany
{lucas.franek,xjiang}@uni-muenster.de
Abstract. This paper addresses the parameter selection problem in image segmentation. Most segmentation algorithms have parameters which are usually fixed beforehand by the user. Typically, however, each image has its own optimal set of parameters, and in general a fixed parameter setting may result in unsatisfactory segmentations. In this paper we present a novel unsupervised framework for automatically choosing parameters based on a comparison with the results from some reference segmentation algorithm(s). The experimental results show that our framework is even superior to a supervised selection method based on ground truth. The proposed framework is not bound to image segmentation and can potentially be applied to solve the adaptive parameter selection problem in other contexts.
1 Introduction
The parameter selection problem is a fundamental and unsolved problem in image segmentation. Most segmentation algorithms have parameters, which for example control the scale of grouping and are typically fixed by the user. Although some appropriate range of parameter values can be well shared by a large number of images, the optimal parameter setting within that range varies substantially for each individual image. In the literature the parameter selection problem is mostly either not considered at all or solved by learning a fixed optimal parameter setting based on manual ground truth [1]. However, a fixed parameter setting may achieve good performance on average, but cannot reach the best possible segmentation for each individual image. It is thus desirable to have a means of adaptively selecting the optimal parameter setting on a per-image basis. Unsupervised parameter selection depends on some kind of self-judgement of segmentation quality, which is not an easy task [2, 3]. In this paper a novel framework of unsupervised automatic parameter selection is proposed. Given the fundamental difficulty of the self-judgement of segmentation quality, we argue that the situation becomes easier when we consider some reference segmentations. A comparison of their results may give us a hint of good segmentations.
The paper is organized as follows. In Section 2 we briefly review different existing approaches to the parameter selection problem. Our novel framework is detailed in Section 3. Experimental results, including the comparison with an existing supervised method, are shown in Section 4. We conclude in Section 5.
2 Parameter Selection Methods
Parameter selection methods can be classified into supervised and unsupervised approaches. Supervised methods use ground truth for the evaluation of segmentation quality. If ground truth is available, the parameter space can be sampled into a finite number of parameter settings and the segmentation performance can be evaluated for each parameter setting over a set of training images. The authors of [1] propose a multi-locus hill climbing scheme on a coarsely sampled parameter space. As an alternative, genetic search has been chosen in [4] to find the optimal parameter setting. Supervised approaches are appropriate for finding the best fixed parameter setting. However, they are applicable only when ground truth is available, which requires a considerable amount of manual work. In addition, the learned fixed parameter set allows good performance on average only, but cannot reach the best possible segmentation on a per-image basis.

Unsupervised parameter selection methods do not use ground truth, but instead are based on heuristics to measure segmentation quality. The authors of [5] design an application-specific segmentation quality function, and a search algorithm is applied to find the optimal parameter setting. A broad class of unsupervised algorithms uses characteristics such as intra-region uniformity or inter-region contrast; see [2, 3] for a review. They evaluate a segmented image based on how well a set of characteristics is optimized. In [3] it is shown that segmentations resulting from optimizing such characteristics do not necessarily correspond to segmentations perceived by humans. Another unsupervised approach to tackling the parameter selection problem is proposed in [6], based on an ensemble technique. A segmentation ensemble is produced by sampling a reasonable parameter subspace of a segmentation algorithm, and a combination solution is then computed from the ensemble, which is supposed to be better than most of the input segmentations.

The method proposed in this work is related to the stability-based approach proposed in [7]. There the authors consider the notion of stability of clusterings/segmentations and observe that the number of stable segmentations is much smaller than the total number of possible segmentations. Several parameter settings leading to stable results are returned by the method, and it is assumed that stable segmentations are good ones. In order to check for stability, the data has to be perturbed many times by adding noise and re-clustered in a further step. Similar to [7], we argue in our work that stable segmentations are good ones, but the realisation of this observation is quite different. In our work a reference segmenter is introduced to detect stable segmentations and there is no need for perturbing and re-clustering the data.

In this work we propose a novel framework for automatic parameter selection, which is fully unsupervised and does not need any ground truth for training.
Fig. 1. Left: Image from Berkeley dataset [8]. Right: its ground truth segmentation.
Our framework is based on the comparison of segmentations generated by different segmenters. To estimate the parameters for one segmenter, a reference segmenter is introduced which also has an unknown optimal parameter setting. The fundamental idea behind our work (to be detailed in the next section) is the observation that different segmentation algorithms generate different bad segmentations, but similar good segmentations. It should be emphasized that, in contrast to existing unsupervised methods, no segmentation quality function and no segmentation characteristics are used to estimate the optimal parameter setting.
3 Framework for Unsupervised Parameter Selection
In this section we explain the fundamental idea of our framework. Then, details are given to show how the framework can be further concretized.

3.1 Proposed Framework
Our framework is based on the comparison of segmentations produced by different segmentation algorithms. Our basic observation is that different segmenters tend to produce similar good segmentations, but dissimilar bad segmentations. This observation is reasonable because every segmenter is based on a different methodology; therefore it may be assumed that over- and under-segmentations of different segmenters look different and that potential misplacements of segment contours differ. Another interpretation is related to the space of all segmentations of an image. Generally, the subspace of bad segmentations is substantially larger than the subspace of good segmentations. Therefore, the segmentation diversity can be expected to appear in the subspace of bad segmentations. Based on our observation we propose to compare segmentation results of different segmenters and to figure out good segmentations by means of similarity tests. Furthermore, the notion of stability [7] is adjusted: a segmentation is denoted as stable if a reference segmenter yields a similar segmentation.

Illustrative example. Consider the image from the Berkeley dataset [8] and its ground truth segmentation in Fig. 1. We use two segmentation algorithms for this illustrative example: the graph-based image segmentation algorithm FH [9]
Fig. 2. Segmentations for each algorithm were obtained by varying parameters. (a) Segmentations 1-10 generated by FH, with NMI values 0.54, 0.72, 0.67, 0.70, 0.74, 0.70, 0.76, 0.74, 0.76, 0.76. (b) Segmentations 1-10 generated by UCM, with NMI values 0.52, 0.71, 0.78, 0.81, 0.82, 0.80, 0.80, 0.74, 0.72, 0.72.
and the segmenter UCM [10] based on the ultrametric contour map. For each segmenter 10 segmentations were generated by varying parameters (Fig. 2a,b). The segmentation algorithms and parameter settings will be detailed in Section 3.2. For performance evaluation purposes the normalized mutual information (NMI) (to be defined in Section 3.3) is used. The performance curves are shown in Fig. 3a. To obtain good parameters for FH we use UCM as the reference segmenter. Every segmentation result of FH is compared with every segmentation result of UCM using NMI as the similarity measure, resulting in the similarity matrix shown in Fig. 3b. This similarity matrix indicates which pairs of FH and UCM segmentations are similar or dissimilar. The maximal similarity is detected in column 9 and row 5, which corresponds to the ninth FH segmentation and the fifth UCM segmentation. And in fact, this segmentation for FH has the highest NMI value when compared with ground truth; see Fig. 3a. Note that the columns and rows in the similarity matrix deliver information about the first and second segmenter in the comparison, respectively.
Fig. 3. (a) Performance evaluation of the FH and UCM segmentations compared with ground truth, for parameter settings 1-10. (b) Similarity matrix for "FH – UCM", with the parameter settings for FH on one axis and for UCM on the other. The similarity matrix is generated by comparing pairs of segmentations of the two segmentation algorithms; NMI was used for similarity estimation.
In our example in Fig. 3b, the columns therefore represent the corresponding FH segmentations. Columns 1 to 6 are darker than columns 7 to 10. This means that FH segmentations 1 to 6 do not concur well with the UCM segmentations, which indicates that they tend to be bad segmentations. Conversely, FH can be used as the reference segmenter for obtaining the optimal parameter setting for UCM: the fifth UCM segmentation also has the highest NMI value (see Fig. 3a). Because of the highest values in the similarity matrix, the ninth FH segmentation and the fifth UCM segmentation are denoted as most stable.

Outline of the framework. The example above can be generalized to outline our framework as follows:

1. Compute for each segmentation algorithm N segmentations.
2. Compute an N × N similarity matrix by comparing each segmentation of the first algorithm with each segmentation of the second algorithm.
3. Determine the best parameter setting from the similarity matrix.

The first two steps are relatively straightforward; a sketch of them is given below. We only need to specify a reasonable parameter subspace for each involved segmentation algorithm and a similarity measure. The much more demanding last step requires a careful design and will be discussed in detail later. We have implemented a prototype to validate our framework, which will be detailed in the following three subsections.
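Steps 1 and 2 of the outline could be realized as in the following sketch (ours; function and parameter names are assumptions, and the segmenters and similarity measure are passed in as callables):

```python
import numpy as np

def similarity_matrix(segment_a, segment_b, params_a, params_b, image, similarity):
    """Run both segmenters over their sampled parameter settings and compare
    every pair of results.

    segment_a/b: callables (image, parameter) -> label map
    params_a/b:  the N sampled parameter settings of each segmenter
    similarity:  e.g. an NMI implementation (see Section 3.3)
    Returns an N x N matrix M with M[i, j] = similarity(A_i, B_j),
    where rows index the first segmenter in this sketch.
    """
    segs_a = [segment_a(image, p) for p in params_a]
    segs_b = [segment_b(image, p) for p in params_b]
    M = np.zeros((len(segs_a), len(segs_b)))
    for i, sa in enumerate(segs_a):
        for j, sb in enumerate(segs_b):
            M[i, j] = similarity(sa, sb)
    return M, segs_a, segs_b
```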
3.2 Segmentation Algorithms Used in Prototypical Implementation
We decided to use three different segmentation algorithms in our current prototypical implementation: the graph-based approach FH [9], the algorithm TBES [11] based on the MDL-principle, and UCM [10]. The reasons for our choice are: 1) algorithms like FH, TBES and UCM are state-of-the-art and frequently referenced in the literature; 2) their code is available; 3) they are sensitive to parameter selection. In contrast we found that the segmenter EDISON [12] is
not as sensitive to parameter changes as the selected algorithms, and a fixed parameter set leads in the majority of cases to the optimal solution (see [13]). For each segmentation algorithm every image of the Berkeley dataset is segmented 10 times by varying one free parameter, i.e. a reasonable one-dimensional parameter space is sampled into 10 equidistant parameter settings. FH has three parameters: a smoothing parameter (σ), a threshold function (k) and a minimum component size (min size). We have chosen to vary the smoothing parameter σ = 0.2, 0.4, . . . , 2.0 because it seemed to be the most sensitive parameter with respect to the segmentation results. The other parameters are fixed to the optimal setting (k = 350, min size = 1500), as is done in the supervised parameter learning approach. (The case of several free parameters will be discussed in Section 3.5.) The segmenter TBES has only one free parameter, the quantization error, which is set to 40, 70, . . . , 310. The method UCM consists of an oriented watershed transform to form initial regions from contours, followed by construction of an ultrametric contour map (UCM). Its only parameter is the threshold l, which controls the level of detail; we choose l = 0.03, 0.11, 0.19, 0.27, 0.35, 0.43, 0.50, 0.58, 0.66, 0.74.

3.3 Similarity Measure Used in Prototypical Implementation
In our framework a segmentation evaluation measure is needed for similarity estimation (and also for performance evaluation purposes). We use the same measure for both tasks in our experiments. The question of which measure is best for comparing segmentations remains difficult, and researchers try to construct measures which account for over-segmentation, under-segmentation, inaccurate boundary localization and different numbers of segments [13, 14]. For our work we decided to use two different measures, NMI and the F-measure, to demonstrate the performance.

NMI is an information-theoretic measure and was used in the context of image segmentation in [6, 15]. Let $S_a$ and $S_b$ denote two labellings, each representing one segmentation. Furthermore, $|S_a|$ and $|S_b|$ denote the number of groups within $S_a$ and $S_b$, respectively. Then NMI is formally defined by

$$\mathrm{NMI}(S_a, S_b) = \frac{\displaystyle\sum_{h=1}^{|S_a|} \sum_{l=1}^{|S_b|} |R_{h,l}| \log \frac{n \cdot |R_{h,l}|}{|R_h| \cdot |R_l|}}{\sqrt{\displaystyle\left(\sum_{h=1}^{|S_a|} |R_h| \log \frac{|R_h|}{n}\right)\left(\sum_{l=1}^{|S_b|} |R_l| \log \frac{|R_l|}{n}\right)}} \qquad (1)$$
where $R_h$ and $R_l$ are regions from $S_a$ and $S_b$, respectively, $R_{h,l}$ denotes the common part of $R_h$ and $R_l$, and $n$ is the image size. Its value domain is [0, 1], where a high (low) NMI value indicates high (low) similarity between two labellings. Therefore, a high (low) NMI value indicates a good (bad) segmentation. The boundary-based F-measure [16] is the weighted harmonic mean of precision and recall. Its value domain is also [0, 1] and a high (low) F-measure value indicates a good (bad) segmentation.
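A direct implementation of Eq. (1) from two integer label maps could look like this sketch (ours; degenerate single-region segmentations, which make the denominator zero, are not handled):

```python
import numpy as np

def nmi(seg_a, seg_b):
    """Normalized mutual information between two label images, Eq. (1)."""
    a = seg_a.ravel()
    b = seg_b.ravel()
    n = a.size
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # contingency table of region overlaps |R_{h,l}|
    contingency = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(contingency, (a_idx, b_idx), 1)
    r_h = contingency.sum(axis=1)                    # |R_h|
    r_l = contingency.sum(axis=0)                    # |R_l|
    nz = contingency > 0
    numer = (contingency[nz]
             * np.log(n * contingency[nz] / np.outer(r_h, r_l)[nz])).sum()
    denom = np.sqrt((r_h * np.log(r_h / n)).sum()
                    * (r_l * np.log(r_l / n)).sum())
    return numer / denom
```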
3.4 Dealing with Similarity Matrix
The essential component of our framework is the similarity matrix, which has the deciding influence on the overall performance. Below we discuss various design strategies to concretize the framework towards a real implementation.

Similarity measure with penalty term. In our discussion so far the similarity matrix is assumed to be built by computing a pairwise similarity measure such as NMI or the F-measure. In practice, however, we need a small extension with a penalty term for handling the over-segmentation problem. If a pair of segmentations tend to be over-segmented, they naturally will look similar. In the extreme case where each pixel is partitioned as a segment, two segmentations will be equal, leading to a maximal similarity value. Indeed, we observed that in some cases pairs of strongly over-segmented segmentations are most similar and dominate the similarity matrix. Thus, we need a means of correcting such forged similarity values. There are several possibilities to handle this fundamental problem. We found that simply counting the regions of each segmentation and adding a small penalty (0.1) to the performance measure if a threshold is reached is sufficient for our purpose. Note that this penalty term does not affect the rest of the computation of the similarity matrix.

Nonlocal maximum selection. In the illustrative example the best parameter setting was selected by detecting the maximal value in the similarity matrix. This strategy is suitable for introduction purposes, but may be too local to be really useful in practice. A first idea is to perform a weighted smoothing of the similarity matrix and then determine the maximum. Alternatively, as a second strategy, when searching for the best result for the first (second) segmenter we can compute for each column (row) one representative by averaging over the k maximal values in that column (row) and then search for the maximum over the column (row) representatives. In our study k was set to 5. In Fig. 4a,b the row/column representatives for UCM and FH are shown. Indeed, as explained in Section 3.1, columns particularly deliver information about the first segmenter and rows deliver information about the second segmenter in a comparison. Therefore, in general a low-valued column (row) may only eliminate bad segmentations of the first (second) segmenter. Extensive experiments showed that this second strategy delivers better results.

Using several reference segmenters. One can also think about combining different similarity matrices obtained from multiple reference segmenters. For instance, we can compute a combined result for TBES based on the similarity matrices from the comparisons "TBES – FH" and "TBES – UCM". In this case representatives of each column are computed and added (Fig. 4c). From the resulting column the maximal value is selected and the corresponding parameter is assumed to be the best one.
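Combining the penalty term and the nonlocal maximum selection, the choice of the best parameter setting could be sketched as follows (ours; the region-count threshold is an assumption, since the paper only states that a threshold and a 0.1 penalty are used, and we read "adding a penalty" as lowering the similarity of over-segmented results):

```python
import numpy as np

def select_parameter(M, segs_a, segs_b, k=5, penalty=0.1, max_regions=100):
    """Pick the parameter index for the first segmenter from the similarity
    matrix M (rows = first segmenter in this sketch): each row representative
    is the mean of its k largest entries, and over-segmented results are
    discouraged by a small penalty.
    """
    M = M.copy()
    for i, s in enumerate(segs_a):
        if np.unique(s).size > max_regions:
            M[i, :] -= penalty
    for j, s in enumerate(segs_b):
        if np.unique(s).size > max_regions:
            M[:, j] -= penalty
    reps = np.sort(M, axis=1)[:, -k:].mean(axis=1)   # nonlocal row representatives
    return int(np.argmax(reps))
```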
Fig. 4. (a) Row representatives for UCM obtained from the comparison "FH – UCM" by nonlocal maximum selection; (b) column representatives for FH obtained from the comparison "FH – TBES" by nonlocal maximum selection; (c) column representatives for TBES obtained by combining two reference segmenters (FH and UCM). The representatives in (a) and (b) are computed from the similarity matrices by averaging over the five best values in each row/column.
3.5 Handling Several Free Parameters
Our framework is able to work with an arbitrary number of free parameters, but for simplicity and better visualization we decided to use only one free parameter for each segmenter in our prototypical implementation as a first validation of the framework. However, we also want to point out briefly two strategies for handling several free parameters. A straightforward use of our approach for multiple parameters will increase computation time considerably. To solve this problem efficiently, one can think of an evolutionary algorithm. Given only one pair of segmentations, the evolutionary algorithm could start at the corresponding point of the similarity matrix and converge to an optimum. Several starting points could be considered to avoid getting trapped in a local optimum. In this way similarity matrices of higher dimension encoding several free parameters could be handled. Another solution could be a multi-scale approach, in which all sensitive parameters are considered, but only roughly sampled at the beginning. In the next steps local optimum areas are further partitioned and explored, as sketched below.
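One possible reading of the multi-scale idea for a single additional free parameter is sketched below (ours; evaluate() is a hypothetical callable returning the representative similarity for a parameter value):

```python
import numpy as np

def coarse_to_fine_search(evaluate, low, high, levels=3, samples=5):
    """Sample the parameter range coarsely, keep the best cell and refine it."""
    lo, hi = float(low), float(high)
    best_p = lo
    for _ in range(levels):
        grid = np.linspace(lo, hi, samples)
        values = [evaluate(p) for p in grid]
        i = int(np.argmax(values))
        best_p = float(grid[i])
        # shrink the search interval around the current optimum
        lo = grid[max(i - 1, 0)]
        hi = grid[min(i + 1, samples - 1)]
    return best_p
```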
4 Experimental Results
In this section we report some results of the current prototypical implementation. In particular, we compare our framework of adaptive parameter selection with supervised automated parameter training. For our experiments we use the Berkeley dataset. It provides several different ground truth segmentations for each image. Therefore, we decided to select the ground truth image with the maximum sum of distances to the other ground truth images, where NMI resp. the F-measure is used as the distance measure.

Supervised automated parameter training. We compute the average performance measure over all 300 images of the Berkeley dataset for each parameter setting candidate, and the one with the largest value is selected as the optimal fixed parameter setting. For this training the same parameter values as in Section 3.2 are used. Table 1 summarizes the average performance measure for each algorithm based on these fixed parameters and represents the performance of the supervised parameter setting.
Table 1. Average performance achieved by individual best and fixed parameter setting

                 NMI                  F-measure
         individual   fixed    individual   fixed
FH          0.65      0.60        0.53      0.47
TBES        0.69      0.64        0.59      0.54
UCM         0.70      0.66        0.64      0.60
For comparison purposes the best individual performance is also computed for each algorithm and each image; the values averaged over all images are also displayed in Table 1. Note that this performance measure is in some sense the limit the respective algorithm can achieve.

Adaptive parameter selection. As said before, for simplicity and better visualization we decided to adaptively select only one free parameter for each segmenter in our prototypical implementation as a first validation of our framework. The other parameters are taken from the supervised learning. To compare our framework with the supervised method, both approaches are applied to each image of the Berkeley dataset. NMI is used for both similarity estimation and performance evaluation. The resulting histograms are plotted in Fig. 5. For better comparison the performance measure values are centred for each image by the best individual performance, i.e. the differences NMI − NMI_best are histogrammed. For instance, the histogram in Fig. 5a displays the performance behavior of FH: 1) compared to the results obtained by the fixed learned parameter setting (the left bar in each bin); 2) supported by a single reference segmenter (TBES or UCM; the two middle bars in each bin); 3) supported by two reference segmenters (the right bar in each bin). The fixed parameter setting leads in 107 of 300 cases (blue bar in the rightmost bin) to the optimal solution. In 114 of all images the performance is about 0.05 below the optimal solution, and so on. The reference segmenter TBES alone lifts the number of optimal solutions from 107 to 190. A similar picture can be seen when using UCM as the reference segmenter. For FH, using both reference segmenters leads to the best result: the optimal parameter setting is achieved in 197 of 300 images. Note that an ideal parameter selection method would lead to a single bar located at zero in the histogram. The finding that the support of TBES, UCM, or both together substantially increases the performance of FH is perhaps not surprising, since TBES and UCM are better performing segmenters, see Table 1. But important is the fact that our framework can realize this intrinsic support technically.

Results for TBES and UCM are shown in Fig. 5b and Fig. 5c, respectively. For TBES the adaptive method (with both FH and UCM) achieves the optimal parameter setting in 192 of all images, in contrast to the fixed parameter in only 124. Interestingly, the weaker segmenter FH enables a jump from 124 optimal parameter settings to 183. UCM seems to be more stable with respect to the parameter from the supervised parameter learning. Nevertheless, also in this case a remarkable improvement can be observed.
Fig. 5. Performance histograms for FH (a), TBES (b) and UCM (c). NMI was used for evaluation.

Table 2. Average performance using NMI for similarity estimation and performance evaluation. "Ref.segm." specifies the reference segmenter.

         avg    ref.segm.:
                 FH      TBES     UCM     TBES & UCM   FH & UCM   FH & TBES
FH      -0.053    *     -0.036   -0.034     -0.027         *           *
TBES    -0.054  -0.035    *      -0.038        *         -0.033        *
UCM     -0.044  -0.041  -0.044     *           *            *        -0.033
To get a quantitative and comparable value, the average over the performance values in the histogram is computed (Table 2). It can be seen from the values that using two reference segmenters always yields the best results. Experiments are also executed using the F-measure for similarity estimation and performance evaluation (Table 3). In this case, too, using both reference segmenters yields the best results. Only for UCM does the supervised learning approach outperform our framework. This is due to the fact that for the F-measure the difference between UCM and the other segmenters is remarkable (Table 1), and for the reference segmenters it gets harder to optimize UCM. Nevertheless, it must be emphasized that by using both reference segmenters results close to the supervised learning approach are obtained without knowing ground truth. To further improve the results for UCM another comparable reference segmenter would be necessary.

4.1 Discussion
Our experiments confirmed the observation stated in Section 2: for the majority of images, different segmentation algorithms tend to produce similar good segmentations and dissimilar bad segmentations. This fundamental fact allowed us to show the superiority of our adaptive parameter selection framework compared to the supervised method based on parameter learning. From a practical point of view it is even more important to emphasize that ground truth is often not available and thus the supervised learning approach cannot be applied at all. Our framework may provide a useful tool for dealing with the omnipresent parameter selection problem in such situations.

Our approach depends on the assumption that the reference segmenter generates a good segmentation for at least one parameter setting.
Table 3. Average performance using F-measure for similarity estimation and performance evaluation. "Ref.segm." specifies the reference segmenter.

         avg    ref.segm.:
                 FH      TBES     UCM     TBES & UCM   FH & UCM   FH & TBES
FH      -0.063    *     -0.036   -0.037     -0.032         *           *
TBES    -0.050  -0.048    *      -0.043        *         -0.041        *
UCM     -0.043  -0.065  -0.052     *           *            *        -0.047
If it totally fails, there may be not much chance to recover a good segmentation from the similarity matrix. However, this is a weak requirement on the reference segmenter and does not pose a real problem in practice. We also want to emphasize that our approach substantially depends on how well the similarity between segmentations can be measured quantitatively. For almost optimal segmentations near ground truth this is not a big challenge, since all similarity measures can be expected to work well. In more practical cases the similarity measure should be chosen with care. The experimental results indicate that NMI and the F-measure are appropriate for this task.

The overall runtime of our current implementation mainly depends on the runtime of each segmenter. The complex segmenters TBES and UCM need a few minutes to segment an image, which is the most critical task (for FH the generation takes only a few seconds). The estimation of similarity takes a few seconds for NMI and a few minutes for the F-measure. Our framework is inherently parallel, and an optimized implementation would lead to a significant speed-up.
5 Conclusion
In this paper we have presented a novel unsupervised framework for adaptive parameter selection based on the comparison of segmentations generated by different segmentation algorithms. Fundamental to our method is the observation that bad segmentations generated by different segmenters are dissimilar. Motivated by this fact we proposed to adjust the notion of stability in the context of segmentation. Stable segmentations are found by using a reference segmenter. Experiments justified our observation and showed the superiority of our approach compared to supervised parameter training. In particular, such an unsupervised approach is helpful in situations without ground truth. In contrast to most existing unsupervised methods, our approach does not use any heuristic self-judgement characteristics for segmentation evaluation.

Although the framework has been presented and validated in the context of image segmentation, the idea behind it is general and can be adapted to solve the adaptive parameter selection problem in other tasks. This is indeed currently under study. In addition, we will further improve the implementation details of our framework, for instance the combination of multiple similarity measures and enhanced strategies for dealing with the similarity matrix. Using more than one reference segmenter has not been fully explored in this work yet and deserves further investigation as well.
References

1. Min, J., Powell, M.W., Bowyer, K.W.: Automated performance evaluation of range image segmentation algorithms. IEEE Trans. on Systems, Man, and Cybernetics, Part B 34, 263–271 (2004)
2. Chabrier, S., Emile, B., Rosenberger, C., Laurent, H.: Unsupervised performance evaluation of image segmentation. EURASIP J. Appl. Signal Process. (2006)
3. Zhang, H., Fritts, J.E., Goldman, S.A.: Image segmentation evaluation: A survey of unsupervised methods. Computer Vision and Image Understanding 110, 260–280 (2008)
4. Pignalberi, G., Cucchiara, R., Cinque, L., Levialdi, S.: Tuning range image segmentation by genetic algorithm. EURASIP J. Appl. Signal Process., 780–790 (2003)
5. Abdul-Karim, M.A., Roysam, B., Dowell-Mesfin, N., Jeromin, A., Yuksel, M., Kalyanaraman, S.: Automatic selection of parameters for vessel/neurite segmentation algorithms. IEEE Trans. on Image Processing 14, 1338–1350 (2005)
6. Wattuya, P., Jiang, X.: Ensemble combination for solving the parameter selection problem in image segmentation. In: Proc. of Int. Conf. on Pattern Recognition, pp. 392–401 (2008)
7. Rabinovich, A., Lange, T., Buhmann, J., Belongie, S.: Model order selection and cue combination for image segmentation. In: CVPR, pp. 1130–1137 (2006)
8. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. ICCV, vol. 2, pp. 416–423 (2001)
9. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Computer Vision 59, 167–181 (2004)
10. Arbelaez, P., Maire, M., Fowlkes, C.C., Malik, J.: From contours to regions: An empirical evaluation. In: CVPR, pp. 2294–2301. IEEE, Los Alamitos (2009)
11. Rao, S., Mobahi, H., Yang, A.Y., Sastry, S., Ma, Y.: Natural image segmentation with adaptive texture and boundary encoding. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5994, pp. 135–146. Springer, Heidelberg (2010)
12. Christoudias, C.M., Georgescu, B., Meer, P.: Synergism in low level vision. In: Int. Conf. on Pattern Recognition, pp. 150–155 (2002)
13. Unnikrishnan, R., Pantofaru, C., Hebert, M.: Toward objective evaluation of image segmentation algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 29, 929–944 (2007)
14. Monteiro, F.C., Campilho, A.C.: Performance evaluation of image segmentation. In: Int. Conf. on Image Analysis and Recognition, vol. (1), pp. 248–259 (2006)
15. Fowlkes, C., Martin, D., Malik, J.: Learning affinity functions for image segmentation: combining patch-based and gradient-based approaches. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 54–61 (2003)
16. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 530–539 (2004)
Cosine Similarity Metric Learning for Face Verification

Hieu V. Nguyen and Li Bai

School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
{vhn,bai}@cs.nott.ac.uk
http://www.nottingham.ac.uk/cs/
Abstract. Face verification is the task of deciding, by analyzing face images, whether a person is who he/she claims to be. This is very challenging due to image variations in lighting, pose, facial expression, and age. The task boils down to computing the distance between two face vectors. As such, appropriate distance metrics are essential for face verification accuracy. In this paper we propose a new method, named Cosine Similarity Metric Learning (CSML), for learning a distance metric for face verification. The use of cosine similarity in our method leads to an effective learning algorithm which can improve the generalization ability of any given metric. Our method is tested on the state-of-the-art dataset, the Labeled Faces in the Wild (LFW), and has achieved the highest accuracy in the literature.
Face verification has been extensively researched for decades. The reason for its popularity is its non-intrusiveness and wide range of practical applications, such as access control, video surveillance, and telecommunication. The biggest challenge in face verification comes from the numerous variations of a face image, due to changes in lighting, pose, facial expression, and age. It is a very difficult problem, especially when using images captured in a totally uncontrolled environment, for instance, images from surveillance cameras or from the Web. Over the years, many public face datasets have been created for researchers to advance the state of the art and make their methods comparable. This practice has proved to be extremely useful. FERET [1] is the first popular face dataset freely available to researchers. It was created in 1993 and since then research in face recognition has advanced considerably. Researchers have come very close to fully recognizing all the frontal images in FERET [2–6]. However, these methods are not robust enough to deal with non-frontal face images. Recently a new face dataset named the Labeled Faces in the Wild (LFW) [7] was created. LFW is a full protocol for evaluating face verification algorithms. Unlike FERET, LFW is designed for unconstrained face verification. Faces in LFW can vary in all possible ways due to pose, lighting, expression, age, scale, and misalignment (Figure 1).
Fig. 1. From FERET to LFW
Methods for frontal images cannot cope with these variations, and as such many researchers have turned to machine learning to develop learning-based face verification methods [8, 9]. One of these approaches is to learn a transformation matrix from the data so that the Euclidean distance can perform better in the new subspace. Learning such a transformation matrix is equivalent to learning a Mahalanobis metric in the original space [10]. Xing et al. [11] used semidefinite programming to learn a Mahalanobis distance metric for clustering. Their algorithm aims to minimize the sum of squared distances between similarly labeled inputs, while maintaining a lower bound on the sum of distances between differently labeled inputs. Goldberger et al. [10] proposed Neighbourhood Component Analysis (NCA), a distance metric learning algorithm especially designed to improve kNN classification. The algorithm learns a Mahalanobis distance by minimizing the leave-one-out cross validation error of the kNN classifier on a training set. Because it uses the softmax activation function to convert distances to probabilities, the gradient computation step is expensive. Weinberger et al. [12] proposed a method that learns a matrix designed to improve the performance of kNN classification. The objective function is composed of two terms. The first term minimizes the distance between target neighbours. The second term is a hinge loss that encourages target neighbours to be at least one distance unit closer than points from other classes. It requires information about the class of each sample. As a result, their method is not applicable to the restricted setting in LFW (see section 2.1). Recently, Davis et al. [13] have taken an information-theoretic approach to learn a Mahalanobis metric under a wide range of possible constraints and prior knowledge on the Mahalanobis distance. Their method regularizes the learned matrix to make it as close as possible to a known prior matrix. The closeness is measured as a Kullback-Leibler divergence between the two Gaussian distributions corresponding to the two matrices.

In this paper, we propose a new method named Cosine Similarity Metric Learning (CSML). There are two main contributions. The first contribution is
that we have shown cosine similarity to be an effective alternative to the Euclidean distance in the metric learning problem. The second contribution is that CSML can improve the generalization ability of an existing metric significantly in most cases. Our method is different from all the above methods in terms of the distance measure: all of the other methods use the Euclidean distance to measure the dissimilarities between samples in the transformed space, whilst our method uses cosine similarity, which leads to a simple and effective metric learning method. The rest of this paper is structured as follows. Section 2 presents the CSML method in detail. Section 3 presents how CSML can be applied to face verification. Experimental results are presented in Section 4. Finally, the conclusion is given in Section 5.
1 Cosine Similarity Metric Learning
The general idea is to learn a transformation matrix from training data so that cosine similarity performs well in the transformed subspace. The performance is measured by cross validation error (cve).

1.1 Cosine Similarity
Cosine similarity (CS) between two vectors x and y is defined as:

$$\mathrm{CS}(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$$
Cosine similarity has a special property that makes it suitable for metric learning: the resulting similarity measure is always within the range of −1 and +1. As shown in section 1.3, this property allows the objective function to be simple and effective.

1.2 Metric Learning Formulation
Let \{(x_i, y_i, l_i)\}_{i=1}^{s} denote a training set of s labeled samples with pairs of input vectors x_i, y_i ∈ R^m and binary class labels l_i ∈ {1, 0} which indicate whether x_i and y_i match or not. The goal is to learn a linear transformation A : R^m → R^d (d ≤ m), which we will use to compute cosine similarities in the transformed subspace as:

CS(x, y, A) = \frac{(Ax)^T (Ay)}{\|Ax\| \, \|Ay\|} = \frac{x^T A^T A y}{\sqrt{x^T A^T A x} \, \sqrt{y^T A^T A y}}
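For concreteness, a minimal sketch of this transformed-space similarity, assuming NumPy (the function name cs is illustrative, not from the paper):

import numpy as np

def cs(x, y, A):
    # CS(x, y, A) = (Ax)^T (Ay) / (||Ax|| ||Ay||)
    ax, ay = A @ x, A @ y
    return float(ax @ ay) / (np.linalg.norm(ax) * np.linalg.norm(ay))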
Specifically, we want to learn the linear transformation that minimizes the cross validation error when similarities are measured in this way. We begin by defining the objective function.
2.3 Objective Function
First, we define the positive and negative sample index sets Pos and Neg as:

Pos = \{ i \mid l_i = 1 \}, \qquad Neg = \{ i \mid l_i = 0 \}

Also, let |Pos| and |Neg| denote the numbers of positive and negative samples, so that |Pos| + |Neg| = s, the total number of samples. Now the objective function f(A) can be defined as:

f(A) = \sum_{i \in Pos} CS(x_i, y_i, A) - \alpha \sum_{i \in Neg} CS(x_i, y_i, A) - \beta \, \|A - A_0\|^2

We want to maximize f(A) with regard to matrix A given two parameters α and β, where α, β ≥ 0. The objective function can be split into two terms, g(A) and h(A), where

g(A) = \sum_{i \in Pos} CS(x_i, y_i, A) - \alpha \sum_{i \in Neg} CS(x_i, y_i, A)

h(A) = \beta \, \|A - A_0\|^2
The role of g(A) is to encourage the margin between positive and negative samples to be large. A large margin helps to reduce the training error. g(A) can be seen as a simple voting scheme over the samples; the reason we can treat votes from the samples equally is that the cosine similarity function is bounded by 1. Additionally, because of this simple form of g(A), we can optimize f(A) very fast (details in Section 2.4). The parameter α in g(A) balances the contributions of the positive and negative samples to the margin. In practice, α can be estimated using cross validation or simply set to |Pos|/|Neg|. In the case of LFW, because the numbers of positive and negative samples are equal, we simply set α = 1. The role of h(A) is to regularize matrix A to be as close as possible to a predefined matrix A_0, which can be any matrix. The idea is both to inherit good properties from A_0 and to reduce the training error as much as possible. If A_0 is carefully chosen, the learned matrix A can achieve a small training error and good generalization ability at the same time. The parameter β plays an important role here: it controls the trade-off between maximizing the margin (g(A)) and minimizing the distance from A to A_0 (h(A)). With the objective function set up, the algorithm is presented in detail in the next section.
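A minimal sketch of this objective, assuming NumPy; X and Y are s × m arrays of paired vectors and labels is a length-s 0/1 array (these names are illustrative, not from the paper):

import numpy as np

def cs(x, y, A):
    ax, ay = A @ x, A @ y
    return float(ax @ ay) / (np.linalg.norm(ax) * np.linalg.norm(ay))

def objective(A, A0, X, Y, labels, alpha, beta):
    # f(A) = sum_{Pos} CS - alpha * sum_{Neg} CS - beta * ||A - A0||^2
    scores = np.array([cs(x, y, A) for x, y in zip(X, Y)])
    pos = scores[labels == 1].sum()
    neg = scores[labels == 0].sum()
    return pos - alpha * neg - beta * np.sum((A - A0) ** 2)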
2.4 The Algorithm and Its Complexity
The idea is to use cross validation to estimate the optimal values of (α, β). In this paper, α is simply set to 1 and a suitable β is found using a coarse-to-fine search strategy, in which the search range decreases over time.
Algorithm 1. Cosine Similarity Metric Learning

INPUT:
– S = {(x_i, y_i, l_i)}_{i=1}^{s}: a set of training samples (x_i, y_i ∈ R^m, l_i ∈ {0, 1})
– T = {(x_i, y_i, l_i)}_{i=1}^{t}: a set of validation samples (x_i, y_i ∈ R^m, l_i ∈ {0, 1})
– d: dimension of the transformed subspace (d ≤ m)
– A_p: a predefined matrix (A_p ∈ R^{d×m})
– K: K-fold cross validation

OUTPUT: A_CSML, the output transformation matrix (A_CSML ∈ R^{d×m})

1. A_0 ← A_p
2. α ← |Pos| / |Neg|
3. Repeat
   (a) min_cve ← ∞  // store the minimum cross validation error
   (b) For each value of β  // coarse-to-fine strategy
       i. A* ← the matrix maximizing f(A) given (A_0, α, β), evaluated on S
       ii. If cve(T, A*, K) < min_cve then  // cve is computed by Algorithm 2
           A. min_cve ← cve(T, A*, K)
           B. A_next ← A*
   (c) A_0 ← A_next
4. Until convergence
5. A_CSML ← A_0
6. Return A_CSML
Algorithm 2. Cross validation error computation

INPUT:
– T = {(x_i, y_i, l_i)}_{i=1}^{t}: a set of validation samples (x_i, y_i ∈ R^m, l_i ∈ {0, 1})
– A: a linear transformation matrix (A ∈ R^{d×m})
– K: K-fold cross validation

OUTPUT: cross validation error

1. Transform all samples in T using matrix A
2. Partition T into K equal-sized subsamples
3. total_error ← 0
4. For k = 1 → K  // use subsample k as testing data and the other K−1 subsamples as training data
   (a) θ ← the optimal threshold on the training data
   (b) test_error ← error on the testing data
   (c) total_error ← total_error + test_error
5. Return total_error / K
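An illustrative sketch of Algorithm 2, assuming NumPy; the verification score is the transformed-space cosine similarity, and the threshold on the training folds is taken as the candidate score with the lowest error (names such as cve and labels are illustrative):

import numpy as np

def cs(x, y, A):
    ax, ay = A @ x, A @ y
    return float(ax @ ay) / (np.linalg.norm(ax) * np.linalg.norm(ay))

def cve(X, Y, labels, A, K):
    # K-fold cross validation error of thresholded cosine similarity scores
    scores = np.array([cs(x, y, A) for x, y in zip(X, Y)])
    match = (labels == 1)
    folds = np.array_split(np.arange(len(scores)), K)
    total_error = 0.0
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(len(scores)), test)
        # optimal threshold on the training folds
        candidates = scores[train]
        errors = [np.mean((scores[train] >= t) != match[train]) for t in candidates]
        theta = candidates[int(np.argmin(errors))]
        total_error += np.mean((scores[test] >= theta) != match[test])
    return total_error / K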
Algorithm 1 presents the proposed CSML method. It is easy to prove that when β goes to ∞, the optimized matrix A* approaches the prior A_0; in other words, the performance of the learned matrix A_CSML is guaranteed to be at least as good as that of A_0. In practice, the performance of A_CSML is significantly better in most cases (see Section 4). f(A) is differentiable with regard to matrix A, so we can optimize it using a gradient-based optimizer such as delta-bar-delta or conjugate gradients; we used the Conjugate Gradient method. The gradient of f(A) can be computed as follows:

\frac{\partial f(A)}{\partial A} = \sum_{i \in Pos} \frac{\partial CS(x_i, y_i, A)}{\partial A} - \alpha \sum_{i \in Neg} \frac{\partial CS(x_i, y_i, A)}{\partial A} - 2\beta (A - A_0)   (1)

\frac{\partial CS(x_i, y_i, A)}{\partial A} = \frac{\partial}{\partial A} \left( \frac{x_i^T A^T A y_i}{\sqrt{x_i^T A^T A x_i} \, \sqrt{y_i^T A^T A y_i}} \right) = \frac{\partial}{\partial A} \left( \frac{u(A)}{v(A)} \right) = \frac{1}{v(A)} \frac{\partial u(A)}{\partial A} - \frac{u(A)}{v(A)^2} \frac{\partial v(A)}{\partial A}   (2)

where

\frac{\partial u(A)}{\partial A} = A (x_i y_i^T + y_i x_i^T)   (3)

\frac{\partial v(A)}{\partial A} = \sqrt{\frac{y_i^T A^T A y_i}{x_i^T A^T A x_i}} \, A x_i x_i^T + \sqrt{\frac{x_i^T A^T A x_i}{y_i^T A^T A y_i}} \, A y_i y_i^T   (4)
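A small sketch of Eqs. (1)-(4), assuming NumPy, with u(A) = x^T A^T A y and v(A) = \|Ax\| \|Ay\| as in the text (function names are illustrative):

import numpy as np

def grad_cs(x, y, A):
    ax, ay = A @ x, A @ y
    u = float(ax @ ay)
    nx, ny = np.linalg.norm(ax), np.linalg.norm(ay)
    v = nx * ny
    du = A @ (np.outer(x, y) + np.outer(y, x))                                # Eq. (3)
    dv = (ny / nx) * (A @ np.outer(x, x)) + (nx / ny) * (A @ np.outer(y, y))  # Eq. (4)
    return du / v - (u / v ** 2) * dv                                         # Eq. (2)

def grad_f(A, A0, X, Y, labels, alpha, beta):
    g = -2.0 * beta * (A - A0)                                                # last term of Eq. (1)
    for x, y, l in zip(X, Y, labels):
        g = g + (grad_cs(x, y, A) if l == 1 else -alpha * grad_cs(x, y, A))
    return g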
From Eqs. (1)-(4), the complexity of computing the gradient of f(A) is O(s × d × m). As a result, the complexity of the CSML algorithm is O(r × b × g × s × d × m), where r is the number of iterations used to optimize A_0 repeatedly (line 3 in Algorithm 1), b is the number of values of β tested in the cross validation process (line 3b in Algorithm 1), and g is the number of steps in the Conjugate Gradient method.
3 Application to Face Verification
In this section, we show in detail how CSML can be applied to face verification on the LFW dataset.
3.1 LFW Dataset
The dataset contains more than 13,000 images of faces collected from the web. These images have a very large degree of variability in face pose, age, expression,
race and illumination. There are two evaluation settings defined by the authors of LFW: the restricted and the unrestricted setting. This paper considers the restricted setting, under which no identity information of the faces is given: the only information available to a face verification algorithm is a pair of input images, and the algorithm is expected to determine whether the two images come from the same person. The performance of an algorithm is measured by a 10-fold cross validation procedure; see [7] for details. There are three versions of LFW available: original, funneled and aligned. In [14], Wolf et al. showed that the aligned version deals with misalignment better than the funneled version, so we use the aligned version in all of our experiments.
3.2 Face Verification Pipeline
An overview of our method is presented in Figure 2. First, the two original images are cropped to a smaller size. Next, a feature extraction method is used to form feature vectors (X, Y) from the cropped images. These vectors are passed to PCA to obtain two dimension-reduced vectors (X_2, Y_2). Then CSML is used to transform (X_2, Y_2) to (X_3, Y_3) in the final subspace. The cosine similarity between X_3 and Y_3 is the similarity score between the two faces. Finally, this score is thresholded to determine whether the two faces are the same or not. The optimal threshold θ is estimated from the training set; specifically, θ is set so that the False Acceptance Rate equals the False Rejection Rate (see the sketch below). Each step is discussed in detail next.

Preprocessing. The original size of each image is 250 × 250 pixels. At the preprocessing step, we simply crop the image to remove the background, leaving an 80 × 150 face image. The next step after preprocessing is to extract features from the image.

Feature Extraction. To test the robustness of our method to different types of features, we carry out experiments on three facial descriptors: Intensity, Local Binary Patterns and Gabor Wavelets.
Fig. 2. Overview of the face verification process
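A minimal sketch of the threshold selection described above, assuming NumPy: θ is taken as the score at which the False Acceptance Rate and the False Rejection Rate are (approximately) equal on the training scores (the function name is illustrative):

import numpy as np

def eer_threshold(scores, labels):
    # scores: similarity scores on training pairs; labels: 1 = same person, 0 = different
    pos, neg = scores[labels == 1], scores[labels == 0]
    best_t, best_gap = None, np.inf
    for t in np.unique(scores):
        far = np.mean(neg >= t)   # fraction of non-matching pairs accepted
        frr = np.mean(pos < t)    # fraction of matching pairs rejected
        if abs(far - frr) < best_gap:
            best_t, best_gap = t, abs(far - frr)
    return best_t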
Intensity is the simplest feature extraction method: the feature vector is formed by concatenating all the pixels, giving a feature vector of length 12,000 (= 80 × 150). Local Binary Patterns (LBP) were first applied to face recognition in [15] with very promising results. In our experiments, the face is divided into non-overlapping 10 × 10 blocks and LBP histograms are extracted from all blocks to form the feature vector, whose length is 7,080 (= 8 × 15 × 59); a sketch of this construction is given at the end of this subsection. Gabor Wavelets [16, 17] with 5 scales and 8 orientations are convolved at pixels selected uniformly with a downsampling rate of 10 × 10; the length of the feature vector is 4,800 (= 5 × 8 × 8 × 15).

Dimension Reduction. Before applying any learning method, we use PCA to reduce the dimension of the original feature vector to a more tractable number. A thousand faces from the training data (different for each fold) are used to create the covariance matrix in PCA. We notice in our experiments that the specific value of the reduced dimension after applying PCA does not affect the accuracy very much as long as it is not too small.

Feature Combination. We can further improve the accuracy by combining different types of features. Features can be combined at the feature extraction step [18, 19] or at the verification step. Here we combine features at the verification step using an SVM [20]: applying CSML to each type of feature produces a similarity score, and these scores form a vector which is passed to the SVM for verification.
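An illustrative sketch of the LBP descriptor above, assuming NumPy and scikit-image, whose local_binary_pattern with the 'nri_uniform' mapping yields 59 codes for 8 neighbours (the function name is illustrative):

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature(face):
    # face: 80 x 150 grayscale array; 8 x 15 non-overlapping 10 x 10 blocks,
    # one 59-bin histogram per block, concatenated into 8 * 15 * 59 = 7080 dims
    codes = local_binary_pattern(face, P=8, R=1, method='nri_uniform')
    hists = []
    for i in range(0, 80, 10):
        for j in range(0, 150, 10):
            h, _ = np.histogram(codes[i:i + 10, j:j + 10], bins=59, range=(0, 59))
            hists.append(h)
    return np.concatenate(hists)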
3.3 How to Choose A_0 in CSML?
Because CSML improves on the accuracy of A_0, it is a good idea to choose a matrix A_0 that performs well by itself. Published work has shown that Whitened PCA (WPCA) with cosine similarity can achieve very good performance [3, 21]. Therefore, we propose to use the whitening matrix as A_0. Since we reduce the dimension from m to d, the whitening matrix has the rectangular form

A_{WPCA} = \begin{bmatrix} \lambda_1^{-1/2} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^{-1/2} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_d^{-1/2} & 0 & \cdots & 0 \end{bmatrix} \in R^{d \times m}

where λ_1, λ_2, ..., λ_d are the d largest eigenvalues of the covariance matrix computed in the PCA step. For comparison, we tried two other choices of A_0: non-whitening PCA and Random Projection,

A_{PCA} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 & \cdots & 0 \end{bmatrix} \in R^{d \times m}, \qquad A_{RP} = \text{a random matrix} \in R^{d \times m}.
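A small sketch of these three choices of A_0, assuming NumPy and that lam holds the eigenvalues from the PCA step sorted in decreasing order (the function name and arguments are illustrative):

import numpy as np

def make_A0(lam, d, m, kind='wpca', rng=None):
    A = np.zeros((d, m))
    if kind == 'wpca':
        A[:, :d] = np.diag(lam[:d] ** -0.5)   # whitening: scale axis i by lambda_i^{-1/2}
    elif kind == 'pca':
        A[:, :d] = np.eye(d)                  # non-whitening PCA: keep the first d axes
    else:
        rng = np.random.default_rng() if rng is None else rng
        A = rng.standard_normal((d, m))       # random projection
    return A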
4 Experimental Results
To evaluate performance on View 2 of the LFW dataset, we used five of the nine training splits as training samples and the remaining four as validation samples in the CSML algorithm (more about the LFW protocol can be found in [7]). These validation samples are also used for training the SVM. All results presented here are produced with the parameters m = 500 and d = 200, where m is the dimension of the data after applying PCA and d is the dimension of the data after applying CSML. In this section, we present the results of two experiments. In the first experiment, we show how much CSML improves over three choices of A_0: Random Projection, PCA, and Whitened PCA, whose transformation matrices we call A_RP, A_PCA, and A_WPCA respectively. Here we used the original intensity feature. As shown in Table 1, A_CSML consistently performs better than A_0, by about 5-10%.

Table 1. A_CSML and A_0 performance comparison

          A_RP              A_PCA             A_WPCA
A_0       0.5752 ± 0.0057   0.6762 ± 0.0075   0.7322 ± 0.0037
A_CSML    0.673 ± 0.0095    0.7112 ± 0.0083   0.7865 ± 0.0039
In the second experiment, we will show how much CSML improves over cosine similarity in the original space and over Whitened PCA with three types of features: Intensity (IN), Local Binary Patterns (LBP), and Gabor Wavelets (GABOR). Each type of feature is tested with the original feature vector or the square root of the feature vector [8, 14, 20].

Table 2. The improvements of CSML over cosine similarity and WPCA

                      Cosine            WPCA              CSML
IN (original)         0.6567 ± 0.0071   0.7322 ± 0.0037   0.7865 ± 0.0039
IN (sqrt)             0.6485 ± 0.0088   0.7243 ± 0.0038   0.7887 ± 0.0052
LBP (original)        0.7027 ± 0.0036   0.7712 ± 0.0044   0.8295 ± 0.0052
LBP (sqrt)            0.6977 ± 0.0047   0.7937 ± 0.0034   0.8557 ± 0.0052
GABOR (original)      0.672 ± 0.0053    0.7558 ± 0.0052   0.8238 ± 0.0021
GABOR (sqrt)          0.6942 ± 0.0072   0.7698 ± 0.0056   0.8358 ± 0.0058
Feature Combination   -                 -                 0.88 ± 0.0037
As shown in Table 2, CSML improves by about 5% over WPCA and by about 10-15% over cosine similarity. LBP seems to perform better than Intensity and Gabor Wavelets. Using the square root of the feature vector improves the accuracy by about 2-3% in most cases. The highest accuracy we can obtain from a single type of feature is 0.8557 ± 0.0052, using CSML with the square root of the LBP feature. The accuracy we obtain by combining the 6 scores corresponding to the 6 different features (the rightmost column of Table 2) is 0.88 ± 0.0038. This is better
Fig. 3. ROC curves averaged over 10 folds of View 2
than the current state-of-the-art result reported in [14]. For comparison purposes, the ROC curves of our method and others are depicted in Figure 3. Complete benchmark results can be found on the LFW website [22].
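A sketch of the score-level fusion described in Section 3.2, assuming scikit-learn; each pair is represented by its 6 CSML similarity scores (the variable and function names are illustrative):

import numpy as np
from sklearn.svm import SVC

def train_fusion(score_vectors, labels):
    # score_vectors: (n_pairs, 6) array of CSML scores, one column per feature/sqrt variant
    # labels: 1 for matching pairs, 0 for non-matching pairs
    clf = SVC(kernel='linear')
    clf.fit(score_vectors, labels)
    return clf

def verify(clf, score_vector):
    # returns 1 if the pair is predicted to show the same person, else 0
    return int(clf.predict(np.asarray(score_vector).reshape(1, -1))[0])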
5 Conclusion
We have introduced a novel method for learning a distance metric based on cosine similarity. The use of cosine similarity allows us to form a simple but effective objective function, which leads to a fast gradient-based optimization algorithm. Another important property of our method is that, in theory, the learned matrix cannot perform worse than the prior matrix it is regularized towards. In practice, it performs considerably better in most cases. We tested our method on the LFW dataset and achieved the highest accuracy reported in the literature. Although CSML was initially designed for face verification, it has a wide range of potential applications, which we plan to explore in future work.
References

1. Phillips, P., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face-recognition algorithms. Image and Vision Computing 16, 295–306 (1998)
2. Shan, S., Zhang, W., Su, Y., Chen, X., Gao, W.: Ensemble of Piecewise FDA Based on Spatial Histograms of Local (Gabor) Binary Patterns for Face Recognition. In: Proceedings of the 18th International Conference on Pattern Recognition, pp. 606–609 (2006)
3. Nguyen, H.V., Bai, L., Shen, L.: Local Gabor binary pattern whitened PCA: A novel approach for face recognition from single image per person. In: Proceedings of the 3rd IAPR/IEEE International Conference on Biometrics (2009)
4. Shen, L., Bai, L.: MutualBoost learning for selecting Gabor features for face recognition. Pattern Recognition Letters 27, 1758–1767 (2006)
5. Shen, L., Bai, L., Fairhurst, M.: Gabor wavelets and general discriminant analysis for face identification and verification. Image and Vision Computing 27, 1758–1767 (2006)
6. Nguyen, H.V., Bai, L.: Compact binary patterns (CBP) with multiple patch classifiers for fast and accurate face recognition. In: CompIMAGE, pp. 187–198 (2010)
7. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (2007)
8. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: International Conference on Computer Vision, pp. 498–505 (2009)
9. Taigman, Y., Wolf, L., Hassner, T.: Multiple one-shots for utilizing class label information. In: The British Machine Vision Conference, BMVC (2009)
10. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighborhood component analysis. In: Advances in Neural Information Processing Systems 17 (2005)
11. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. In: Advances in Neural Information Processing Systems 15, pp. 505–512 (2003)
12. Weinberger, K., Blitzer, J., Saul, L.: Distance metric learning for large margin nearest neighbor classification. In: Advances in Neural Information Processing Systems 18, pp. 1473–1480 (2006)
13. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: ICML 2007: Proceedings of the 24th International Conference on Machine Learning, pp. 209–216. ACM, New York (2007)
14. Wolf, L., Hassner, T., Taigman, Y.: Similarity scores based on background samples. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5995, pp. 88–97. Springer, Heidelberg (2010)
15. Ahonen, T., Hadid, A., Pietikainen, M.: Face Recognition with Local Binary Patterns. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
16. Daugman, J.: Complete Discrete 2D Gabor Transforms by Neural Networks for Image Analysis and Compression. IEEE Trans. Acoust. Speech Signal Process. 36 (1988)
17. Shan, S., Gao, W., Chang, Y., Cao, B., Yang, P.: Review the strength of Gabor features for face recognition from the angle of its robustness to mis-alignment. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 1 (2004)
18. Tan, X., Triggs, B.: Fusing Gabor and LBP Feature Sets for Kernel-Based Face Recognition. In: Zhou, S.K., Zhao, W., Tang, X., Gong, S. (eds.) AMFG 2007. LNCS, vol. 4778, pp. 235–249. Springer, Heidelberg (2007)
19. Zhang, W., Shan, S., Gao, W., Chen, X., Zhang, H.: Local Gabor Binary Pattern Histogram Sequence (LGBPHS): A Novel Non-Statistical Model for Face Representation and Recognition. In: Proc. ICCV, pp. 786–791 (2005)
20. Wolf, L., Hassner, T., Taigman, Y.: Descriptor based methods in the wild. In: Real-Life Images Workshop at the European Conference on Computer Vision, ECCV (2008)
21. Deng, W., Hu, J., Guo, J.: Gabor-Eigen-Whiten-Cosine: A Robust Scheme for Face Recognition. In: Zhao, W., Gong, S., Tang, X. (eds.) AMFG 2005. LNCS, vol. 3723, pp. 336–349. Springer, Heidelberg (2005)
22. LFW benchmark results, http://vis-www.cs.umass.edu/lfw/results.html
Author Index
Abdala, Daniel Duarte IV-373 Abe, Daisuke IV-565 Achtenberg, Albert IV-141 Ackermann, Hanno II-464 Agapito, Lourdes IV-460 Ahn, Jae Hyun IV-513 Ahuja, Narendra IV-501 Ai, Haizhou II-174, II-683, III-171 Akbas, Emre IV-501 Alexander, Andrew L. I-65 An, Yaozu II-282 Ancuti, Codruta O. I-79, II-501 Ancuti, Cosmin I-79, II-501 Argyros, Antonis A. III-744 Arnaud, Elise IV-361 ˚ Astr¨ om, Kalle IV-255 Atasoy, Selen II-41 Azuma, Takeo III-641 Babari, Raouf IV-243 Badino, Hern´ an III-679 Badrinath, G.S. II-321 Bai, Li II-709 Ballan, Luca III-613 Barlaud, Michel III-67 Barnes, Nick I-176, IV-269, IV-410 Bartoli, Adrien III-52 Bekaert, Philippe I-79, II-501 Belhumeur, Peter N. I-39 Bennamoun, Mohammed III-199, IV-115 Ben-Shahar, Ohad II-346 Ben-Yosef, Guy II-346 Binder, Alexander III-95 Bischof, Horst I-397, II-566 Bishop, Tom E. II-186 Biswas, Sujoy Kumar I-244 Bonde, Ujwal D. IV-228 Bowden, Richard I-256, IV-525 Boyer, Edmond IV-592 Br´emond, Roland IV-243 Briassouli, Alexia I-149 Brocklehurst, Kyle III-329, IV-422 Brunet, Florent III-52
Bu, Jiajun III-436 Bujnak, Martin I-11, II-216 Burschka, Darius I-135, IV-474 Byr¨ od, Martin IV-255 Carlsson, Stefan II-1 Cha, Joonhyuk IV-486 Chan, Kwok-Ping IV-51 Chen, Chia-Ping I-355 Chen, Chun III-436 Chen, Chu-Song I-355 Chen, Duowen I-283 Chen, Kai III-121 Chen, Tingwang II-400 Chen, Wei II-67 Chen, Yan Qiu IV-435 Chen, Yen-Wei III-511, IV-39, IV-165 Chen, Yi-Ling III-535 Chen, Zhihu II-137 Chi, Yu-Tseh II-268 Chia, Liang-Tien II-515 Chin, Tat-Jun IV-553 Cho, Nam Ik IV-513 Chu, Wen-Sheng I-355 Chum, Ondˇrej IV-347 Chung, Ronald H-Y. IV-690 Chung, Sheng-Luen IV-90 Collins, Maxwell D. I-65 Collins, Robert T. III-329, IV-422 Cootes, Tim F. I-1 Cosker, Darren IV-189 Cowan, Brett R. IV-385 Cree, Michael J. IV-397 Cremers, Daniel I-53 Dai, Qionghai II-412 Dai, Zhenwen II-137 Danielsson, Oscar II-1 Davis, James W. II-580 de Bruijne, Marleen II-160 Declercq, Arnaud III-422 Deguchi, Koichiro IV-565 De la Torre, Fernando III-679 Delaunoy, Ama¨el I-39, II-55 Denzler, Joachim II-489
De Smet, Micha¨el III-276 Detry, Renaud III-572 Dickinson, Sven I-369, IV-539 Dikmen, Mert IV-501 Ding, Jianwei II-82 Di Stefano, Luigi III-653 Dorothy, Monekosso I-439 Dorrington, Adrian A. IV-397 Duan, Genquan II-683 ´ Dumont, Eric IV-243 El Ghoul, Aymen II-647 Ellis, Liam IV-525 Eng, How-Lung I-439 Er, Guihua II-412 Fan, Yong IV-606 Fang, Tianhong II-633 Favaro, Paolo I-425, II-186 Felsberg, Michael IV-525 Feng, Jufu III-213, III-343 Feng, Yaokai I-296 Feragen, Aasa II-160 Feuerstein, Marco III-409 Fieguth, Paul I-383 F¨ orstner, Wolfgang II-619 Franco, Jean-S´ebastien III-599 Franek, Lucas II-697, IV-373 Fu, Yun II-660 Fujimura, Ikko I-296 Fujiyoshi, Hironobu IV-25 Fukuda, Hisato IV-127 Fukui, Kazuhiro IV-580 Furukawa, Ryo IV-127 Ganesh, Arvind III-314, III-703 Gao, Changxin III-133 Gao, Yan IV-153 Garg, Ravi IV-460 Geiger, Andreas I-25 Georgiou, Andreas II-41 Ghahramani, M. II-388 Gilbert, Andrew I-256 Godbaz, John P. IV-397 Gong, Haifeng II-254 Gong, Shaogang I-161, II-293, II-527 Gopalakrishnan, Viswanath II-15, III-732 Grabner, Helmut I-200 Gu, Congcong III-121
Gu, Steve I-271 Guan, Haibing III-121 Guo, Yimo III-185 Gupta, Phalguni II-321 Hall, Peter IV-189 Han, Shuai I-323 Hao, Zhihui IV-269 Hartley, Richard II-554, III-52, IV-177, IV-281 Hauberg, Søren III-758 Hauti´ere, Nicolas IV-243 He, Hangen III-27 He, Yonggang III-133 Helmer, Scott I-464 Hendel, Avishai III-448 Heo, Yong Seok IV-486 Hermans, Chris I-79, II-501 Ho, Jeffrey II-268 Horaud, Radu IV-592 Hospedales, Timothy M. II-293 Hou, Xiaodi III-225 Hsu, Gee-Sern IV-90 Hu, Die II-672 Hu, Tingbo III-27 Hu, Weiming II-594, III-691, IV-630 Hu, Yiqun II-15, II-515, III-732 Huang, Jia-Bin III-497 Huang, Kaiqi II-67, II-82, II-542 Huang, Thomas S. IV-501 Huang, Xinsheng IV-281 Huang, Yongzhen II-542 Hung, Dao Huu IV-90 Igarashi, Yosuke IV-580 Ikemura, Sho IV-25 Iketani, Akihiko III-109 Imagawa, Taro III-641 Iwama, Haruyuki IV-702 Jank´ o, Zsolt II-55 Jeon, Moongu III-718 Jermyn, Ian H. II-647 Ji, Xiangyang II-412 Jia, Ke III-586 Jia, Yunde II-254 Jiang, Hao I-228 Jiang, Mingyang III-213, III-343 Jiang, Xiaoyi II-697, IV-373 Jung, Soon Ki I-478
Author Index Kakadiaris, Ioannis A. II-633 Kale, Amit IV-592 Kambhamettu, Chandra III-82, III-483, III-627 Kanatani, Kenichi II-242 Kaneda, Kazufumi II-452, III-250 Kang, Sing Bing I-350 Kasturi, Rangachar II-308 Kawabata, Satoshi III-523 Kawai, Yoshihiro III-523 Kawanabe, Motoaki III-95 Kawano, Hiroki I-296 Kawasaki, Hiroshi IV-127 Kemmler, Michael II-489 Khan, R. Nazim III-199 Kikutsugi, Yuta III-250 Kim, Du Yong III-718 Kim, Hee-Dong IV-1 Kim, Hyunwoo IV-333 Kim, Jaewon I-336 Kim, Seong-Dae IV-1 Kim, Sujung IV-1 Kim, Tae-Kyun IV-228 Kim, Wook-Joong IV-1 Kise, Koichi IV-64 Kitasaka, Takayuki III-409 Klinkigt, Martin IV-64 Kompatsiaris, Ioannis I-149 Kopp, Lars IV-255 Kuang, Gangyao I-383 Kuang, Yubin IV-255 Kuk, Jung Gap IV-513 Kukelova, Zuzana I-11, II-216 Kulkarni, Kaustubh IV-592 Kwon, Dongjin I-121 Kyriazis, Nikolaos III-744 Lai, Shang-Hong III-535 Lam, Antony III-157 Lao, Shihong II-174, II-683, III-171 Lauze, Francois II-160 Lee, Kyong Joon I-121 Lee, Kyoung Mu IV-486 Lee, Sang Uk I-121, IV-486 Lee, Sukhan IV-333 Lei, Yinjie IV-115 Levinshtein, Alex I-369 Li, Bing II-594, III-691, IV-630 Li, Bo IV-385 Li, Chuan IV-189
Li, Chunxiao III-213, III-343 Li, Fajie IV-641 Li, Hongdong II-554, IV-177 Li, Hongming IV-606 Li, Jian II-293 Li, Li III-691 Li, Min II-67, II-82 Li, Sikun III-471 Li, Wei II-594, IV-630 Li, Xi I-214 Li, Yiqun IV-153 Li, Zhidong II-606, III-145 Liang, Xiao III-314 Little, James J. I-464 Liu, Jing III-239 Liu, Jingchen IV-102 Liu, Li I-383 Liu, Miaomiao II-137 Liu, Nianjun III-586 Liu, Wei IV-115 Liu, Wenyu III-382 Liu, Yanxi III-329, IV-102, IV-422 Liu, Yong III-679 Liu, Yonghuai II-27 Liu, Yuncai II-660 Llad´ o, X. III-15 Lo, Pechin II-160 Lovell, Brian C. III-547 Lowe, David G. I-464 Loy, Chen Change I-161 Lu, Feng II-412 Lu, Guojun IV-449 Lu, Hanqing III-239 Lu, Huchuan III-511, IV-39, IV-165 Lu, Shipeng IV-165 Lu, Yao II-282 Lu, Yifan II-554, IV-177 Lu, Zhaojin IV-333 Luo, Guan IV-630 Lu´ o, Xi´ ongbi¯ ao III-409 Luo, Ye III-396 Ma, Songde III-239 Ma, Yi III-314, III-703 MacDonald, Bruce A. II-334 MacNish, Cara III-199 Macrini, Diego IV-539 Mahalingam, Gayathri III-82 Makihara, Yasushi I-107, II-440, III-667, IV-202, IV-702
Makris, Dimitrios III-262 Malgouyres, Remy III-52 Mannami, Hidetoshi II-440 Martin, Ralph R. II-27 Matas, Jiˇr´ı IV-347 Matas, Jiri III-770 Mateus, Diana II-41 Matsushita, Yasuyuki I-336, III-703 Maturana, Daniel IV-618 Mauthner, Thomas II-566 McCarthy, Chris IV-410 Meger, David I-464 Mehdizadeh, Maryam III-199 Mery, Domingo IV-618 Middleton, Lee I-200 Moon, Youngsu IV-486 Mori, Atsushi I-107 Mori, Kensaku III-409 Muja, Marius I-464 Mukaigawa, Yasuhiro I-336, III-667 Mukherjee, Dipti Prasad I-244 Mukherjee, Snehasis I-244 M¨ uller, Christina III-95 Nagahara, Hajime III-667, IV-216 Nakamura, Ryo II-109 Nakamura, Takayuki IV-653 Navab, Nassir II-41, III-52 Neumann, Lukas III-770 Nguyen, Hieu V. II-709 Nguyen, Tan Dat IV-665 Nielsen, Frank III-67 Nielsen, Mads II-160 Niitsuma, Hirotaka II-242 Nock, Richard III-67 Oikonomidis, Iasonas III-744 Okabe, Takahiro I-93, I-323 Okada, Yusuke III-641 Okatani, Takayuki IV-565 Okutomi, Masatoshi III-290, IV-76 Okwechime, Dumebi I-256 Ommer, Bj¨ orn II-477 Ong, Eng-Jon I-256 Ortner, Mathias IV-361 Orwell, James III-262 Oskarsson, Magnus IV-255 Oswald, Martin R. I-53 Ota, Takahiro IV-653
Paisitkriangkrai, Sakrapee III-460 Pajdla, Tomas I-11, II-216 Pan, ChunHong II-148, III-560 Pan, Xiuxia IV-641 Paparoditis, Nicolas IV-243 Papazov, Chavdar I-135 Park, Minwoo III-329, IV-422 Park, Youngjin III-355 Pedersen, Kim Steenstrup III-758 Peleg, Shmuel III-448 Peng, Xi I-283 Perrier, R´egis IV-361 Piater, Justus III-422, III-572 Pickup, David IV-189 Pietik¨ ainen, Matti III-185 Piro, Paolo III-67 Pirri, Fiora III-369 Pizarro, Luis IV-460 Pizzoli, Matia III-369 Pock, Thomas I-397 Pollefeys, Marc III-613 Prados, Emmanuel I-39, II-55 Provenzi, E. III-15 Qi, Baojun
III-27
Rajan, Deepu II-15, II-515, III-732 Ramakrishnan, Kalpatti R. IV-228 Ranganath, Surendra IV-665 Raskar, Ramesh I-336 Ravichandran, Avinash I-425 Ray, Nilanjan III-39 Raytchev, Bisser II-452, III-250 Reddy, Vikas III-547 Reichl, Tobias III-409 Remagnino, Paolo I-439 Ren, Zhang I-176 Ren, Zhixiang II-515 Rodner, Erik II-489 Rohith, MV III-627 Rosenhahn, Bodo II-426, II-464 Roser, Martin I-25 Rosin, Paul L. II-27 Roth, Peter M. II-566 Rother, Carsten I-53 Roy-Chowdhury, Amit K. III-157 Rudi, Alessandro III-369 Rudoy, Dmitry IV-307 Rueckert, Daniel IV-460 Ruepp, Oliver IV-474
Author Index Sagawa, Ryusuke III-667 Saha, Baidya Nath III-39 Sahbi, Hichem I-214 Sakaue, Fumihiko II-109 Sala, Pablo IV-539 Salti, Samuele III-653 Salvi, J. III-15 Sanderson, Conrad III-547 Sang, Nong III-133 Sanin, Andres III-547 Sankaranarayananan, Karthik II-580 Santner, Jakob I-397 ˇ ara, Radim I-450 S´ Sato, Imari I-93, I-323 Sato, Jun II-109 Sato, Yoichi I-93, I-323 Savoye, Yann III-599 Scheuermann, Bj¨ orn II-426 Semenovich, Dimitri I-490 Senda, Shuji III-109 Shah, Shishir K. II-230, II-633 Shahiduzzaman, Mohammad IV-449 Shang, Lifeng IV-51 Shelton, Christian R. III-157 Shen, Chunhua I-176, III-460, IV-269, IV-281 Shi, Boxin III-703 Shibata, Takashi III-109 Shimada, Atsushi IV-216 Shimano, Mihoko I-93 Shin, Min-Gil IV-293 Shin, Vladimir III-718 Sigal, Leonid III-679 Singh, Vikas I-65 Sminchisescu, Cristian I-369 Somanath, Gowri III-483 Song, Li II-672 Song, Ming IV-606 Song, Mingli III-436 Song, Ran II-27 ´ Soto, Alvaro IV-618 Sowmya, Arcot I-490, II-606 Stol, Karl A. II-334 Sturm, Peter IV-127, IV-361 Su, Hang III-302 Su, Te-Feng III-535 Su, Yanchao II-174 Sugimoto, Shigeki IV-76 Sung, Eric IV-11 Suter, David IV-553
Swadzba, Agnes II-201 Sylwan, Sebastian I-189 Szir´ anyi, Tam´ as IV-321 Szolgay, D´ aniel IV-321 Tagawa, Seiichi I-336 Takeda, Takahishi II-452 Takemura, Yoshito II-452 Tamaki, Toru II-452, III-250 Tan, Tieniu II-67, II-82, II-542 Tanaka, Masayuki III-290 Tanaka, Shinji II-452 Taneja, Aparna III-613 Tang, Ming I-283 Taniguchi, Rin-ichiro IV-216 Tao, Dacheng III-436 Teoh, E.K. II-388 Thida, Myo I-439 Thomas, Stephen J. II-334 Tian, Qi III-239, III-396 Tian, Yan III-679 Timofte, Radu I-411 Tomasi, Carlo I-271 Tombari, Federico III-653 T¨ oppe, Eno I-53 Tossavainen, Timo III-1 Trung, Ngo Thanh III-667 Tyleˇcek, Radim I-450 Uchida, Seiichi I-296 Ugawa, Sanzo III-641 Urtasun, Raquel I-25 Vakili, Vida II-123 Van Gool, Luc I-200, I-411, III-276 Vega-Pons, Sandro IV-373 Veksler, Olga II-123 Velastin, Sergio A. III-262 Veres, Galina I-200 Vidal, Ren´e I-425 Wachsmuth, Sven II-201 Wada, Toshikazu IV-653 Wagner, Jenny II-477 Wang, Aiping III-471 Wang, Bo IV-269 Wang, Hanzi IV-630 Wang, Jian-Gang IV-11 Wang, Jinqiao III-239 Wang, Lei II-554, III-586, IV-177
Wang, LingFeng II-148, III-560 Wang, Liwei III-213, III-343 Wang, Nan III-171 Wang, Peng I-176 Wang, Qing II-374, II-400 Wang, Shao-Chuan I-310 Wang, Wei II-95, III-145 Wang, Yang II-95, II-606, III-145 Wang, Yongtian III-703 Wang, Yu-Chiang Frank I-310 Weinshall, Daphna III-448 Willis, Phil IV-189 Wojcikiewicz, Wojciech III-95 Won, Kwang Hee I-478 Wong, Hoi Sim IV-553 Wong, Kwan-Yee K. II-137, IV-690 Wong, Wilson IV-115 Wu, HuaiYu III-560 Wu, Lun III-703 Wu, Ou II-594 Wu, Tao III-27 Wu, Xuqing II-230 Xiang, Tao I-161, II-293, II-527 Xiong, Weihua II-594 Xu, Changsheng III-239 Xu, Dan II-554, IV-177 Xu, Jie II-95, II-606, III-145 Xu, Jiong II-374 Xu, Zhengguang III-185 Xue, Ping III-396 Yaegashi, Keita II-360 Yagi, Yasushi I-107, I-336, II-440, III-667, IV-202, IV-702 Yamaguchi, Takuma IV-127 Yamashita, Takayoshi II-174 Yan, Ziye II-282 Yanai, Keiji II-360 Yang, Chih-Yuan III-497 Yang, Ehwa III-718 Yang, Fan IV-39 Yang, Guang-Zhong II-41 Yang, Hua III-302 Yang, Jie II-374 Yang, Jun II-95, II-606, III-145 Yang, Ming-Hsuan II-268, III-497 Yau, Wei-Yun IV-11 Yau, W.Y. II-388 Ye, Getian II-95
Yin, Fei III-262 Yoo, Suk I. III-355 Yoon, Kuk-Jin IV-293 Yoshida, Shigeto II-452 Yoshimuta, Junki II-452 Yoshinaga, Satoshi IV-216 Young, Alistair A. IV-385 Yu, Jin IV-553 Yu, Zeyun II-148 Yuan, Chunfeng III-691 Yuan, Junsong III-396 Yuk, Jacky S-C. IV-690 Yun, Il Dong I-121 Zappella, L. III-15 Zeevi, Yehoshua Y. IV-141 Zelnik-Manor, Lihi IV-307 Zeng, Liang III-471 Zeng, Zhihong II-633 Zerubia, Josiane II-647 Zhang, Bang II-606 Zhang, Chunjie III-239 Zhang, Dengsheng IV-449 Zhang, Hong III-39 Zhang, Jian III-460 Zhang, Jing II-308 Zhang, Liqing III-225 Zhang, Luming III-436 Zhang, Wenling III-511 Zhang, Xiaoqin IV-630 Zhang, Zhengdong III-314 Zhao, Guoying III-185 Zhao, Xu II-660 Zhao, Youdong II-254 Zheng, Hong I-176 Zheng, Huicheng IV-677 Zheng, Qi III-121 Zheng, Shibao III-302 Zheng, Wei-Shi II-527 Zheng, Ying I-271 Zheng, Yinqiang IV-76 Zheng, Yongbin IV-281 Zhi, Cheng II-672 Zhou, Bolei III-225 Zhou, Quan III-382 Zhou, Yi III-121 Zhou, Yihao IV-435 Zhou, Zhuoli III-436 Zhu, Yan II-660