Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5994
Hongbin Zha Rin-ichiro Taniguchi Stephen Maybank (Eds.)
Computer Vision – ACCV 2009 9th Asian Conference on Computer Vision Xi’an, September 23-27, 2009 Revised Selected Papers, Part I
Volume Editors

Hongbin Zha
Peking University, Department of Machine Intelligence
Beijing, 100871, China
E-mail: [email protected]

Rin-ichiro Taniguchi
Kyushu University, Department of Advanced Information Technology
Fukuoka, 819-0395, Japan
E-mail: [email protected]

Stephen Maybank
University of London, Birkbeck College, Department of Computer Science
London, WC1E 7HX, UK
E-mail: [email protected]
Library of Congress Control Number: 2010923506
CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-642-12306-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-12306-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
It gives us great pleasure to present the proceedings of the 9th Asian Conference on Computer Vision (ACCV 2009), held in Xi’an, China, in September 2009. This was the first ACCV conference to take place in mainland China. We received a total of 670 full submissions, which is a new record in the ACCV series. Overall, 35 papers were selected for oral presentation and 131 as posters, yielding acceptance rates of 5.2% for oral, 19.6% for poster, and 24.8% in total. In the paper reviewing, we continued the tradition of previous ACCVs by conducting the process in a double-blind manner. Each of the 33 Area Chairs received a pool of about 20 papers and nominated a number of potential reviewers for each paper. Then, Program Committee Chairs allocated at least three reviewers to each paper, taking into consideration any conflicts of interest and the balance of loads. Once the reviews were finished, the Area Chairs made summary reports for the papers in their pools, based on the reviewers’ comments and on their own assessments of the papers. The Area Chair meeting was held at Peking University, Beijing during July 6–7, 2009. Thirty-one Area Chairs attended the meeting. They were divided into eight groups. The reviews and summary reports for the papers were discussed within the groups, in order to establish the scientific contribution of each paper. Area Chairs were permitted to confer with pre-approved “consulting” Area Chairs outside their groups if needed. The final acceptance decisions were made at a meeting of all the Area Chairs. Finally, the Program Chairs drew up a single-track technical program which consisted of 12 oral sessions and three poster sessions for the three-day conference. We are glad to see that all of the oral speakers presented their papers at the conference. The program included three plenary sessions in which world-leading researchers, Roberto Cipolla (University of Cambridge), Larry S. Davis (University of Maryland), and Long Quan (Hong Kong University of Science and Technology), gave their talks. We would like to thank them for their respective presentations on 3D shape acquisition, human tracking and image-based modeling, which were both inspiring and entertaining. A conference like ACCV 2009 would not be possible without the concerted effort of many people and the support of various institutions. We would like to thank the ACCV 2009 Area Chairs and members of the Technical Program Committee for their time and effort spent in reviewing the submissions. The local arrangement team, led by Yanning Zhang, did a terrific job in organizing the conference. We also thank Katsushi Ikeuchi, Tieniu Tan, and Yasushi Yagi, whose help was critical at many stages of the conference organization. Last but
not least, we would like to thank all of the attendees of the conference. Due to their active participation, this was one of the most successful conferences in the history of the ACCV series. December 2009
Hongbin Zha Rin-ichiro Taniguchi Stephen Maybank
Organization
Honorary Chairs
Yunhe Pan (Chinese Academy of Engineering, China) Songde Ma (Institute of Automation, Chinese Academy of Science, China) Katsushi Ikeuchi (University of Tokyo, Japan)
General Chairs
Tieniu Tan (Institute of Automation, Chinese Academy of Science, China) Nanning Zheng (Xi’an Jiaotong University, China) Yasushi Yagi (Osaka University, Japan)
Program Chairs
Hongbin Zha (Peking University, China) Rin-ichiro Taniguchi (Kyushu University, Japan) Stephen Maybank (University of London, UK)
Organization Chairs
Yanning Zhang (Northwestern Polytechnical University, China) Jianru Xue (Xi’an Jiaotong University, China)
Workshop Chairs
Octavia Camps (Northeastern University, USA) Yasuyuki Matsushita (Microsoft Research Asia, China)
Tutorial Chairs
Yunde Jia (Beijing Institute of Technology, China)
Demo Chairs
Dacheng Tao (Nanyang Technological University, Singapore)
Publication Chairs
Ying Li (Northwestern Polytechnical University, China) Kaiqi Huang (Institute of Automation, Chinese Academy of Science, China)
Publicity Chairs
Bin Luo (Anhui University, China) Chil-Woo Lee (Chonnam National University, Korea) Hichem Sahli (Vrije University Brussel, Belgium)
Area Chairs
Noboru Babaguchi (Osaka University) Horst Bischof (Technical University Graz) Chu-Song Chen (Institute of Information Science, Academia Sinica) Jan-Michael Frahm (University of North Carolina at Chapel Hill) Pascal Fua (EPFL: École Polytechnique Fédérale de Lausanne) Dewen Hu (National University of Defense Technology) Zhanyi Hu (Institute of Automation, Chinese Academy of Science) Yi-ping Hung (National Taiwan University) Ron Kimmel (Technion - Israel Institute of Technology) Reinhard Klette (University of Auckland) Takio Kurita (National Institute of Advanced Industrial Science and Technology) Chil-Woo Lee (Chonnam National University) Kyoung Mu Lee (Seoul National University) Fei-Fei Li (Stanford University) Zhouchen Lin (Microsoft Research Asia) Kai-Kuang Ma (Nanyang Technological University) P.J. Narayanan (International Institute of Information Technology, Hyderabad) Nassir Navab (Technische Universität München) Tomas Pajdla (Czech Technical University in Prague) Robert Pless (Washington University) Long Quan (The Hong Kong University of Science and Technology) Jim Rehg (Georgia Institute of Technology) Ian Reid (Oxford University) Richard Wildes (York University) Hideo Saito (Keio University) Nicu Sebe (University of Amsterdam) Peter Sturm (INRIA)
Akihiro Sugimoto (National Institute of Informatics) David Suter (University of Adelaide) Guangyou Xu (Tsinghua University) Yaser Yacoob (University of Maryland) Ming-Hsuan Yang (University of California at Merced) Hong Zhang (University of Alberta)

Committee Members
Zhonghua Fu (Northwestern Polytechnical University, China) Dongmei Jiang (Northwestern Polytechnical University, China) Kuizhi Mei (Xi’an Jiaotong University, China) Yuru Pei (Peking University, China) Jinqiu Sun (Northwestern Polytechnical University, China) Fei Wang (Xi’an Jiaotong University, China) Huai-Yu Wu (Peking University, China) Runping Xi (Northwestern Polytechnical University, China) Lei Xie (Northwestern Polytechnical University, China) Xianghua Ying (Peking University, China) Gang Zeng (Peking University, China) Xinbo Zhao (Northwestern Polytechnical University, China) Jiangbin Zheng (Northwestern Polytechnical University, China)
Reviewers

Abou Moustafa Karim Achard Catherine Ai Haizhou Alahari Karteek Allili Mohand Said Andreas Koschan Aoki Yoshmitsu Argyros Antonis Arica Nafiz Ariki Yasuo Arslan Abdullah August Jonas Awate Suyash
Azevedo-Marques Paulo Bagdanov Andrew Bai Xiang Bajcsy Peter Baltes Jacky Banerjee Subhashis Barbu Adrian Barnes Nick Barreto Joao Bartoli Adrien Baudrier Etienne Baust Maximilian Beichel Reinhard
Beng-Jin Andrew Teoh Benhimane Selim Benosman Ryad Bibby Charles Bicego Manuele Blekas Konstantinos Bo Liefeng Bors Adrian Boshra Michael Bouaziz Sofien Bouguila Nizar Boutemedjet Sabri Branzan Albu Alexandra
Bremond Francois Bronstein Alex Bronstein Michal Brown Matthew Brown Michael Brun Luc Buckley Michael Caballero Rodrigo Caglioti Vincenzo Cagniart Cedric Camastra Francesco Cao Liangliang Cao Liangliang Carneiro Gustavo Carr Peter Castellani Umberto Cattin Philippe Celik Turgay Chan Kap Luk Chandran Sharat Chellappa Rama Chen Haifeng Chen Hwann-Tzong Chen Jiun-Hung Chen Jun-Cheng Chen Ling Chen Pei Chen Robin Bing-Yu Chen Wei-Chao Chen Xilin Chen Yixin Cheng Jian Cheng Jian Cheng Shyi-Chyi Chetverikov Dmitry Chia Liang-Tien Chien Shao-Yi Chin Tat-jun Chu Wei-Ta Chuang Yung-Yu Chung Albert Civera Javier Clipp Brian Coleman Sonya Costeira Joao Paulo
Cousty Jean Csaba Beleznai Dang Xin Daras Petros De La Torre Fernando Deguchi Koichiro Demirci Fatih Demirdjian David Deng Hongli Deniz Oscar Denzler Joachim Derpanis Konstantinos Derrode Stephane Destefanis Eduardo Dick Anthony Didas Stephan Dong qiulei Donoser Michael Doretto Gianfranco Drbohlav Ondrej Drost Bertram Duan Fuqing Dueck Delbert Duric Zoran Dutagaci Helin Dutta Roy Sumantra Dutta Roy Dvornychenko Vladimir Dyer Charles Eckhardt Ulrich Eigensatz Michael Einhauser Wolfgang Eroglu Erdem Cigdem Escolano Francisco Fan Quanfu Fang Wen-Pinn Farenzena MIchela Fasel Beat Feng Jianjiang Feris Rogerio Ferri Francesc Fidler Sanja Fihl Preben Filliat David Flitti Farid
Floery Simon Forstner Wolfgang Franco Jean-Sebastien Fraundorfer Friedrich Fritz Mario Frucci Maria Fu Chi-Wing Fuh Chiou-Shann Fujiyoshi Hironobu Fukui Kazuhiro Fumera Giorgio Furst Jacob Furukawa Yasutaka Fusiello Andrea Gall Juergen Gallagher Andrew Gang Li Garg Kshitiz Georgel Pierre Gertych Arkadiusz Gevers Theo Gherardi Riccardo Godil Afzal Goecke Roland Goshtasby A. Gou Gangpeng Grabner Helmut Grana Costantino Guerrero Josechu Guest Richard Guliato Denise Guo Feng Guo Guodong Gupta Abhinav Gupta Mohit Hadjileontiadis Leontios Hamsici Onur Han Bohyung Han Chin-Chuan Han Joon Hee Hanbury Allan Hao Wei Hassab Elgawi Osman Hautiere Nicolas He Junfeng
Heitz Fabrice Hinterstoisser Stefan Ho Jeffrey Holzer Stefan Hong Hyun Ki Hotta Kazuhiro Hotta Seiji Hou Zujun Hsiao JenHao Hsu Pai-Hui Hsu Winston Hu Qinghua Hu Weimin Hu Xuelei Hu Yiqun Hu Yu-Chen Hua Xian-Sheng Huang Fay Huang Kaiqi Huang Peter Huang Tz-Huan Huang Xiangsheng Huband Jacalyn Huele Ruben Hung Hayley Hung-Kuo Chu James Huynh Cong Iakovidis Dimitris Ieng Sio Hoi Ilic Slobodan Imiya Atsushi Inoue Kohei Irschara Arnold Ishikawa Hiroshi Iwashita Yumi Jaeger Stefan Jafari Khouzani Kourosh Jannin Pierre Jawahar C. V. Jean Frederic Jia Jiaya Jia Yunde Jia Zhen Jiang Shuqiang Jiang Xiaoyi
Jin Lianwen Juneho Yi Jurie Frederic Kagami Shingo Kale Amit Kamberov George Kankanhalli Mohan Kato Takekazu Kato Zoltan Kawasaki Hiroshi Ke Qifa Keil Andreas Keysers Daniel Khan Saad-Masood Kim Hansung Kim Kyungnam Kim Tae Hoon Kim Tae-Kyun Kimia Benjamin Kitahara Itaru Koepfler Georges Koeser Kevin Kokkinos Iasonas Kolesnikov Alexander Kong Adams Konolige Kurt Koppal Sanjeev Kotsiantis Sotiris Kr Er Norbert Kristan Matej Kuijper Arjan Kukelova Zuzana Kulic Dana Kulkarni Kaustubh Kumar Ram Kuno Yoshinori Kwolek Bogdan Kybic Jan Ladikos Alexander Lai Shang-Hong Lao Shihong Lao Zhiqiang Lazebnik Svetlana Le Duy-Dinh Le Khoa
Lecue Guillaume Lee Hyoung-Joo Lee Ken-Yi Lee Kyong Joon Lee Sang-Chul Leonardo Bocchi Lepetit Vincent Lerasle Frederic Li Baihua Li Bo Li Hongdong Li Teng Li Xi Li Yongmin Liao T.Warren Lie Wen-Nung Lien Jenn-Jier James Lim Jongwoo Lim Joo-Hwee Lin Dahua Lin Huei-Yung Lin Ruei-Sung Lin Wei-Yang Lin Wen Chieh (Steve) Ling Haibin Liu Jianzhuang Liu Ming-Yu Liu Qingshan Liu Qingzhong Liu Tianming Liu Tyng-Luh Liu Xiaoming Liu Xiuwen Liu Yuncai Lopez-Nicolas Gonzalo Lu Juwei Lu Le Luo Jiebo Ma Yong Macaire Ludovic Maccormick John Madabhushi Anant Manabe Yoshitsugu Manniesing Rashindra Marchand Eric
Marcialis Gian-Luca Martinet Jean Martinez Aleix Masuda Takeshi Mauthner Thomas McCarthy Chris McHenry Kenton Mei Christopher Mei Tao Mery Domingo Mirmehdi Majid Mitra Niloy Mittal Anurag Miyazaki Daisuke Moeslund Thomas Monaco Francisco Montiel Jose Mordohai Philippos Moreno Francesc Mori Kensaku Moshe Ben-Ezra Mudigonda Pawan Mueller Henning Murillo Ana Cris Naegel Benoit Nakajima Shin-ichi Namboodiri Anoop Nan Xiaofei Nanni Loris Narasimhan Srinivasa Nevatia Ram Ng Wai-Seng Nguyen Minh Hoai Nozick Vincent Odone Francesca Ohnishi Naoya Okatani Takayuki Okuma Kenji Omachi Shinichiro Pack Gary Palagyi Kalman Pan ChunHong Pankanti Sharath Paquet Thierry Park In Kyu
Park Jong-Il Park Rae-Hong Passat Nicolas Patras Yiannis Patwardhan Kedar Peers Pieter Peleg Shmuel Pernici Federico Pilet Julien Pless Robert Pock Thomas Prati Andrea Prevost Lionel Puig Luis Qi Guojun Qian Zhen Radeva Petia Rajashekar Umesh Ramalingam Srikumar Ren Xiaofeng Reyes Edel Garcia Reyes Aldasoro Constantino Ribeiro Eraldo Robles-Kelly Antonio Rosenhahn Bodo Rosman Guy Ross Arun Roth Peter Rugis John Ruvolo Paul Sadri Javad Saffari Amir Sagawa Ryusuke Salzmann Mathieu Sang Nong Santner Jakob Sappa Angel Sara Radim Sarkis Michel Sato Jun Schmid Natalia Schroff Florian Shahrokni Ali Shan Shiguang
Shen Chunhua Shi Guangchuan Shih Sheng-Wen Shimizu Ikuko Shimshoni Ilan Sigal Leonid Singhal Nitin Sinha Sudipta Snavely Noah Sommerlade Eric Steenstrup Pedersen Kim Sugaya Yasuyuki Sukno Federico Sumi Yasushi Sun Feng-Mei Sun Weidong Svoboda Tomas Swaminathan Rahul Takamatsu Jun Tan Ping Tan Robby Tang Chi-Keung Tang Ming Teng Fei Tian Jing Tian Yingli Tieu Kinh Tobias Reichl Toews Matt Toldo Roberto Tominaga Shoji Torii Akihiko Tosato Diego Trobin Werner Tsin Yanghai Tu Jilin Tuzel Oncel Uchida Seiichi Urahama K Urschler Martin Van den Hengel Anton Vasseur Pascal Veeraraghavan Ashok Veksler Olga Vitria Jordi
Wagan Asim Wang Hanzi Wang Hongcheng Wang Jingdong Wang Jue Wang Meng Wang Sen Wang Yunhong Wang Zhi-Heng Wei Hong Whitehill Jacob Wilburn Bennett Woehler Christian Wolf Christian Woo Young Woon Wu Fuchao Wu Hao Wu Huai-Yu Wu Jianxin Wu Yihong
Xiong Ziyou Xu Ning Xue Jianru Xue Jianxia Yan Shuicheng Yanai Keiji Yang Herbert Yang Ming-Hsuan Yao Yi Yaron Caspi Yeh Che-Hua Yilmaz Alper Yin Pei Yu Tianli Yu Ting Yuan Baozong Yuan Lu Zach Christopher Zha Zheng-Jun Zhang Changshui
Zhang Guangpeng Zhang Hongbin Zhang Li Zhang Liqing Zhang Xiaoqin Zhang Zengyin Zhao Deli Zhao Yao Zheng Wenming Zhong Baojiang Zhou Cathy Zhou Howard Zhou Jun Zhou Rong Zhou S. Zhu Feng Zhu Wenjun Zitnick Lawrence
Sponsors

Key Laboratory of Machine Perception (MOE), Peking University. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. National Natural Science Foundation of China. Microsoft Research. Fujitsu Inc. Microview Inc. Luster Inc.
Table of Contents – Part I
Oral Session 1: Multiple View and Stereo Multiple View Reconstruction of a Quadric of Revolution from Its Occluding Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pierre Gurdjos, Vincent Charvillat, Géraldine Morin, and Jérôme Guénard
1
Robust Focal Length Estimation by Voting in Multi-view Scene Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Bujnak, Zuzana Kukelova, and Tomas Pajdla
13
Support Aggregation via Non-linear Diffusion with Disparity-Dependent Support-Weights for Stereo Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kuk-Jin Yoon, Yekeun Jeong, and In So Kweon
25
Oral Session 2: Face and Pose Analysis Manifold Estimation in View-Based Feature Space for Face Synthesis across Poses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinyu Huang, Jizhou Gao, Sen-ching S. Cheung, and Ruigang Yang Estimating Human Pose from Occluded Images . . . . . . . . . . . . . . . . . . . . . . Jia-Bin Huang and Ming-Hsuan Yang
37 48
Head Pose Estimation Based on Manifold Embedding and Distance Metric Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiangyang Liu, Hongtao Lu, and Daqiang Zhang
61
3D Reconstruction of Human Motion and Skeleton from Uncalibrated Monocular Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yen-Lin Chen and Jinxiang Chai
71
Oral Session 3: Motion Analysis and Tracking Mean-Shift Object Tracking with a Novel Back-Projection Calculation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . LingFeng Wang, HuaiYu Wu, and ChunHong Pan
83
A Shape Derivative Based Approach for Crowd Flow Segmentation . . . . . Si Wu, Zhiwen Yu, and Hau-San Wong
93
Learning Group Activity in Soccer Videos from Local Motion . . . . . . . . . . Yu Kong, Weiming Hu, Xiaoqin Zhang, Hanzi Wang, and Yunde Jia
103
Combining Discriminative and Descriptive Models for Tracking . . . . . . . . Jing Zhang, Duowen Chen, and Ming Tang
113
Oral Session 4: Segmentation From Ramp Discontinuities to Segmentation Tree . . . . . . . . . . . . . . . . . . . . Emre Akbas and Narendra Ahuja Natural Image Segmentation with Adaptive Texture and Boundary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shankar R. Rao, Hossein Mobahi, Allen Y. Yang, S. Shankar Sastry, and Yi Ma
123
135
Gradient Vector Flow over Manifold for Active Contours . . . . . . . . . . . . . . Shaopei Lu and Yuanquan Wang
147
3D Motion Segmentation Using Intensity Trajectory . . . . . . . . . . . . . . . . . . Hua Yang, Greg Welch, Jan-Michael Frahm, and Marc Pollefeys
157
Oral Session 5: Feature Extraction and Object Detection Vehicle Headlights Detection Using Markov Random Fields . . . . . . . . . . . Wei Zhang, Q.M. Jonathan Wu, and Guanghui Wang
169
A Novel Visual Organization Based on Topological Perception . . . . . . . . . Yongzhen Huang, Kaiqi Huang, Tieniu Tan, and Dacheng Tao
180
Multilevel Algebraic Invariants Extraction by Incremental Fitting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bo Zheng, Jun Takamatsu, and Katsushi Ikeuchi Towards Robust Object Detection: Integrated Background Modeling Based on Spatio-temporal Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Tanaka, Atsushi Shimada, Rin-ichiro Taniguchi, Takayoshi Yamashita, and Daisaku Arita
190
201
Oral Session 6: Image Enhancement and Visual Attention Image Enhancement of Low-Light Scenes with Near-Infrared Flash Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sosuke Matsui, Takahiro Okabe, Mihoko Shimano, and Yoichi Sato
213
A Novel Hierarchical Model of Attention: Maximizing Information Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yang Cao and Liqing Zhang
224
Interactive Shadow Removal from a Single Image Using Hierarchical Graph Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daisuke Miyazaki, Yasuyuki Matsushita, and Katsushi Ikeuchi Visual Saliency Based on Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . Yin Li, Yue Zhou, Junchi Yan, Zhibin Niu, and Jie Yang
234
246
Oral Session 7: Machine Learning Algorithms for Vision Evolving Mean Shift with Adaptive Bandwidth: A Fast and Noise Robust Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qi Zhao, Zhi Yang, Hai Tao, and Wentai Liu
258
An Online Framework for Learning Novel Concepts over Multiple Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luo Jie, Francesco Orabona, and Barbara Caputo
269
Efficient Partial Shape Matching of Outer Contours . . . . . . . . . . . . . . . . . . Michael Donoser, Hayko Riemenschneider, and Horst Bischof
281
Level Set Segmentation Based on Local Gaussian Distribution Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Li Wang, Jim Macione, Quansen Sun, Deshen Xia, and Chunming Li
Oral Session 8: Object Categorization and Face Recognition Categorization of Multiple Objects in a Scene without Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Yang, Nanning Zheng, Mei Chen, Yang Yang, and Jie Yang
303
Distance-Based Multiple Paths Quantization of Vocabulary Tree for Object and Scene Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heng Yang, Qing Wang, and Ellen Yi-Luen Do
313
Image-Set Based Face Recognition Using Boosted Global and Local Principal Angles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xi Li, Kazuhiro Fukui, and Nanning Zheng
323
Incorporating Spatial Correlogram into Bag-of-Features Model for Scene Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yingbin Zheng, Hong Lu, Cheng Jin, and Xiangyang Xue
333
Oral Session 9: Biometrics and Surveillance Human Action Recognition under Log-Euclidean Riemannian Metric . . . Chunfeng Yuan, Weiming Hu, Xi Li, Stephen Maybank, and Guan Luo
343
Clustering-Based Descriptors for Fingerprint Indexing and Fast Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shihua He, Chao Zhang, and Pengwei Hao
354
Temporal-Spatial Local Gaussian Process Experts for Human Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xu Zhao, Yun Fu, and Yuncai Liu
364
Finger-Vein Recognition Based on a Bank of Gabor Filters . . . . . . . . . . . . Jinfeng Yang, Yihua Shi, and Jinli Yang
374
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
385
Table of Contents – Part II
Poster Session 1: Stereo, Motion Analysis, and Tracking A Dynamic Programming Approach to Maximizing Tracks for Structure from Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonathan Mooser, Suya You, Ulrich Neumann, Raphael Grasset, and Mark Billinghurst
1
Dense and Accurate Spatio-temporal Multi-view Stereovision . . . . . . . . . . Jérôme Courchay, Jean-Philippe Pons, Pascal Monasse, and Renaud Keriven
11
Semi-supervised Feature Selection for Gender Classification . . . . . . . . . . . Jing Wu, William A.P. Smith, and Edwin R. Hancock
23
Planar Scene Modeling from Quasiconvex Subproblems . . . . . . . . . . . . . . . Visesh Chari, Anil Nelakanti, Chetan Jakkoju, and C.V. Jawahar
34
Fast Depth Map Compression and Meshing with Compressed Tritree . . . Michel Sarkis, Waqar Zia, and Klaus Diepold
44
A Three-Phase Approach to Photometric Calibration for Multiprojector Display Using LCD Projectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Zhang, Siyu Liang, Bo Qin, and Zhongding Jiang
56
Twisted Cubic: Degeneracy Degree and Relationship with General Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tian Lan, YiHong Wu, and Zhanyi Hu
66
Two-View Geometry and Reconstruction under Quasi-perspective Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guanghui Wang and Q.M. Jonathan Wu
78
Similarity Scores Based on Background Samples . . . . . . . . . . . . . . . . . . . . . Lior Wolf, Tal Hassner, and Yaniv Taigman
88
Human Action Recognition Using Spatio-temporal Classification . . . . . . . Chin-Hsien Fang, Ju-Chin Chen, Chien-Chung Tseng, and Jenn-Jier James Lien
98
Face Alignment Using Boosting and Evolutionary Search . . . . . . . . . . . . . . Hua Zhang, Duanduan Liu, Mannes Poel, and Anton Nijholt
110
Tracking Eye Gaze under Coordinated Head Rotations with an Ordinary Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Haibo Wang, Chunhong Pan, and Christophe Chaillou
120
Orientation and Scale Invariant Kernel-Based Object Tracking with Probabilistic Emphasizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kwang Moo Yi, Soo Wan Kim, and Jin Young Choi
130
Combining Edge and Color Features for Tracking Partially Occluded Humans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mandar Dixit and K.S. Venkatesh
140
Incremental Multi-view Face Tracking Based on General View Manifold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Wei and Yanning Zhang
150
Hierarchical Model for Joint Detection and Tracking of Multi-target . . . . Jianru Xue, Zheng Ma, and Nanning Zheng
160
Heavy-Tailed Model for Visual Tracking via Robust Subspace Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daojing Wang, Chao Zhang, and Pengwei Hao
172
Efficient Scale-Space Spatiotemporal Saliency Tracking for Distortion-Free Video Retargeting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gang Hua, Cha Zhang, Zicheng Liu, Zhengyou Zhang, and Ying Shan
182
Visual Saliency Based Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Geng Zhang, Zejian Yuan, Nanning Zheng, Xingdong Sheng, and Tie Liu
193
People Tracking and Segmentation Using Efficient Shape Sequences Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junqiu Wang, Yasushi Yagi, and Yasushi Makihara
204
Monocular Template-Based Tracking of Inextensible Deformable Surfaces under L2 -Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shuhan Shen, Wenhuan Shi, and Yuncai Liu
214
A Graph-Based Feature Combination Approach to Object Tracking . . . . Quang Anh Nguyen, Antonio Robles-Kelly, and Jun Zhou
224
A Smarter Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaoqin Zhang, Weiming Hu, and Steve Maybank
236
Robust Real-Time Multiple Target Tracking . . . . . . . . . . . . . . . . . . . . . . . . . Nicolai von Hoyningen-Huene and Michael Beetz
247
Dynamic Kernel-Based Progressive Particle Filter for 3D Human Motion Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shih-Yao Lin and I-Cheng Chang
257
Bayesian 3D Human Body Pose Tracking from Depth Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youding Zhu and Kikuo Fujimura Crowd Flow Characterization with Optimal Control Theory . . . . . . . . . . . Pierre Allain, Nicolas Courty, and Thomas Corpetti
267 279
Human Action Recognition Using HDP by Integrating Motion and Location Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuo Ariki, Takuya Tonaru, and Tetsuya Takiguchi
291
Detecting Spatiotemporal Structure Boundaries: Beyond Motion Discontinuities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Konstantinos G. Derpanis and Richard P. Wildes
301
An Accelerated Human Motion Tracking System Based on Voxel Reconstruction under Complex Environments . . . . . . . . . . . . . . . . . . . . . . . . Junchi Yan, Yin Li, Enliang Zheng, and Yuncai Liu
313
Automated Center of Radial Distortion Estimation, Using Active Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hamed Rezazadegan Tavakoli and Hamid Reza Pourreza
325
Rotation Averaging with Application to Camera-Rig Calibration . . . . . . . Yuchao Dai, Jochen Trumpf, Hongdong Li, Nick Barnes, and Richard Hartley Single-Camera Multi-baseline Stereo Using Fish-Eye Lens and Mirrors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Jiang, Masao Shimizu, and Masatoshi Okutomi Generation of an Omnidirectional Video without Invisible Areas Using Image Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Norihiko Kawai, Kotaro Machikita, Tomokazu Sato, and Naokazu Yokoya
335
347
359
Accurate and Efficient Cost Aggregation Strategy for Stereo Correspondence Based on Approximated Joint Bilateral Filtering . . . . . . Stefano Mattoccia, Simone Giardino, and Andrea Gambini
371
Detecting Critical Configurations for Dividing Long Image Sequences for Factorization-Based 3-D Scene Reconstruction . . . . . . . . . . . . . . . . . . . . Ping Li, Rene Klein Gunnewiek, and Peter de With
381
Scene Gist: A Holistic Generative Model of Natural Image . . . . . . . . . . . . Bolei Zhou and Liqing Zhang A Robust Algorithm for Color Correction between Two Stereo Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qi Wang, Xi Sun, and Zengfu Wang
395
405
Efficient Human Action Detection Using a Transferable Distance Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weilong Yang, Yang Wang, and Greg Mori Crease Detection on Noisy Meshes via Probabilistic Scale Selection . . . . . Tao Luo, Huai-Yu Wu, and Hongbin Zha Improved Uncalibrated View Synthesis by Extended Positioning of Virtual Cameras and Image Quality Optimization . . . . . . . . . . . . . . . . . . . . Fabian Gigengack and Xiaoyi Jiang Region Based Color Image Retrieval Using Curvelet Transform . . . . . . . . Md. Monirul Islam, Dengsheng Zhang, and Guojun Lu Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akitsugu Noguchi and Keiji Yanai Multi-view Texturing of Imprecise Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ehsan Aganj, Pascal Monasse, and Renaud Keriven
417 427
438 448
458 468
Poster Session 2: Segmentation, Detection, Color and Texture Semantic Classification in Aerial Imagery by Integrating Appearance and Height Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefan Kluckner, Thomas Mauthner, Peter M. Roth, and Horst Bischof
477
Real-Time Video Matting Based on Bilayer Segmentation . . . . . . . . . . . . . Viet-Quoc Pham, Keita Takahashi, and Takeshi Naemura
489
Transductive Segmentation of Textured Meshes . . . . . . . . . . . . . . . . . . . . . . Anne-Laure Chauve, Jean-Philippe Pons, Jean-Yves Audibert, and Renaud Keriven
502
Levels of Details for Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . Vincent Garcia, Frank Nielsen, and Richard Nock
514
A Blind Robust Watermarking Scheme Based on ICA and Image Dividing Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuqiang Cao and Weiguo Gong MIFT: A Mirror Reflection Invariant Feature Descriptor . . . . . . . . . . . . . . Xiaojie Guo, Xiaochun Cao, Jiawan Zhang, and Xuewei Li Detection of Vehicle Manufacture Logos Using Contextual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wenting Lu, Honggang Zhang, Kunyan Lan, and Jun Guo
526 536
546
Part-Based Object Detection Using Cascades of Boosted Classifiers . . . . . Xiaozhen Xia, Wuyi Yang, Heping Li, and Shuwu Zhang
556
A Novel Self-created Tree Structure Based Multi-view Face Detection . . . Xu Yang, Xin Yang, and Huilin Xiong
566
Multilinear Nonparametric Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . . Xu Zhang, Xiangqun Zhang, Jian Cao, and Yushu Liu
576
A Harris-Like Scale Invariant Feature Detector . . . . . . . . . . . . . . . . . . . . . . Yinan Yu, Kaiqi Huang, and Tieniu Tan
586
Probabilistic Cascade Random Fields for Man-Made Structure Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songfeng Zheng A Novel System for Robust Text Location and Recognition of Book Covers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhiyuan Zhang, Kaiyue Qi, Kai Chen, Chenxuan Li, Jianbo Chen, and Haibing Guan
596
608
A Multi-scale Bilateral Structure Tensor Based Corner Detector . . . . . . . Lin Zhang, Lei Zhang, and David Zhang
618
Pedestrian Recognition Using Second-Order HOG Feature . . . . . . . . . . . . . Hui Cao, Koichiro Yamaguchi, Takashi Naito, and Yoshiki Ninomiya
628
Fabric Defect Detection and Classification Using Gabor Filters and Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu Zhang, Zhaoyang Lu, and Jing Li Moving Object Segmentation in the H.264 Compressed Domain . . . . . . . . Changfeng Niu and Yushu Liu
635
645
Video Segmentation Using Iterated Graph Cuts Based on Spatio-temporal Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomoyuki Nagahashi, Hironobu Fujiyoshi, and Takeo Kanade
655
Spectral Graph Partitioning Based on a Random Walk Diffusion Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xi Li, Weiming Hu, Zhongfei Zhang, and Yang Liu
667
Iterated Graph Cuts for Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . Bo Peng, Lei Zhang, and Jian Yang Contour Extraction Based on Surround Inhibition and Contour Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuan Li, Jianzhou Zhang, and Ping Jiang
677
687
Confidence-Based Color Modeling for Online Video Segmentation . . . . . . Fan Zhong, Xueying Qin, Jiazhou Chen, Wei Hua, and Qunsheng Peng
697
Multicue Graph Mincut for Image Segmentation . . . . . . . . . . . . . . . . . . . . . Wei Feng, Lei Xie, and Zhi-Qiang Liu
707
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
719
Table of Contents – Part III
Exploiting Intensity Inhomogeneity to Extract Textured Objects from Natural Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jundi Ding, Jialie Shen, HweeHwa Pang, Songcan Chen, and Jingyu Yang
1
Convolutional Virtual Electric Field External Force for Active Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuanquan Wang and Yunde Jia
11
An Effective Segmentation for Noise-Based Image Verification Using Gamma Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ling Cai, Yiren Xu, Lei He, Yuming Zhao, and Xin Yang
21
Refined Exponential Filter with Applications to Image Restoration and Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanlin Geng, Tong Lin, Zhouchen Lin, and Pengwei Hao
33
Color Correction and Compression for Multi-view Video Using H.264 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Boxin Shi, Yangxi Li, Lin Liu, and Chao Xu
43
A Subjective Method for Image Segmentation Evaluation . . . . . . . . . . . . . Qi Wang and Zengfu Wang
53
Real-Time Object Detection with Adaptive Background Model and Margined Sign Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ayaka Yamamoto and Yoshio Iwai
65
A 7-Round Parallel Hardware-Saving Accelerator for Gaussian and DoG Pyramid Construction Part of SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . Jingbang Qiu, Tianci Huang, and Takeshi Ikenaga
75
Weighted Map for Reflectance and Shading Separation Using a Single Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sung-Hsien Hsieh, Chih-Wei Fang, Te-Hsun Wang, Chien-Hung Chu, and Jenn-Jier James Lien Polygonal Light Source Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dirk Schnieders, Kwan-Yee K. Wong, and Zhenwen Dai
85
96
Face Relighting Based on Multi-spectral Quotient Image and Illumination Tensorfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming Shao, Yunhong Wang, and Peijiang Liu Perception-Based Lighting Adjustment of Image Sequences . . . . . . . . . . . . Xiaoyue Jiang, Ping Fan, Ilse Ravyse, Hichem Sahli, Jianguo Huang, Rongchun Zhao, and Yanning Zhang Ultrasound Speckle Reduction via Super Resolution and Nonlinear Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bo Wang, Tian Cao, Yuguo Dai, and Dong C. Liu Background Estimation Based on Device Pixel Structures for Silhouette Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasutomo Kawanishi, Takuya Funatomi, Koh Kakusho, and Michihiko Minoh Local Spatial Co-occurrence for Background Subtraction via Adaptive Binned Kernel Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bineng Zhong, Shaohui Liu, and Hongxun Yao Gable Roof Description by Self-Avoiding Polygon . . . . . . . . . . . . . . . . . . . . Qiongchen Wang, Zhiguo Jiang, Junli Yang, Danpei Zhao, and Zhenwei Shi Tracking Endocardial Boundary and Motion via Graph Cut Distribution Matching and Multiple Model Filtering . . . . . . . . . . . . . . . . . . Kumaradevan Punithakumar, Ismail Ben Ayed, Ali Islam, Ian Ross, and Shuo Li
108
118
130
140
152
162
172
Object Detection with Multiple Motion Models . . . . . . . . . . . . . . . . . . . . . . Zhijie Wang and Hong Zhang
183
An Improved Template Matching Method for Object Detection . . . . . . . . Duc Thanh Nguyen, Wanqing Li, and Philip Ogunbona
193
Poster Session 3: Machine Learning, Recognition, Biometrics and Surveillance Unfolding a Face: From Singular to Manifold . . . . . . . . . . . . . . . . . . . . . . . . Ognjen Arandjelović
203
214
Human Action Recognition Using Non-separable Oriented 3D Dual-Tree Complex Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rashid Minhas, Aryaz Baradarani, Sepideh Seifzadeh, and Q.M. Jonathan Wu
226
Gender from Body: A Biologically-Inspired Approach with Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guodong Guo, Guowang Mu, and Yun Fu
236
Fingerprint Orientation Field Estimation: Model of Primary Ridge for Global Structure and Model of Secondary Ridge for Correction . . . . . . . . Huanxi Liu, Xiaowei Lv, Xiong Li, and Yuncai Liu
246
Gait Recognition Using Procrustes Shape Analysis and Shape Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuanyuan Zhang, Niqing Yang, Wei Li, Xiaojuan Wu, and Qiuqi Ruan Person De-identification in Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prachi Agrawal and P.J. Narayanan A Variant of the Trace Quotient Formulation for Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Wang, Chunhua Shen, Hong Zheng, and Zhang Ren
256
266
277
Adaptive-Scale Robust Estimator Using Distribution Model Fitting . . . . Thanh Trung Ngo, Hajime Nagahara, Ryusuke Sagawa, Yasuhiro Mukaigawa, Masahiko Yachida, and Yasushi Yagi
287
A Scalable Algorithm for Learning a Mahalanobis Distance Metric . . . . . Junae Kim, Chunhua Shen, and Lei Wang
299
Lorentzian Discriminant Projection and Its Applications . . . . . . . . . . . . . . Risheng Liu, Zhixun Su, Zhouchen Lin, and Xiaoyu Hou
311
Learning Bundle Manifold by Double Neighborhood Graphs . . . . . . . . . . . Chun-guang Li, Jun Guo, and Hong-gang Zhang
321
Training Support Vector Machines on Large Sets of Image Data . . . . . . . Ignas Kukenys, Brendan McCane, and Tim Neumegen
331
Learning Logic Rules for Scene Interpretation Based on Markov Logic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mai Xu and Maria Petrou Efficient Classification of Images with Taxonomies . . . . . . . . . . . . . . . . . . . Alexander Binder, Motoaki Kawanabe, and Ulf Brefeld
341
351
Adapting SVM Image Classifiers to Changes in Imaging Conditions Using Incremental SVM: An Application to Car Detection . . . . . . . . . . . . Epifanio Bagarinao, Takio Kurita, Masakatsu Higashikubo, and Hiroaki Inayoshi
363
Incrementally Discovering Object Classes Using Similarity Propagation and Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shengping Xia and Edwin R. Hancock
373
Image Classification Using Probability Higher-Order Local Auto-Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tetsu Matsukawa and Takio Kurita
384
Disparity Estimation in a Layered Image for Reflection Stereo . . . . . . . . . Masao Shimizu, Masatoshi Okutomi, and Wei Jiang
395
Model-Based 3D Object Localization Using Occluding Contours . . . . . . . Kenichi Maruyama, Yoshihiro Kawai, and Fumiaki Tomita
406
A Probabilistic Model for Correspondence Problems Using Random Walks with Restart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tae Hoon Kim, Kyoung Mu Lee, and Sang Uk Lee
416
Highly-Automatic MI Based Multiple 2D/3D Image Registration Using Self-initialized Geodesic Feature Correspondences . . . . . . . . . . . . . . . . . . . . Hongwei Zheng, Ioan Cleju, and Dietmar Saupe
426
Better Correspondence by Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shufei Fan, Rupert Brooks, and Frank P. Ferrie
436
Image Content Based Curve Matching Using HMCD Descriptor . . . . . . . . Zhiheng Wang, Hongmin Liu, and Fuchao Wu
448
Skeleton Graph Matching Based on Critical Points Using Path Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yao Xu, Bo Wang, Wenyu Liu, and Xiang Bai
456
A Statistical-Structural Constraint Model for Cartoon Face Wrinkle Representation and Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ping Wei, Yuehu Liu, Nanning Zheng, and Yang Yang
466
Spatially Varying Regularization of Image Sequences Super-Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yaozu An, Yao Lu, and Zhengang Zhai
475
Image Search Result Summarization with Informative Priors . . . . . . . . . . Rui Liu, Linjun Yang, and Xian-Sheng Hua
485
Interactive Super-Resolution through Neighbor Embedding . . . . . . . . . . . . Jian Pu, Junping Zhang, Peihong Guo, and Xiaoru Yuan
496
Scalable Image Retrieval Based on Feature Forest . . . . . . . . . . . . . . . . . . . . Jinlong Song, Yong Ma, Fuqiao Hu, Yuming Zhao, and Shihong Lao Super-Resolution of Multiple Moving 3D Objects with Pixel-Based Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takuma Yamaguchi, Hiroshi Kawasaki, Ryo Furukawa, and Toshihiro Nakayama Human Action Recognition Using Pyramid Vocabulary Tree . . . . . . . . . . . Chunfeng Yuan, Xi Li, Weiming Hu, and Hanzi Wang Auto-scaled Incremental Tensor Subspace Learning for Region Based Rate Control Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Zhang, Sabu Emmanuel, Yanning Zhang, and Xuan Jing
506
516
527
538
Visual Focus of Attention Recognition in the Ambient Kitchen . . . . . . . . . Ligeng Dong, Huijun Di, Linmi Tao, Guangyou Xu, and Patrick Oliver
548
Polymorphous Facial Trait Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ping-Han Lee, Gee-Sern Hsu, and Yi-Ping Hung
560
Face Recognition by Estimating Facial Distinctive Information Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bangyou Da, Nong Sang, and Chi Li
570
Robust 3D Face Recognition Based on Rejection and Adaptive Region Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaoli Li and Feipeng Da
581
Face Recognition via AAM and Multi-features Fusion on Riemannian Manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hongwen Huo and Jufu Feng
591
Gender Recognition via Locality Preserving Tensor Analysis on Face Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huining Qiu, Wan-quan Liu, and Jian-Huang Lai
601
A Chromosome Image Recognition Method Based on Subregions . . . . . . . Toru Abe, Chieko Hamada, and Tetsuo Kinoshita
611
Co-occurrence Random Forests for Object Localization and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-Wu Chu and Tyng-Luh Liu
621
Solving Multilabel MRFs Using Incremental α-Expansion on the GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vibhav Vineet and P.J. Narayanan
633
Non-rigid Shape Matching Using Geometry and Photometry . . . . . . . . . . Nicolas Thorstensen and Renaud Keriven
644
Beyond Pairwise Shape Similarity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Kontschieder, Michael Donoser, and Horst Bischof
655
Globally Optimal Spatio-temporal Reconstruction from Cluttered Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ehsan Aganj, Jean-Philippe Pons, and Renaud Keriven
667
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
679
Multiple View Reconstruction of a Quadric of Revolution from Its Occluding Contours
Pierre Gurdjos, Vincent Charvillat, Géraldine Morin, and Jérôme Guénard
IRIT-ENSEEIHT, Toulouse, France
{pgurdjos,charvi,morin,jguenard}@enseeiht.fr
Abstract. The problem of reconstructing a quadric from its occluding contours is one of the earliest problems in computer vision, e.g., see [1,2,3]. It is known that three contours from three views are required for this problem to be well-posed, while Cross et al. have proved in [4] that, with only two contours, what can be obtained is a 1D linear family of solutions in the dual projective space. In this work, we describe a multiple view algorithm that unambiguously reconstructs so-called Prolate Quadrics of Revolution (PQoR’s, see text), given at least two finite projective cameras (see terminology in [5, p157]). In particular, we show how to obtain a closed-form solution. The key result on which this work is based is a dual parameterization of a PQoR, using a 7-dof ‘linear combination’ of the quadric dual to the principal focus-pair and the Dual Absolute Quadric (DAQ). One of the contributions is to prove that the images of the principal foci of a PQoR can be recovered set-wise from the images of the PQoR and the DAQ. The performance of the proposed algorithm is illustrated on simulations and experiments with real images.
1 Introduction
The now well-established maturity of 3D reconstruction paradigms in computer vision is due in part to the contribution of projective geometry, which allowed in particular to design stratified reconstruction strategies and to understand the link between camera calibration and the Euclidean structure of a projective 2- or 3-space [5]. More recently, projective geometry also threw light on some earlier vision problems like the 3D reconstruction of quadratic surfaces, quadratic curves or surfaces of revolution, from one or multiple views [4,6,7,8,9,10]. The image of a general 9-dof quadric is a 5-dof conic which is usually referred to as the occluding contour or outline of the quadric. In 1998, Cross et al. described in [4] a linear triangulation scheme for a quadric from its outlines using dual space geometry. The important result in [4] was that, like linear triangulation for points, the proposed scheme works for both finite and general projective cameras, according to whether the camera maps a 3D Euclidean or a projective world to pixels (see terminology in [5, p157]). Nevertheless, a key difference is that triangulation for a quadric requires at least 3 outlines from 3 views, whereas with only 2 outlines an ambiguity remains, i.e., a 1D linear family of solutions is found.
In this work, we investigate Quadrics of Revolution (QoR’s), which are 7-dof quadrics generated by revolving a conic about one of its symmetry axes. We describe a multiple view algorithm that unambiguously reconstructs a Prolate Quadric of Revolution (PQoR) given its occluding contours from 2 or more views taken by known finite projective cameras. The term PQoR refers to any QoR whose revolution axis is the axis through the two real foci of the revolved conic, called the principal foci of the PQoR. Although the term ‘prolate’ usually applies only to ellipsoids of revolution, PQoR’s here also include hyperboloids of two sheets of revolution. The problem is stated using dual space geometry and our contributions can be summarized as follows.

– We describe a 7-dof parameterization of a PQoR via a ‘linear combination’ of two dual quadrics: the quadric dual to the principal focus-pair of the PQoR and the dual absolute quadric.
– We prove that the images of the principal foci can be recovered set-wise from the image of the PQoR and the image of the dual absolute quadric.

The quadric dual to the principal focus-pair is a (degenerate) rank-2 quadric envelope consisting of the two (real) principal foci of the PQoR. The dual absolute quadric (DAQ [5, p84]) is also a (degenerate) rank-3 dual quadric, denoted Q∗∞, which can be regarded as the (plane-)envelope of the absolute conic (AC) in 3-space. Its image coincides with the image of the dual absolute conic, denoted ω∗.

Prior works. In the earliest works, Ferry et al. [3] reconstructed a known QoR from one outline and Ma [1] reconstructed a general ellipsoid given three outlines. Cross et al. [4] reconstructed quadrics given three outlines from three views or two outlines with one matched point, while Shashua et al. [9] require one outline and four matched points. Eventually, Wijewickrema et al. [7] described a two-step algorithm for reconstructing spheres given two outlines from two views. To our knowledge, there is no existing work that describes how to reconstruct a quadric of revolution from two or more views using a linear method.

Notations. All vectors or matrices are homogeneous, i.e., they represent points or homographies in nD projective space, hence their lengths or orders are n + 1. We will basically follow the same notations as in [5]. For any square matrix A, eig(A) refers to the set of eigenvalues of the matrix A and eig(A, B) refers to the set of generalized eigenvalues of the matrix-pair (A, B). Finite projective cameras are described by 3 × 4 matrices $P_i = K_i R_i [\, I \mid t_i \,]$, where $R_i$ and $t_i$ relate the orientation and position of camera i to some Euclidean 3D-world coordinate system. The matrix $K_i$ is the calibration matrix (of intrinsics), which has the general form given in [5, Eq. (6.10), p157].
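To fix the camera notation with a concrete example, the short Python/NumPy sketch below (NumPy is used for all code sketches in this text) builds a finite projective camera. The helper name `finite_projective_camera` and every numeric value are assumptions made for illustration, not data from the paper.

```python
import numpy as np

def finite_projective_camera(K, R, t):
    """Finite projective camera P = K R [I | t] (a 3x4 matrix)."""
    return K @ R @ np.hstack([np.eye(3), t.reshape(3, 1)])

# Hypothetical intrinsics and pose (illustrative values only).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                   # camera aligned with the world frame
t = np.array([0.0, 0.0, 5.0])   # world origin lies 5 units in front of the camera
P = finite_projective_camera(K, R, t)

X = np.array([0.0, 0.0, 0.0, 1.0])   # world origin, homogeneous 4-vector
x = P @ X                            # homogeneous image point
x = x / x[2]                         # -> (320, 240, 1): the principal point, as expected
```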
1.1 Background
Projective quadrics. In order-n projective space Pn, a quadric is the locus of points X ∈ Pn satisfying a quadratic equation $X^\top Q X = 0$ for some order-(n + 1)
symmetric matrix Q, called quadric matrix. If n = 2, a quadric is called a conic. In dual projective space Pn∗, a quadric is the envelope of hyperplanes π ∈ P∗n satisfying a quadratic equation $\pi^\top Q^* \pi = 0$, for some ‘dual’ order-(n + 1) symmetric matrix Q∗. A proper quadric is a self-dual surface, comprising both a quadric locus and a quadric envelope [11, p267], as Q ∼ (Q∗)−1 if the hyperplanes of Q∗ are exactly the tangent hyperplanes of Q. In this work, to be consistent with [5, p73], we will also use the term ‘dual quadric’ when referring to a quadric envelope. Under any homography H, a quadric locus Q maps to $Q' \sim H^{-\top} Q H^{-1}$, while the envelope Q∗ maps to $Q^{*\prime} \sim H Q^* H^\top$. The action of a camera P maps a quadric Q to a conic C on the image plane [5, p201], and this is derived in dual form by $C^* \sim P Q^* P^\top$. The rank of a quadric (in locus or envelope form) is the rank of its matrix and a rank-deficient quadric is said to be degenerate. What is important to remember is that a degenerate quadric envelope Q∗ of rank 2 consists of a pair of points (Y, Z), as it can be written as $Q^* = Y Z^\top + Z Y^\top$. We will often refer to the envelope Q∗ as the quadric dual to the point-pair (Y, Z).

Signature of projective quadrics. A projective quadric (in locus or envelope form), whose matrix is real, has a different type according to its signature. The signature is defined as (ξ1, ξ2), where ξ1 and ξ2 are the following functions of its matrix eigenvalues: ξ1 = max(ρ, ν) and ξ2 = min(ρ, ν), in which ρ and ν respectively count the positive and negative eigenvalues. Thanks to Sylvester’s law of inertia [12, p. 403], the signature is projectively invariant in the sense of being the same in any projective representation. In particular, the signature of any rank-2 quadric envelope Q∗ is

$$(\xi_1, \xi_2) = \begin{cases} (1,1) & \text{if } Q^* \text{ consists of a pair of distinct real points,} \\ (2,0) & \text{if } Q^* \text{ consists of a pair of conjugate complex points.} \end{cases} \qquad (1)$$

Note that ξ1 + ξ2 = rank Q∗, so the rank is also projectively invariant.
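The rank-2 envelope and the signature test of Eq. (1) are easy to check numerically. The sketch below is purely illustrative: the point-pair is an arbitrary example and `signature` is a helper name introduced here, not taken from the paper.

```python
import numpy as np

def signature(M, tol=1e-9):
    """Projective signature (xi1, xi2) of a real symmetric matrix:
    the counts of positive and negative eigenvalues, ordered as (max, min)."""
    w = np.linalg.eigvalsh(M)
    pos, neg = int(np.sum(w > tol)), int(np.sum(w < -tol))
    return max(pos, neg), min(pos, neg)

# Rank-2 quadric envelope dual to a pair of distinct real points Y, Z.
Y = np.array([1.0, 0.0, 0.0, 1.0])    # point (1, 0, 0)
Z = np.array([-1.0, 0.0, 0.0, 1.0])   # point (-1, 0, 0)
Q_star = np.outer(Y, Z) + np.outer(Z, Y)

print(np.linalg.matrix_rank(Q_star))  # 2: the envelope is degenerate
print(signature(Q_star))              # (1, 1), as predicted by Eq. (1)
```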
2 Parameterization and Reconstruction of a Prolate Quadric of Revolution (PQoR)
2.1 A Dual Parameterization of a PQoR
We describe here the proposed parameterization of a PQoR whose matrix is denoted Q in the sequel. Proposition 1. Let F and G be the 4-vectors of two distinct real points in 3-space and let Q∗∞ be the matrix of the DAQ. The set of PQoR’s having F and G as principal foci can be represented, in envelope form, by 4 × 4 generic symmetric matrices Q∗ (u) = uX∗ + (1 − u)Q∗∞ ,
(2)
where u ∈ R is a free variable and X∗ = FG⊤ + GF⊤ is the quadric dual to the principal focus-pair (F, G). This proposition actually makes use of a known result of projective geometry, so its proof can be found elsewhere (e.g., see [11,13]). Just one remark about it: the set of matrices of the above proposition represents a 1D linear family of quadrics in envelope form, such a family being called a range of quadrics¹ in [11, p335], which is the dual of a pencil of quadrics. Our key idea is to write Q, the PQoR to be reconstructed, in dual form as Q∗ = X∗ − x0 Q∗∞,
(3)
where x0 is a nonzero scalar that we will call the projective parameter of Q∗.
A minimal parameterization. It is well known that the DAQ is fixed under similarities and has the canonical form Q∗∞ = diag(1, 1, 1, 0) in any Euclidean representation [5, p84]. Hence, the matrix form (3) is a dual Euclidean 7-dof parameterization of a general PQoR, since the rank-2 matrix X∗ has six dof's and the scalar x0 has one. This is consistent with existing Euclidean 'locus' parameterizations, e.g., the one given in [14].
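As a small numerical illustration of the parameterization (3) (a sketch of ours; the foci F, G and the value of x0 below are arbitrary, not taken from the paper), one can build X∗ and Q∗ and check their ranks:

import numpy as np

# Two distinct real principal foci, in homogeneous coordinates (assumed values).
F = np.array([1.0, 0.0, 0.0, 1.0])
G = np.array([3.0, 0.0, 0.0, 1.0])

# Quadric dual to the principal focus-pair (rank 2).
X_star = np.outer(F, G) + np.outer(G, F)

# Dual absolute quadric in a Euclidean frame (rank 3).
Q_inf_star = np.diag([1.0, 1.0, 1.0, 0.0])

# Dual parameterization (3) of a PQoR, for some nonzero projective parameter x0.
x0 = 2.0
Q_star = X_star - x0 * Q_inf_star

print(np.linalg.matrix_rank(X_star))      # 2
print(np.linalg.matrix_rank(Q_inf_star))  # 3
print(np.linalg.matrix_rank(Q_star))      # 4 for a generic x0 (proper quadric)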
2.2 A Dual Parameterization of the Image of a PQoR
Any range of quadrics projects to a range of conics (i.e., to a 1D linear family of conic envelopes) whose members are the images of the quadrics of the range. Next is a proposition that can be seen as a corollary of Proposition 1. Corollary 1. The image of the range of quadrics (2) is the range of conics represented by 3 × 3 generic symmetric matrices C∗(v) = vY∗ + (1 − v)ω∗,
(4)
where v ∈ R is a free variable and Y∗ = fg⊤ + gf⊤,
(5)
is the conic dual to the images f, g of the principal foci F, G, and ω∗ = KK⊤ is the dual image of the AC.
¹ A necessary and sufficient condition for a quadric range to consist entirely of envelopes of confocal quadrics is that it includes Q∗∞ [11, p335]. A quadric range satisfying this condition is called a confocal range.
There is no difficulty with the proof of this corollary, which is hence omitted. Why is the corollary important? The range of conics (4), image of the quadric range (2), is a 1D linear family of envelopes including C∗, i.e., the dual occluding contour of the PQoR. As any conic range, it can be spanned by any two of its members, in particular by C∗ and ω∗, so we can also refer to it as C∗ − yω∗. The conic range C∗ − yω∗ includes three degenerate members with singular matrices C∗ − λi ω∗, i ∈ {1..3}, among which is the dual conic (5). Now our key result: Proposition 2 claims that the envelope (5) can be uniquely identified among the three degenerate members of C∗ − yω∗, thanks to its (projectively invariant) signature.
Proposition 2. Given the image ω of the AC and the occluding contour C of a PQoR, the images {f, g} of the real principal foci of the PQoR can be set-wise recovered, i.e., the degenerate dual conic (5) can be uniquely recovered.
Proof. The conic range C∗ − yω∗ includes three degenerate envelopes with singular matrices C∗ − λi ω∗, where λi is a generalized eigenvalue of the matrix-pair (C∗, ω∗), i.e., a root of the characteristic equation det(C∗ − yω∗) = 0. Hence, there exists λi0 such that Y∗ ∼ C∗ − λi0 ω∗. Since Y∗ has signature (1, 1), cf. (1), we now just show that the two other degenerate members have signatures (2, 0). On the one hand, it is known that the generalized eigenvalues of (C∗, ω∗) are also the (ordinary) eigenvalues of the matrix C∗ω, where ω = (ω∗)−1. Furthermore, if H is an order-3 nonsingular transformation, then C∗ω and H(C∗ω)H−1 are similar in the sense of having the same set of eigenvalues. Consequently, eig(C∗, ω∗) = eig(C∗ω) = eig(HC∗ωH−1) = eig(HC∗(H⊤H−⊤)ωH−1) = eig(HC∗H⊤, Hω∗H⊤). This entails that the set of the projective parameters of the degenerate members of C∗ − yω∗ is the same as that of HC∗H⊤ − yHω∗H⊤, as it remains invariant, as a set, under any homography H. This allows us to carry on the proof using the most convenient projective representation. On the other hand, there exists a homography H of the image plane such that
HC∗H⊤ = diag(a, b, c) = D and Hω∗H⊤ = diag(1, 1, 1) = I, (6)
where a, b, c ∈ R \ {0}, assuming, without loss of generality, that a > b > c. To be convinced, if VDV⊤ = C̄∗ is the eigen-decomposition (V is orthogonal) of the 'calibrated' occluding contour C̄∗ = K−1C∗K−⊤, then H = V⊤K−1 is such a homography. The projective parameters λi of the degenerate envelopes C∗ − λi ω∗ can be formally computed as the generalized eigenvalues of (D, I) (since ω̄∗ = I represents the calibrated dual image of the AC, see (6)), i.e., as the ordinary
eigenvalues of D. Clearly these are a, b, c. Then it is straightforward to compute, in this order, the matrices Ȳ∗i = D − λi ω̄∗ and their (ordinary) eigenvalues:
Ȳ∗1 = diag(a − c, b − c, 0), eig(Ȳ∗1) = {a − c, b − c, 0}, (ξ1, ξ2) = (2, 0),
Ȳ∗2 = diag(a − b, 0, c − b), eig(Ȳ∗2) = {a − b, c − b, 0}, (ξ1, ξ2) = (1, 1),
Ȳ∗3 = diag(0, b − a, c − a), eig(Ȳ∗3) = {b − a, c − a, 0}, (ξ1, ξ2) = (2, 0).
Since the signatures are projectively invariant, the proof is ended.
The proof actually showed that Y∗, the conic dual to the image of the principal focus-pair as given in (5), is the only degenerate member of C∗ − yω∗ with signature (1, 1), cf. (1). Algorithm 1 details all the steps based on the proof.
Algorithm 1. Input: C, the PQoR's occluding contour; K, the intrinsics. Output: r, the image of the revolution axis; fg⊤ + gf⊤, the conic dual to the images of the principal foci.
1. 'Calibrate' the occluding contour: C̄∗ = K−1C∗K−⊤
2. Compute the eigenvalues and eigenvectors {λi, Vi}i=1..3 of C̄∗
3. Find the (unique) matrix C̄∗ − λi0 I having signature (1, 1), with i0 ∈ {1..3}
4. The image of the revolution axis is r = K−⊤Vi0
5. The dual conic fg⊤ + gf⊤ is the 'uncalibrated' dual conic K(C̄∗ − λi0 I)K⊤
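A minimal numpy sketch of Algorithm 1 could look as follows (our own illustration, not the authors' code). It assumes the dual occluding contour C∗ (e.g., the adjoint or inverse of the conic matrix C) and the calibration matrix K are given as 3 × 3 arrays; the symmetry enforcement, the tolerance in the signature test and the line-coordinate convention in step 4 are our own numerical choices.

import numpy as np

def signature(M, tol=1e-9):
    w = np.linalg.eigvalsh(M)
    s = max(np.abs(w).max(), 1.0) * tol
    rho, nu = int(np.sum(w > s)), int(np.sum(w < -s))
    return max(rho, nu), min(rho, nu)

def algorithm1(C_star, K):
    """Recover the image r of the revolution axis and the dual conic fg^T + gf^T."""
    K_inv = np.linalg.inv(K)
    # Step 1: 'calibrate' the dual occluding contour.
    C_bar = K_inv @ C_star @ K_inv.T
    C_bar = 0.5 * (C_bar + C_bar.T)          # keep it numerically symmetric
    # Step 2: eigenvalues / eigenvectors of the calibrated dual conic.
    lam, V = np.linalg.eigh(C_bar)
    # Step 3: the unique degenerate member C_bar - lam_i * I with signature (1, 1).
    i0 = next(i for i in range(3)
              if signature(C_bar - lam[i] * np.eye(3)) == (1, 1))
    # Step 4: image of the revolution axis (in line coordinates).
    r = K_inv.T @ V[:, i0]
    # Step 5: 'uncalibrate' the rank-2 dual conic.
    Y_star = K @ (C_bar - lam[i0] * np.eye(3)) @ K.T
    return r, Y_star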
3 Reconstructing a PQoR from Its Occluding Contours
3.1 Case of Two Calibrated Views
Fitting a PQoR to 3D planes. Let C be the occluding contour of Q. The back-projection in 3-space of any image line l tangent to C at a point p is the 3D plane π = P⊤l [5, p197], defined by the camera center and the pole-polar relation l = Cp. This plane belongs to the envelope of Q, i.e., π⊤Q∗π = 0, as shown in Fig. 1 (a). Hence, using the parameterization (3), the following constraint is satisfied: π⊤X∗π = x0(π⊤Q∗∞π).
(7)
Constraint of the revolution axis. A direct consequence of Proposition 2 is that the image of the revolution axis can be uniquely recovered. The revolution axis is the 3D line passing through F and G in space. It projects to the line r passing through their images f and g, whose back-projection in 3-space is the plane ϕ = P⊤r, as shown in Fig. 1 (b). Since the plane ϕ passes through the camera center and the revolution axis, the following constraint holds: X∗ϕ = 04.
(8)
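As a quick numerical sanity check of constraints (7) and (8) (a self-contained sketch with arbitrarily chosen foci, projective parameter and planes; no camera is involved, only the dual parameterization (3)), one can verify that a tangent plane of the PQoR satisfies (7) and that a plane containing the revolution axis satisfies (8):

import numpy as np

F = np.array([1.0, 0.0, 0.0, 1.0])       # assumed principal foci
G = np.array([3.0, 0.0, 0.0, 1.0])
x0 = 2.0                                  # assumed projective parameter

X_star = np.outer(F, G) + np.outer(G, F)
Q_inf = np.diag([1.0, 1.0, 1.0, 0.0])
Q_star = X_star - x0 * Q_inf              # dual PQoR, eq. (3)

# A tangent plane: solve pi(t)^T Q* pi(t) = 0 along a pencil of planes.
p0 = np.array([1.0, 0.5, -0.3, 1.0])
p1 = np.array([0.2, -1.0, 0.7, 0.5])
a = p1 @ Q_star @ p1
b = 2.0 * (p0 @ Q_star @ p1)
c = p0 @ Q_star @ p0
t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
pi = p0 + t * p1                          # tangent plane of the PQoR

print(np.isclose(pi @ X_star @ pi, x0 * (pi @ Q_inf @ pi)))   # eq. (7) -> True

# A plane containing the revolution axis: it passes through both F and G.
phi = np.array([0.0, 1.0, 1.0, 0.0])      # phi.F = 0 and phi.G = 0
print(np.allclose(X_star @ phi, 0.0))                          # eq. (8) -> True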
Fig. 1. (a) The plane passing through the camera center and tangent to the occluding contour is tangent to the PQoR. (b) The plane passing through the camera center and the images f , g of the principal foci F, G passes through the axis of revolution.
The basic reconstruction equation system. Assume that a PQoR Q is seen by two calibrated views and that the occluding contours Ci of Q are given in each view i ∈ {1, 2}. Thanks to Proposition 2, we also have at our disposal the images of the principal foci of Q, 'pair-wise packaged' in a rank-2 quadric envelope Y∗i. Let lij, j = 1..2, be two image lines tangent to Ci and denote by ri the image of the revolution axis, i.e., of the line through the images of the principal foci. Hence, for each view, we have two scalar equations (7) and one vector equation (8), which yields as a whole ten independent linear equations on the unknown 11-vector X = (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)⊤,
(9)
provided we write (the symbol '∗' denotes the corresponding symmetric element)
X∗ = ⎡ x1 x2 x3 x4 ⎤
     ⎢ ∗  x5 x6 x7 ⎥
     ⎢ ∗  ∗  x8 x9 ⎥
     ⎣ ∗  ∗  ∗  x10 ⎦ . (10)
We can rewrite the set of equations (7), (8) in the matrix form
⎡ A1 ⎤
⎣ A2 ⎦ X = 010, (11)
where Ai is the 5 × 11 data matrix associated with view i, written using the notation πji = (aij, bij, cij, dij)⊤ for the back-projected tangent planes and ϕi = (ϕi1, ϕi2, ϕi3, ϕi4)⊤ for the back-projected axis plane. Its first two rows contain the coefficients, in the unknowns (9), of the two scalar equations (7) obtained from π1i and π2i; these coefficients are the quadratic monomials (aij)², 2aij bij, 2aij cij, 2aij dij, (bij)², 2bij cij, 2bij dij, (cij)², 2cij dij, (dij)² of πji⊤X∗πji, together with the term (aij)² + (bij)² + (cij)² = πji⊤Q∗∞πji multiplying x0. Its last three rows contain the coefficients of three independent scalar equations taken from the vector equation (8) obtained from ϕi. (12)
System (11) is the basic reconstruction equation system. The last three rows of Ai correspond to constraint (8), substituting ϕi = Pi⊤ri for ϕ. They fix 6 out of the 10 dof's of X in (9). Note that, for i = 1..2, these constraints ensure that rank X∗ = 2. The first two rows of Ai correspond to two constraints (7), substituting πji = Pi⊤lij for π with j = 1..2; they fix the remaining 4 dof's. System (11) is a consistent system, i.e., the stacked matrix of A1 and A2 has rank 10, so a non-trivial exact solution for X (e.g., under the constraint ‖X‖ = 1) exists. In other words, the problem of the reconstruction of a quadric of revolution from its occluding contours in two calibrated views is well-posed.
3.2 Case of Multiple Calibrated Views
In practice, multiple calibrated views, say n ≥ 2, can be available and the system (11) then becomes DX ≈ 05n, where D is a 5n × 11 design matrix that stacks all the Ai's, i = 1..n, and the operator '≈' expresses the fact that the data matrices Ai are generally perturbed by noise. Numerically speaking, we seek a closed-form solution to the total least-squares problem minX ‖DX‖² s.t. ‖X‖ = 1. Algorithm 2 details all the steps for constructing the design matrix D. The solution can be taken as the singular vector of D associated with the smallest singular value [5, p593]. Note that, in this case, we ignore the constraint rank X∗ = 2.
Algorithm 2. Input: {Pi}, the finite projective cameras; {Ci}, the occluding contours; {Ki}, the intrinsics. Output: D, the design matrix for reconstructing the PQoR.
For each view i:
1. Pick two points pij of Ci, with j ∈ {1..2}
2. Let lij = Ci pij be the tangent lines at pij
3. Let πji = Pi⊤lij be the back-projections of lij
4. Using Algorithm 1, recover ri, the image of the revolution axis
5. Let ϕi = Pi⊤ri be the back-projection of ri
6. Stack the block-matrix Ai, as defined in (12), into the design matrix D
Finally:
7. Solve minX ‖DX‖² s.t. ‖X‖ = 1
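The final solve is a standard homogeneous least-squares problem. The sketch below (our own illustration; it assumes the 5n × 11 design matrix D has already been stacked as in Algorithm 2, with the unknowns ordered as in (9)–(10)) recovers X by SVD and rebuilds X∗, x0 and the dual quadric Q∗ of (3):

import numpy as np

def solve_pqor(D):
    """Total least-squares solution of min ||D X||^2 s.t. ||X|| = 1.

    D is the 5n x 11 design matrix stacking the blocks A^i; the unknown
    X = (x0, x1, ..., x10) collects the projective parameter x0 and the
    10 distinct entries of the symmetric 4x4 matrix X* (up to scale and sign)."""
    _, _, Vt = np.linalg.svd(D)
    X = Vt[-1]                      # right singular vector of the smallest singular value
    x0 = X[0]
    x = X[1:]
    X_star = np.array([[x[0], x[1], x[2], x[3]],
                       [x[1], x[4], x[5], x[6]],
                       [x[2], x[5], x[7], x[8]],
                       [x[3], x[6], x[8], x[9]]])
    Q_inf = np.diag([1.0, 1.0, 1.0, 0.0])
    Q_star = X_star - x0 * Q_inf    # dual PQoR, eq. (3)
    return Q_star, X_star, x0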
4 Results
4.1 Synthetic Data
In this section, we assess the performance of the proposed algorithm by carrying out comparison tests with Cross’ algorithm described in [15] using synthetic data. First, we remind the reader that only the proposed method can reconstruct a PQoR from two views. Cross’ algorithm returns in such a minimal case a 1D linear family of solutions i.e., a quadric range of solutions.
In the first experiment, simulations were carried out using a set of three calibrated cameras with varying positions and orientations. The cameras are randomly and approximately placed at a constant distance from the scene origin. Each camera fixates a random point located in a sphere of varying radius centred at the origin. Ellipsoids of revolution are randomly created with their centers varying within this sphere. We compute the exact images of the ellipsoid in all views and then add Gaussian noise to the points of the obtained occluding contours, with zero mean and standard deviation σ = n% of the major axis of the image, where n varies between 0 and 5. We compute different types of errors between the exact and the reconstructed quadrics: relative errors on the minor and the major axes, the absolute errors between the centers, errors on the angles between the directions of the axes of rotation and, finally, errors on the volumes. For each level of n, 500 independent trials were performed using our proposed approach and that of Cross [15]. The results are presented in Fig. 2.
Fig. 2. The proposed algorithm is compared to Cross' one (see text): errors on the semi-minor axis, the semi-major axis, the angle of the axis of rotation, the centre and the volume are plotted against the noise level. Three cameras are used for this experiment (Cross' algorithm requires at least three cameras).
We can note that the errors increase linearly with the noise level. We can also see that our method clearly performs better than Cross' method. In the second experiment, we investigate the importance of the number of views for the reconstruction. We run the same experiments but now vary the number of cameras from 2 to 8. All results are averages of 500 independent trials using Gaussian noise with σ = 2% of the major axis of the image of the quadric (Fig. 3). The results for Cross' approach start from three views only.
Fig. 3. Comparison tests (see text). Algorithms are compared with respect to the number of cameras.
Fig. 4. Reconstruction of a bunch of grapes (see text).
Owing to the rapidly increasing number of constraints, this second experiment shows that the errors decrease exponentially as the number of views increases. We can again note that our method is clearly the most accurate.
4.2 Real Data
An application perfectly suited to illustrating our work is the reconstruction of objects like fruits and, in particular, bunches of grapes, where the grapes are modelled by prolate ellipsoids of revolution. We used two calibrated cameras looking at the bunch with an angle of approximately 30 degrees; the obtained result is shown in Fig. 4.
5 Conclusions
In this work, we describe a multiple-view algorithm that unambiguously reconstructs so-called prolate quadrics of revolution, given at least two finite projective cameras. In particular, we show how to obtain a closed-form solution. The key result was to describe how to recover the conic dual to the images of the principal foci of the quadric, provided the calibration matrix is known. Although not described in this paper, it is possible to recover the two points individually, even if it is not possible to distinguish one from the other. Future work is to compute the epipolar geometry of the scene, i.e., the essential matrix, from the images of the principal foci in the two views, using a RANSAC-like algorithm, hence only requiring the calibration matrix for reconstructing PQoR's.
References
1. Ma, S.D., Li, L.: Ellipsoid reconstruction from three perspective views. In: ICPR, Washington, DC, USA, p. 344. IEEE Computer Society, Los Alamitos (1996)
2. Karl, W.C., Verghese, G.C., Willsky, A.S.: Reconstructing ellipsoids from projections. CVGIP 56(2), 124–139 (1994)
3. Ferri, M., Mangili, F., Viano, G.: Projective pose estimation of linear and quadratic primitives in monocular computer vision. CVGIP 58(1), 66–84 (1993)
4. Cross, G., Zisserman, A.: Quadric reconstruction from dual-space geometry. In: ICCV, Washington, DC, USA, p. 25. IEEE Computer Society, Los Alamitos (1998)
5. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge Univ. Press, Cambridge (2003)
6. Colombo, C., Del Bimbo, A., Pernici, F.: Metric 3D reconstruction and texture acquisition of surfaces of revolution from a single uncalibrated view. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 99–114 (2005)
7. Wijewickrema, S.N.R., Paplinski, A.P., Esson, C.E.: Reconstruction of spheres using occluding contours from stereo images. In: ICPR, Washington, DC, USA, pp. 151–154. IEEE Computer Society, Los Alamitos (2006)
8. Zhang, H., Wong, K.-Y.K., Zhang, G.: Camera calibration from images of spheres. IEEE Trans. Pattern Anal. Mach. Intell. 29(3), 499–502 (2007)
9. Shashua, A., Toelg, S.: The quadric reference surface: Theory and applications. Int. J. Comput. Vision 23(2), 185–198 (1997)
10. Wong, K.-Y.K., Mendonça, P.R.S., Cipolla, R.: Camera calibration from surfaces of revolution. IEEE Trans. Pattern Anal. Mach. Intell. 25(2), 147–161 (2003)
11. Semple, J.G., Kneebone, G.T.: Algebraic Projective Geometry. Oxford University Press, Oxford (1952), reprinted 1998
12. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
13. Sommerville, D.: Analytical Geometry of Three Dimensions. Cambridge University Press, Cambridge (1939)
14. Goldman, R.: Quadrics of revolution. IEEE Comput. Graph. Appl. 3(2), 68–76 (1983)
15. Cross, G., Zisserman, A.: Surface reconstruction from multiple views using apparent contours and surface texture, pp. 25–47 (2000)
Robust Focal Length Estimation by Voting in Multi-view Scene Reconstruction

Martin Bujnak¹, Zuzana Kukelova², and Tomas Pajdla²

¹ Bzovicka 24, 85107, Bratislava, Slovakia
² Center for Machine Perception, Czech Technical University in Prague
Abstract. We propose a new robust focal length estimation method in multi-view structure from motion from unordered data sets, e.g. downloaded from the Flickr database, where jpeg-exif headers are often incorrect or missing. The method is based on a combination of RANSAC with weighted kernel voting and can use any algorithm for estimating epipolar geometry and unknown focal lengths. We demonstrate by experiments with synthetic and real data that the method produces reliable focal length estimates which are better than estimates obtained using RANSAC or kernel voting alone and which are in most real situations very close to the ground truth. An important feature of this method is the ability to detect image pairs close to critical configurations or the cases when the focal length can’t be reliably estimated. Keywords: focal length, epipolar geometry, 3D reconstruction.
1 Introduction
Estimating the focal length of an unknown camera is an important computer vision problem with applications mainly in 3D reconstruction. Previously, uncalibrated cameras were used to create a projective 3D reconstruction of the observed scene which was then upgraded to a metric one by enforcing camera properties [8]. Another approach was to first calibrate cameras and then register cameras directly in Euclidean space. This was shown to produce better results even for large scale datasets [23,18,17,24,19]. Efficient solvers, e.g. the 5-pt relative pose solver for calibrated cameras [21,26], also helped in developing such Structure from Motion (SFM) pipelines. An interesting open problem appears with modern digital cameras when the internal parameters [8] except for the focal length are known. Sometimes, it is possible to extract focal lengths from the jpeg-exif headers. This was often done in the above mentioned SFM pipelines [23,18,17,24]. Unfortunately, many images downloaded from photo-sharing websites do not contain jpeg-exif headers, or listed focal lengths are not correct due to image editing. A number of algorithms for simultaneous estimation of camera motion and focal length have been invented: the 7-pt or 8-pt algorithm for uncalibrated
This work has been supported by EC project FP7-SPACE-218814 PRoVisG and by Czech Government under the research program MSM6840770038.
cameras [8] followed by the extraction of two focal lengths from the fundamental matrix [7,2,14], or by the extraction of one focal length common to both cameras [27], or by the extraction of one focal length assuming that the second focal length is known [30]; the 6-pt algorithm for cameras with unknown but equal focal lengths [25,16,15]; and the 6-pt algorithm for one unknown and one known focal length [3]. Although these algorithms are well understood and fast, they are rarely used in SFM pipelines. This is mainly for the two following reasons.
First, all the above-mentioned algorithms suffer from critical configurations, e.g. when the optical axes of the cameras are parallel or intersecting [13], or if the scene is planar. In these situations, it is not possible to compute the focal lengths because there exist many Euclidean interpretations of the images. Secondly, every image is usually matched with many different images and therefore one obtains several (often many) candidates for the estimated camera focal length. Mostly these focal lengths are different and one can't select the best one easily. Selecting the focal length with the largest number of inliers, or selecting the median or mean focal length, does not always produce satisfactory results since the estimated geometries may be wrong.
In this paper we propose a new multi-view method for robust focal length estimation based on a combination of RANSAC with weighted kernel voting. Our method can use any focal length extraction algorithm (6-pt, 7-pt, etc.). We follow the paradigm proposed in [16] where a simple kernel voting method was successfully used for estimating focal lengths by the 6-pt algorithm. This method draws 6-tuples of corresponding points, estimates the unknown focal lengths and stores them in a vector. Kernel voting is used to smooth the data and to select the best focal length after several trials. A combination of the kernel voting method with the RANSAC paradigm was used in [22,28,29] to estimate epipoles (resp. camera translations). Work [22] introduced the idea of splitting the epipolar geometry estimation into first estimating the translation (epipole) and then the rest, plus the global uncertainty of the epipolar geometry. A data-driven sampling was used to estimate translation candidates. The best model was then selected in a secondary sampling process initialized by the translation candidates. In [28,29], votes were not cast directly by each sampled epipolar geometry but by the best epipolar geometries recovered by ordered sampling of PROSAC [11]. The PROSAC with 5000 cycles was run 50 times and its results were collected by the kernel voting. This led to up to 25000 samples but usually terminated much sooner. Here we use a more complex, hybrid sampling strategy, which turns out to be more efficient than the approach of [28,29].
In our method, statistics are collected either directly inside a RANSAC loop or in a separate sampling process executed on an inlier set returned by a robust RANSAC estimator like DEGENSAC [10]. This is followed by kernel voting weighted using weights derived from the number of inliers of each particular vote. All reliable votes are accumulated in camera accumulators and contribute to the camera focal length estimation. Finally, camera focal lengths are obtained by kernel voting on votes obtained from all pairwise matchings.
2 Problems in Focal Length Estimation
It is known that RANSAC [5], RANSAC voting [28,29] and standard kernel voting [16] produce good and reliable estimates of focal lengths for a single image pair in a general configuration and under small contamination by outliers and noise. However, problems occur when we have image pairs close to critical configurations, degenerate scenes, higher numbers of outliers and large noise, or when we need to select the camera focal length from several candidates obtained by matching one image with many different images. Next we describe each of these issues in more detail, show how they affect existing methods and propose some solutions which lead to our new method for robust focal length estimation.
2.1 Outliers
RANSAC is robust to outliers since after sufficiently many cycles we get at least one "outlier-free" sample which results in a model with the greatest support. The correctness of the best model is, however, not guaranteed. Large contamination by outliers causes major problems in the standard kernel voting method, see Figure 1 (b). This is because the probability of drawing a good sample dramatically decreases with an increasing number of outliers. Even increasing the number of voting cycles does not solve the problem, since false peaks remain or new ones appear, see Figure 1 (c). On the other hand, a model estimated from an outlier-contaminated sample usually does not have high support. Therefore we weight the vote generated by a sample by the number of inliers supporting the model of the sample. This reduces the influence of outliers and the false peaks disappear, Figure 1 (d).
Fig. 1. (a) Standard kernel voting with 500 trials on a general outlier-free scene. Results for a scene with 40% of outliers and (b) 500 trials, resp. (c) 5000 trials. (d) Kernel voting weighted by the number of inliers, 5000 trials and 40% of outliers. The results for the left focal length are in blue, for the right focal length in red, and the ground truth focal lengths are displayed as cyan vertical lines.
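A rough sketch of the inlier-weighted voting described above (a hedged illustration using a simple Gaussian kernel density estimate; the bandwidth, the synthetic votes and their supports are our own choices, not values from the paper):

import numpy as np

def weighted_kde(votes, weights, grid, bandwidth=0.1):
    """Weighted Gaussian kernel density estimate of focal-length votes."""
    votes = np.asarray(votes, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    diff = (grid[:, None] - votes[None, :]) / bandwidth
    return (w[None, :] * np.exp(-0.5 * diff ** 2)).sum(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

# Focal-length votes (in image-width units) and their supports (numbers of inliers).
votes = [1.1, 1.05, 1.2, 3.0, 1.15, 0.4]     # two votes come from outlier-contaminated samples
weights = [80, 75, 60, 5, 70, 3]             # models from bad samples have low support

grid = np.linspace(0.0, 6.0, 601)
density = weighted_kde(votes, weights, grid)
print(grid[np.argmax(density)])              # peak close to the consistent votes (around 1.1)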
2.2 Noise
Kernel voting as well as RANSAC are immune to contamination by small noise. However, for higher noise levels both methods may deliver wrong focal length estimates. In RANSAC it is not possible to use the size of the support to determine
if the estimated focal length is reliable or not. For example, critical configurations may result in wrong epipolar geometries with large supports. Hence, other methods need to be used to measure the reliability of the result [8]. Kernel voting, on the other hand, provides information about the reliability of the estimated result: either it produces the result as a dominant peak, or the noise level is too high and no dominant peak appears, which serves as a certificate that the camera pair is not reliable. Based on these observations, we incorporate into our method a detection of the cases when the focal length can't be reliably estimated, e.g. due to large noise contamination. We use kernel voting, and the estimated focal length is considered reliable only if the highest peak is sufficiently higher than the second highest peak.
2.3 Critical Configurations
It is known that critical configurations cause major problems in focal length estimation [13]. If a critical configuration appears, it is not possible to estimate the epipolar geometry and focal length because there exist several (an infinite number of) Euclidean interpretations of the structure and camera parameters. Hence we need to detect and reject camera pairs in critical configurations. Unfortunately, in real situations many critical configurations can't be easily detected. When the camera pair is near a critical configuration which can't be easily detected, the estimated focal lengths are almost random and the support is usually high. Therefore, RANSAC often returns some result with high support which is, however, far from the ground truth value. This can be seen in Figure 2 (a) and (b), which shows boxplots of focal lengths obtained by 1000 runs of DEGENSAC [10], where we extracted focal lengths using the Bougnoux equations [2] inside the DEGENSAC loop. In each run of DEGENSAC the real focal length with the highest support was returned. Figure 2 (a) shows results for a real scene where the camera optical axes were almost intersecting. This is a critical configuration for a pair of cameras with constant or varying focal lengths [13]. Because in this case the configuration was not perfectly critical, i.e. the principal points did not match perfectly, DEGENSAC always returned some focal lengths and epipolar geometry with a good support. However, the focal lengths were wrong.
Fig. 2. Real scene in close to critical (a,c) and non-critical (b,d) configurations. (a,b) show boxplots from 1000 runs of DEGENSAC algorithm with focal length extractions. (c,d) show results of weighted kernel voting. Cyan lines are the ground truth values.
Unfortunately, it is not possible to determine whether the estimated focal length is correct from one result of DEGENSAC. This is also not completely clear by comparing the results of multiple runs of DEGENSAC, as can be seen from Figure 2. Here the variations of the focal lengths estimated from 1000 runs of DEGENSAC are very similar for a scene close to a critical configuration (Figure 2 (a)) and for a non-critical configuration (Figure 2 (b)). Again, this is not a problem for kernel voting, as can be seen from Figure 2 (c) and (d), where the results for the same sequences and the weighted kernel voting on the data collected during a single execution of DEGENSAC are shown. The additional peaks in Figure 2 (c) are a result of the model instability near the critical motion. The plot also looks crisp, since many votes were dropped due to the detected epipolar geometry degeneracy or because the extracted focal lengths were complex. On the other hand, the result for the general scene (Figure 2 (d)) is nicely smooth with only a single peak. Therefore it is meaningful to consider the estimated focal length reliable only if the highest peak is sufficiently higher and more consistent than the remaining data.
2.4 Degenerate Scenes
Degenerate scenes produce results with high support but usually with incorrect focal lengths. For example, in scenes with a dominant planar structure it often happens that all points from the sample lie on the plane. The epipolar geometry is then degenerate, but all points on the plane match this epipolar geometry perfectly [8], and the standard as well as the weighted kernel voting and RANSAC fail to estimate the correct focal length. Therefore we combine our kernel voting method with degeneracy tests. Note that in scenes containing dominant planes we can't use the number of inliers as weights, since degenerate focal lengths have high support on points from the plane. Therefore we use weights estimated only from the points off the plane. Figure 3 (left) shows the result of the standard kernel voting without a test on planar scene degeneracy. The right plot in Figure 3 shows the result of our kernel voting where the planarity is taken into account.
Fig. 3. The kernel voting on a scene with a dominant plane with only 10% of the points off the plane. Standard kernel voting (left), proposed algorithm with dominant plane detection (right). The left (right) focal length is blue (red), ground truth is cyan.
2.5 Multiple Focal Length Candidates
It often happens that we have several candidates for a camera focal length, obtained by running RANSAC several times for one image pair or by running RANSAC for several image pairs with common cameras. Mostly these focal lengths are different, as can be seen in Figure 2 (b), and it is difficult to select the correct one. Strategies like selecting the focal length with the largest number of inliers, selecting the median or mean focal length, or running standard kernel voting on results from RANSAC [28,29] do not always produce satisfactory results. To solve this problem we collect reliable candidate focal lengths with their weights for each camera pair, respectively for each run of the RANSAC. Then we use weighted kernel voting to select the best focal length from these candidates.
3 The Robust Method for Focal Length Estimation
Unlike previous works [21,22,28,29], we execute a single RANSAC algorithm and then postprocess the obtained inliers. The idea is the following: if we executed a RANSAC-based algorithm on all (N choose 7) 7-tuples chosen out of N tentative matches, then we would obtain all maximal inlier sets. In general, each of the maximal inlier sets can be obtained from many different sampled 7-tuples. Each 7-tuple generating a maximal inlier set may, however, result in a different epipolar geometry and a different focal length. For reliable estimates, the distribution of these focal lengths should have a clear dominant peak. To speed up the process, we run RANSAC only once to obtain an inlier set. Then, we study the distribution of the focal lengths which result from 7-tuples sampled from and generating this inlier set (or a similarly sized subset of it). In this way one can determine whether the estimated focal lengths are reliable and also select the "best" focal length as the value corresponding to the highest peak in the distribution. Our weighted kernel voting algorithm is a cascade consisting of four phases. A block diagram of the algorithm is presented in Figure 4 and the pseudo code of the algorithm can be found in [4].
Fig. 4. Block diagram of the weighted kernel voting algorithm. See text for description.
3.1 Phase 1 - Matches Selection
The main goal of the first phase is to achieve computational efficiency by quickly rejecting easy mismatches and thus not wasting time and effort in the next phase. We run DEGENSAC [10], which returns a set of matches in which the proportion of mismatches is greatly reduced and most of the correct matches are preserved. In other words, the decision process of the first phase generates a negligible number of falsely rejected good matches (false negatives) but a non-negligible number of correctly rejected false matches (true negatives). It is important to use DEGENSAC [10] or a similar algorithm which is capable of detecting panoramas and purely planar scene configurations and of obtaining inliers which are not affected by the presence of dominant planes. Panoramas and planar scenes are rejected.
3.2 Phase 2 - Votes Collecting
The second phase of our algorithm is used to collect "focal length votes". Each vote, i.e. each estimated focal length, is weighted by the support of the epipolar geometry corresponding to the focal length. The higher the support of the model, the higher the weight of the estimated focal length. It is important to filter degenerate models since they usually have good supports but incorrect focal lengths. The algorithm tries to collect N votes (non-degenerate epipolar geometries with their focal lengths) in less than M (M > N) trials. Since the input data are already inliers, we cannot use the ordinary statistics developed for RANSAC to estimate M, because it would be too small. In our experiments we set N = 50 and M = 100. We rejected a camera pair if it was not possible to collect 50 non-degenerate votes in 100 trials, or in other words if V/C < 0.5, where V ≤ N is the number of collected non-degenerate votes in C ≤ M trials. Note that this phase is as computationally expensive as at most M additional RANSAC cycles. For M = 100 this amounts to tens of milliseconds.
Clusters. To avoid computation of degenerate epipolar geometries we divide all matches into several clusters. We distinguish planar clusters and the remaining data (the "Zero cluster"). Each planar cluster represents a set of points lying on a non-negligible plane. Clusters smaller than five points and all remaining matches are stored in the Zero cluster. The algorithm starts by putting all matches into the Zero cluster. Then, the clusters corresponding to planes in the scene are automatically created during the algorithm runtime, as will be explained next.
Model calculation. Computation of epipolar geometries is done using a small (often minimal) number of point correspondences required to estimate the model. Correspondences are drawn from different clusters to avoid the selection of points lying in one plane. Since the Zero cluster contains points in general position, we also allow sampling all correspondences from this cluster.
Various solvers can be used to calculate fundamental matrices and focal lengths. It is better to use information about the cameras whenever available, since this yields more stable parameter estimation [8,25,15,27,3,30].
Degeneracy tests. Several degeneracy tests are executed to avoid voting by degenerate samples/models. First, models with focal lengths that are outside a reasonable interval are ignored. These may be products of too noisy data or mismatches. Similarly, votes resulting from cameras with intersecting optical axes, i.e. (0, 0, 1)F(0, 0, 1)⊤ = 0, are rejected. For the plane degeneracy test we use the test developed in DEGENSAC [10]. If at least 6 points are on the plane, or the sample was drawn from the Zero cluster and 5 points are on the plane, then we create a new cluster. First, a plane is calculated from the 5 or 6 points. Then, points on the plane are removed from the clusters and a new cluster with the on-plane points is created. Finally, clusters with fewer than 5 points are relabeled to the Zero cluster. Although the plane-degenerate samples are ignored, each such sample creates a new cluster with points lying on the plane. Since samples are drawn from different clusters, the probability of sampling a new plane-degenerate sample gradually decreases as more and more dominant planes are removed from the Zero cluster.
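Two of these tests can be written down concretely. The sketch below (our own illustration; the focal-length interval, the tolerance and the fundamental matrix in the usage line are made up, not values from the paper) shows the focal-range check and the intersecting-optical-axes check:

import numpy as np

def passes_basic_degeneracy_tests(F, f1, f2, f_range=(0.2, 10.0), axis_tol=1e-6):
    """Reject votes with implausible focal lengths or (nearly) intersecting optical axes.

    F is the 3x3 fundamental matrix; f1, f2 are the extracted focal lengths
    (here in image-width units). The interval and tolerance are guesses."""
    for f in (f1, f2):
        if np.iscomplexobj(f) and abs(np.imag(f)) > 0:
            return False                       # complex focal length -> degenerate
        if not (f_range[0] <= float(np.real(f)) <= f_range[1]):
            return False                       # focal length out of range
    e3 = np.array([0.0, 0.0, 1.0])
    if abs(e3 @ F @ e3) < axis_tol * np.abs(F).max():
        return False                           # e3^T F e3 ~ 0: optical axes (nearly) intersect
    return True

F = np.array([[0.0, -1.0, 2.0],
              [1.0,  0.0, -3.0],
              [-2.0, 3.0, 0.5]])
print(passes_basic_degeneracy_tests(F, 1.2, 0.9))   # True for this made-up F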
3.3 Phase 3 - Votes Analysis
After the votes are collected, the algorithm determines whether the estimated focal lengths are consistent and reliable by analyzing the collected data. First, if a camera pair was close to some critical configuration [13], then almost all votes were rejected by the degeneracy tests (see above) and hence the number of trials C required for obtaining V votes was high. If the fraction of non-degenerate votes over the number of trials is small, i.e. V/C < 0.5, then we reject such a camera pair. Next, the weighted kernel method, with weights estimated from the support of each focal length, is used to estimate the kernel density approximation of the probability density function of the collected focal lengths. If the distribution produces a dominant peak, i.e. the highest peak is at least 20% above the remaining data, we extract the focal length as the argument of its maximum. Otherwise we ignore the camera pair. We consider the estimated focal length reliable only if both these criteria are met.
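The two reliability criteria might be sketched as follows (our own illustration; it takes an already-computed density curve as input, and interpreting the 20% rule as "highest local peak at least 1.2 times the second highest" is one possible reading of the text, not necessarily the authors' exact implementation):

import numpy as np

def pair_passes_vote_ratio(num_votes, num_trials):
    """Reject the camera pair when the fraction of non-degenerate votes is small (V/C < 0.5)."""
    return num_trials > 0 and num_votes / num_trials >= 0.5

def dominant_peak(grid, density, margin=1.2):
    """Return (focal, reliable): the argmax of the density, and whether the highest
    local peak exceeds the second highest one by the given margin."""
    d = np.asarray(density, dtype=float)
    is_peak = (d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])
    peaks = np.sort(d[1:-1][is_peak])[::-1]
    if peaks.size == 0:
        return None, False
    reliable = peaks.size == 1 or peaks[0] >= margin * peaks[1]
    return grid[np.argmax(d)], reliable

# Example: a density with one strong peak and one weak spurious peak.
grid = np.linspace(20.0, 150.0, 131)
density = np.exp(-0.5 * ((grid - 35.0) / 3.0) ** 2) + 0.3 * np.exp(-0.5 * ((grid - 90.0) / 3.0) ** 2)
print(pair_passes_vote_ratio(60, 100))    # V/C test passes
print(dominant_peak(grid, density))       # dominant peak at 35, reliable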
3.4 Phase 4 - Multi-view Voting
For the multi-view voting process we create an accumulator for each camera, where the results from the camera pairwise estimations are collected. Each accumulator is a vector covering the range from 20mm to 150mm with 1mm tessellation. Given the result of a pairwise estimation, we first analyze whether the result is reliable. We do this using the two criteria described in Section 3.3. If both these conditions are satisfied, then we add the votes with their weights to the camera accumulator,
otherwise we reject the camera pair. After all the data are collected, we run the final kernel voting on the accumulator data.
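The per-camera accumulator can be sketched as a weighted histogram over the stated range (a rough illustration of ours; the smoothing at the end stands in for the final kernel voting, and the bandwidth and the vote values are invented):

import numpy as np

class FocalAccumulator:
    """Per-camera accumulator over 20-150 mm with 1 mm tessellation (a sketch)."""
    def __init__(self, fmin=20, fmax=150):
        self.fmin, self.fmax = fmin, fmax
        self.bins = np.zeros(fmax - fmin + 1)

    def add_votes(self, focals_mm, weights):
        for f, w in zip(focals_mm, weights):
            if self.fmin <= f <= self.fmax:
                self.bins[int(round(f)) - self.fmin] += w

    def final_focal(self, bandwidth_mm=2.0):
        # Final kernel voting: smooth the accumulator and take the argmax.
        x = np.arange(-3 * bandwidth_mm, 3 * bandwidth_mm + 1)
        kernel = np.exp(-0.5 * (x / bandwidth_mm) ** 2)
        smoothed = np.convolve(self.bins, kernel, mode="same")
        return self.fmin + int(np.argmax(smoothed))

# Votes from two reliable image pairs sharing this camera (made-up numbers).
acc = FocalAccumulator()
acc.add_votes([35, 36, 34], [50, 40, 45])
acc.add_votes([35, 37], [60, 20])
print(acc.final_focal())   # -> around 35 mm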
4 Experiments
4.1 Synthetic Data Set
We study the performance of the method on synthetically generated ground-truth general 3D scenes as well as on scenes with dominant planes. Scenes were generated as random points on a plane or in a 3D cube, depending on the testing configuration, or using a combination of both to get a planar scene with minor 3D structure. Each 3D point was projected by several cameras, where each camera orientation, position and focal length was selected randomly. Gaussian noise with a standard deviation σ was added to each image point.
Noise-free data set. The behavior of the standard kernel voting on noise-free general 3D scenes was already studied for the 6-pt algorithm with equal focal lengths in [16]. The results are similar for the 7-pt algorithm followed by a focal length extraction. There is no reason for this algorithm to fail. The behavior on planar scenes and for cameras near a critical configuration is different. Omitting the degeneracy test causes the standard kernel voting to fail completely. This is shown in Figure 3. Since our algorithm samples points from different clusters, i.e. points from different planes, it rarely tests a 7-tuple of points lying on the plane. If it happens, i.e. when a planar sample is drawn from the Zero cluster, the degeneracy test detects it and a new cluster is created. Adding outliers to the data does not affect the result, since outlying votes are weak due to small support and hence weight. This is shown in Figure 1.
Data affected by noise. It was demonstrated in [16] that the kernel voting is able to pick values close to the ground truth value even for data affected by noise. In our experiments we fixed the focal length of the first camera to 35mm and generated 1000 random scene setups as described above. For each setup we executed 100 cycles of voting. We did the same for each selected noise level. Figure 5 (a) summarizes the results and shows that focal lengths estimated using our kernel voting method are accurate.
Fig. 5. Deviation of the estimated focal length of the first camera using the proposed voting approach in a general scene (a,b) and a scene with a dominant plane (c,d). Individual camera pairs are displayed in (a,c), grouped votes from 5 pairs in (b,d).
Figure 5 (b) shows the results where we generated six cameras in each scene setup. Then the votes from all five camera pairs between the first and the i-th camera were used to vote for the focal length of the first camera. The obtained estimates are even more accurate. Next, we repeated the above tests for a scene where 80% of all points are on a plane. The results are summarized in Figure 5 (c,d). It can be seen that the results for planar scenes are slightly less accurate than the results for general scenes (a,b). This may be caused by the fact that it is harder to fit a good model to such data due to the smaller amount of good matches. Adding outliers to the tests did not affect the results too much. This is because the RANSAC and the weighting with model support inside the voting algorithm can cope with outliers if a sufficient number of trials is executed. We omit these results here, since they look similar to the ones obtained for outlier-free scenes.
4.2 Real Data Set
To evaluate our voting approach on real data we downloaded 2500 images from the Flickr [6] database using the "Di trevi" keywords. In every such image we extracted SURF [1] feature points and descriptors. Tentative correspondences between each two calibrated images were obtained as points where the best descriptor dominates by 20% over the second best descriptor [20]. Then we used the DEGENSAC [10] algorithm to estimate the inlying correspondences and 100 cycles of our voting algorithm to analyze the quality of the estimated geometry of the pair. Each reliable geometry (see Section 3) was then added to the camera accumulators. We created accumulators with the range from 20mm to 200mm with one millimeter tessellation. From the 2500 images we found only 240 images where the focal length could be extracted from the jpeg-exif headers. About 130 of them were either showing something different or could not be matched, i.e. the number of correct tentative correspondences was less than 30. The algorithm marked about 30 images as unreliable. The jpeg-exif focal lengths of the remaining 80 images were compared with the results of our algorithm, see Figure 6 (top left). It can be seen that the estimated focal lengths (red dots) are in most cases very close to the focal lengths extracted from the jpeg-exif headers (green crosses). Examples of votes coming from individual camera pairs are shown in Figure 6 (bottom). The top right plot in the figure shows the final accumulator after applying KDE. The result of our method is displayed in red, standard kernel voting on inliers in black, and vertical lines represent the jpeg-exif focal length (cyan), the mean (green), the median (red) and the result with max support (blue) from several DEGENSAC runs. Figure 2 shows the results of standard DEGENSAC (a,b) and our method (c,d) for real images taken with a known camera in a close-to-critical configuration (a,c) and a general configuration (b,d). As can be seen, DEGENSAC many times returned inaccurate estimates even for the general scene. However, our method was able to detect the scene close to the critical configuration and to estimate focal lengths close to the ground truth value for general configurations.
Fig. 6. Estimated focal lengths (red dots) with ground truth values (green crosses) extracted from the jpeg-exif (top-left). Distribution of votes for selected camera (bottom) and result for final multi-view voting (top-right). Our method is displayed in red, standard kernel voting on inliers in black. Vertical lines represent jpeg-exif focal(cyan), mean(green), median(red) and result with max support(blue) from DEGENSAC.
5 Conclusion
We have proposed a new, fast, multi-view method for robust focal length estimation. This method can be used with any focal length extraction algorithm (e.g. 6-pt, 7-pt, etc.), combines the RANSAC paradigm with weighted kernel voting using weights derived from the number of inliers, and contains detection of planar scenes and some critical configurations, thanks to which it can detect "bad" pairs and handle dominant planes. The method produces reliable focal length estimates which are better than estimates obtained using plain RANSAC or kernel voting, and which are in most real situations very close to the ground truth values. It is useful in SFM, especially from unordered data sets downloaded from the Internet.
References
1. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-Up Robust Features (SURF). CVIU 110, 346–359 (2008)
2. Bougnoux, S.: From projective to Euclidean space under any practical situation, a criticism of self-calibration. In: ICCV (1998)
3. Bujnak, M., Kukelova, Z., Pajdla, T.: 3D reconstruction from image collections with a single known focal length. In: ICCV (2009)
4. Bujnak, M., Kukelova, Z., Pajdla, T.: Robust focal length estimation by voting in multi-view scene reconstruction. Research Report CTU-CMP-2009-09 (2009)
5. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM 24(6), 381–395 (1981)
6. Flickr, http://www.flickr.com/
7. Hartley, R.: Estimation of relative camera positions for uncalibrated cameras. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 579–587. Springer, Heidelberg (1992)
8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)
9. Chum, O., Matas, J., Kittler, J.: Locally Optimized RANSAC. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 236–243. Springer, Heidelberg (2003)
10. Chum, O., Werner, T., Matas, J.: Two-View Geometry Estimation Unaffected by a Dominant Plane. In: CVPR 2005, pp. 772–779 (2005)
11. Chum, O., Matas, J.: Matching with PROSAC - Progressive Sample Consensus. In: CVPR 2005 (2005)
12. Jones, M.C., Marron, J.S., Sheather, S.J.: A brief survey of bandwidth selection for density estimation. J. Amer. Stat. Assoc. 91(433), 401–407 (1996)
13. Kahl, F., Triggs, B.: Critical Motions in Euclidean Structure from Motion. In: CVPR 1999, pp. 23–66 (1999)
14. Kanatani, K., Matsunaga, C.: Closed-form expression for focal lengths from the fundamental matrix. In: ACCV 2000, Taipei, Taiwan, vol. 1, pp. 128–133 (2000)
15. Kukelova, Z., Bujnak, M., Pajdla, T.: Polynomial eigenvalue solutions to the 5-pt and 6-pt relative pose problems. In: BMVC 2008 (2008)
16. Li, H.: A simple solution to the six-point two-view focal-length problem. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 200–213. Springer, Heidelberg (2006)
17. Li, X., Wu, C., Zach, C., Lazebnik, S., Frahm, J.: Modeling and recognition of landmark image collections using iconic scene graphs. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 427–440. Springer, Heidelberg (2008)
18. Martinec, D., Pajdla, T.: Robust Rotation and Translation Estimation in Multiview Reconstruction. In: CVPR (2007)
19. Microsoft PhotoSynth, http://www.photosynth.net
20. Muja, M., Lowe, D.: Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. University of British Columbia (2008) (preprint)
21. Nister, D.: An efficient solution to the five-point relative pose. IEEE PAMI 26(6), 756–770 (2004)
22. Nister, D., Engels, C.: Visually Estimated Motion of Vehicle-Mounted Cameras with Global Uncertainty. In: SPIE Defense and Security Symposium, Unmanned Systems Technology VIII (April 2006)
23. Snavely, N., Seitz, S.M., Szeliski, R.: Photo Tourism: Exploring image collections in 3D. In: SIGGRAPH 2006, pp. 835–846 (2006)
24. Snavely, N., Seitz, S., Szeliski, R.: Skeletal graphs for efficient structure from motion. In: CVPR 2008 (2008)
25. Stewenius, H., Nister, D., Kahl, F., Schaffalitzky, F.: A minimal solution for relative pose with unknown focal length. In: CVPR 2005, pp. 789–794 (2005)
26. Stewenius, H., Engels, C., Nister, D.: Recent developments on direct relative orientation. ISPRS J. of Photogrammetry and Remote Sensing 60, 284–294 (2006)
27. Sturm, P.: On Focal Length Calibration from Two Views. In: CVPR 2001 (2001)
28. Torii, A., Havlena, M., Pajdla, T., Leibe, B.: Measuring Camera Translation by the Dominant Apical Angle. In: CVPR 2008, Anchorage, Alaska, USA (2008)
29. Torii, A., Pajdla, T.: Omnidirectional camera motion estimation. In: VISAPP 2008 (2008)
30. Urbanek, M., Horaud, R., Sturm, P.: Combining Off- and On-line Calibration of a Digital Camera. In: Third Int. Conf. on 3-D Digital Imaging and Modeling (2001)
Support Aggregation via Non-linear Diffusion with Disparity-Dependent Support-Weights for Stereo Matching

Kuk-Jin Yoon¹, Yekeun Jeong², and In So Kweon²

¹ Computer Vision Lab., Dept. Information and Communications, GIST, Korea
² Robotics and Computer Vision Lab., KAIST, Korea
Abstract. In stereo matching, homogeneous areas, depth discontinuity areas, and occluded areas need more attention. Many methods try to handle pixels in homogeneous areas by propagating supports. As a result, pixels in homogeneous areas get assigned disparities inferred from the disparities of neighboring pixels. However, at the same time, pixels in depth discontinuity areas get supports from different depths and/or from occluded pixels, and the resultant disparity maps tend to be blurred. To resolve this problem, we propose a non-linear diffusion-based support aggregation method. Supports are iteratively aggregated with the support-weights, while adjusting the support-weights according to disparities to prevent incorrect supports from different depths and/or occluded pixels. As a result, the proposed method yields good results not only in homogeneous areas but also in depth discontinuity areas as the iteration goes on, without critical degradation of performance.
1 Introduction
Stereo matching has been a long-lasting research topic in the computer vision community, since it is the crux of many classical computer vision problems such as motion estimation, object tracking, object recognition, 3D structure reconstruction, etc. Nowadays, more interest also comes from practical applications including robot navigation and image synthesis in computer graphics. Unfortunately, stereo matching is hard to solve because the mapping from a three-dimensional scene to two-dimensional images can be inverted in multiple ways. Moreover, as pointed out in [1], there is an inherent image ambiguity that results from the ambiguous local appearances of image points, and image noise, occlusion, sharp depth discontinuities, and specular reflection complicate the stereo matching. To resolve the stereo matching problem, many methods have been proposed for decades. Feature-based methods match only a few points suitable for matching [2] while filtering out ambiguous points. As a result, feature-based methods yield sparse disparity maps.
This work was supported by National Research Foundation of Korea Grant funded by the Korean Government (2009-0065038), and KOCCA of MCST, Korea, under the CT R&D Program 2009.
On the other hand, area-based methods try to yield a dense disparity map while handling the point ambiguity using some constraints. They can be roughly divided into two categories according to the disparity selection method, as in [3]: local methods and global methods. Local methods [4, 5, 6] select the disparities of image pixels locally by using the winner-take-all (WTA) method. They typically use some kind of statistical correlation among color or intensity patterns in local support areas to deal with the ambiguity. Unlike the local methods, global methods [7,8,9,10,11] seek a disparity surface minimizing a global cost function defined by making an explicit smoothness assumption. However, they mainly focus on how to efficiently minimize conventional matching costs, in spite of the fact that lower-cost solutions do not always correspond to better performance, as pointed out in [12]. Therefore, it is also important to properly define and measure the point similarity for matching costs.
1.1 Focus of Our Work
Area-based methods commonly perform subsets of the following steps. The first step is computing the per-pixel matching cost for each considered disparity, locally or over a certain area of supports. This measures the (dis-)similarity of two locations across images. Common pixel-based matching costs are the squared intensity difference (SD) and the absolute intensity difference (AD). After the per-pixel raw matching cost computation, supports are aggregated locally by summing over a support area or by diffusion, in order to increase robustness and to decrease image ambiguity. Typical methods include the sum of absolute differences (SAD) and the sum of squared differences (SSD), which have different types of error functions. Some matching costs such as normalized cross-correlation and non-parametric measures [13,14] are defined over a certain area of supports. These can be regarded as the combination of the per-pixel raw matching cost computation and the aggregation step. After the support aggregation, stereo methods find the best match among all possible candidates under consideration based on the aggregated supports. Local methods choose the best disparity at each pixel independently by selecting the disparity corresponding to the minimum aggregated matching cost. On the other hand, global methods find a disparity surface by performing full global optimization using non-linear diffusion [9], graph cuts [8, 15, 16], and belief propagation [10], based on specified models with a smoothness constraint.
Some recent stereo methods have achieved impressive results with powerful optimization techniques. However, as pointed out in [12] and [17], it is also important to compute cost functions well. Actually, once we have reliable matching costs, we do not need any sophisticated optimization that is time-consuming and requires many user-specified parameters to be tuned. In this sense, we focus on the support aggregation among the stages generally performed in area-based methods, which is highly related to the matching cost computation.
The simplest method for support aggregation is probably applying a fixed window containing the pixel of interest. Here, the performance of this approach is highly dependent on the shape/size of a support window because the
Fig. 1. Behavior of conventional support aggregation methods. This figure is from [3]. See [3] for details.
local support area for a pixel determines which neighboring pixels contribute to the averaging and to what extent. Ideally, the local support area should include all and only those neighboring pixels that come from the same depth. However, because we do not know the depths of pixels beforehand, it is very difficult to find an optimal window for each pixel. In this context, we proposed a new method in [6], which is based on adaptive support-weights. However, we would need very large local support windows to deal with homogeneous areas, because we do not know the sizes of homogeneous regions beforehand — the size of local support windows should be larger than those of the homogeneous regions — and, therefore, huge memory and computation power are required. Instead of using a local support window, the support aggregation can be achieved via diffusion. Here, the problem with this approach is that the size of the support area increases as the iteration goes on. More aggregation (a larger number of iterations) clearly helps to recover textureless areas. However, at the same time, too much aggregation causes errors near depth discontinuities. This phenomenon is shown in Fig. 1. Therefore, the number of iterations and some local stopping criteria are very important factors in this approach, just as the size/shape of the local support window is in the window-based approach.
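To make the conventional baseline concrete, the sketch below implements the fixed-window aggregation discussed above: truncated absolute-difference costs averaged over a square window. This is an illustration of the approach the paper contrasts with, not the proposed method; the function name, window radius, and truncation value are our own illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fixed_window_cost_volume(left, right, max_disp, radius=2, trunc=60.0):
    """Illustrative fixed-window baseline: truncated AD costs box-filtered
    over a (2*radius+1)^2 window (averaging does not change the WTA winner)."""
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    h, w = left.shape[:2]
    cost = np.full((max_disp + 1, h, w), trunc)      # out-of-range pixels keep the maximum cost
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:] - right[:, :w - d])
        if diff.ndim == 3:                           # sum absolute differences over color bands
            diff = diff.sum(axis=2)
        cost[d, :, d:] = np.minimum(diff, trunc)     # truncated absolute difference
    agg = np.stack([uniform_filter(cost[d], size=2 * radius + 1)
                    for d in range(max_disp + 1)])
    return agg                                       # shape: (max_disp + 1, h, w)
```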
1.2 Outline of the Proposed Method
In this work, we propose a new non-linear diffusion method for support aggregation, aiming to yield good results not only in homogeneous areas but also in depth discontinuity areas. In fact, the non-linear diffusion method proposed in this work can be thought of as a combination of the matching cost computation and the support aggregation. The proposed method is essentially based on disparity-dependent support-weights. Supports are iteratively aggregated with the support-weights, while the support-weights are adjusted according to disparities to prevent ambiguous supports from neighboring pixels. This work can be thought of as an extension of our previous works [6,17,18], which consider the reference and the target images together in the support
aggregation to reduce the ambiguity. In fact, considering both images simultaneously is very important for reducing the effect of outliers such as pixels from different depths or half-occluded pixels. To prevent erroneous supports coming from different depths, most previous works make use of color or intensity gradients, assuming that depth discontinuities coincide with color or intensity edges. This assumption is generally valid, but previous methods consider only the reference image, which cannot prevent supports from occluded pixels, as shown in Fig. 2. Even though the pixels in A and B have the same color, we have to prevent the supports coming from B to A, because the pixels in B are not visible in the target image. However, this is not possible in previous methods without complicated additional processing. Lastly, let us emphasize that we do not claim that the method we present yields the best results among all stereo methods, because we focus only on the support aggregation. Nevertheless, we are convinced that support aggregation with disparity-dependent support-weights is helpful for improving the performance of stereo methods. In addition, it is worth noting that the proposed aggregation method is independent of the other stages, so it can be combined with any sophisticated disparity selection method.
2 Modeling of Support-Weights
We first need to build a model for support-weight assignment. Needless to say, the support from a neighboring pixel is valid only when the neighboring pixel is from the same depth as the pixel under consideration and is not occluded (i.e., it is visible in both images). Therefore, the support-weight of a neighboring pixel should be proportional to the probability that it is from the same depth and visible in both images. When p is the pixel under consideration and q is a pixel in the support area N_p of p, a simple model can be expressed as

$w(p, q) \propto \Pr(O_{pq}, O^v_q) = \Pr(O_{pq}) \Pr(O^v_q)$,   (1)

when the two events O_{pq} and O^v_q are assumed to be independent. Here, w(p, q) represents the support-weight of q, and O_{pq} represents the event that p and q are from the same depth.
Fig. 2. Depth discontinuity area. Supports from different depths and occluded pixels in B to pixels in A may cause false matches.
Fig. 3. p and q, and p̄_d and q̄_d, in the reference (left) image and in the target (right) image, respectively
O^v_q is the event that q is visible in both images. Here, we introduce the binary function v(q) representing the visibility of q as

$v(q) = \begin{cases} 1 & \text{if } q \text{ is visible in both images} \\ 0 & \text{otherwise} \end{cases}$,   (2)

so that O^v_q represents the event {v(q) = 1}. However, this model considers only the reference image and does not take the depth (i.e., disparity) under consideration into account. In our work, we extend the model to consider both images (see Footnote 1) and the disparity. In this case, our model can be expressed as

$w(p, q, d) \propto \Pr(O^d_q \mid O^d_p)\, \Pr(O^d_{\bar{q}_d} \mid O^d_{\bar{p}_d})\, \Pr(O^v_q)$,   (3)

where w(p, q, d) (= w(q, p, d)) is the support-weight of q for a disparity d. p̄_d and q̄_d are the corresponding pixels in the target image when p and q in the reference image have a disparity d, as shown in Fig. 3 (see Footnote 2). O^d_p represents the event that the disparity of p is d, i.e., {d_p = d}. Similarly, O^d_{p̄_d}, O^d_q, and O^d_{q̄_d} represent the events {d_{p̄_d} = d}, {d_q = d}, and {d_{q̄_d} = d}, respectively (see Footnote 3). The support-weight modeling is then reduced to modeling the probabilities Pr(O^d_q | O^d_p), Pr(O^d_{q̄_d} | O^d_{p̄_d}), and Pr(O^v_q) in Eq. (3). Pr(O^d_q | O^d_p) and Pr(O^d_{q̄_d} | O^d_{p̄_d}) are the probabilities that neighboring pixels have the same disparity (i.e., are from the same depth) in each image. By combining them with Pr(O^v_q), we can also account for supports coming from occluded pixels. However, we do not know the disparities and visibilities of pixels beforehand, because they are exactly what we want to compute. For this reason, some methods [4,19] iteratively update support windows or support-weights. The iterative methods,
Footnotes:
1. It is possible to generalize the model to the N-view stereo problem.
2. It is assumed without loss of generality that images are rectified, so that the disparities of pixels are restricted to 1D scalar values.
3. When p in the reference image corresponds to p̄_d in the target image with a disparity d, we also define the disparity of p̄_d as d.
however, are very sensitive to the initial disparity estimation. To resolve this dilemma, we need some reasonable constraints or assumptions. In this work, we assume that depth discontinuities generally coincide with color or intensity edges. In addition, we assume that if a neighboring pixel is from the same depth and visible in both images, similar colors are observed at the same position in both images. Although these assumptions look very simple, they allow us to develop an efficient method.
3 Support-Weights Computation
According to Eq. (3), we should assign small support-weights to pixels that probably come from different depths or are occluded. To this end, we first compute adaptive support-weights based on the color similarity and the spatial distance between pixels in each image, to block the supports coming from different depths, and then recompute the support-weights used for support aggregation according to the disparities, to block the supports coming from occluded areas.
3.1 Adaptive Support-Weights for Different Depths
The adaptive support-weights originate from the observation that, in the human visual system, pixels in a support area are not equally important for support aggregation. They are computed based on the color similarity and the proximity between the pixel under consideration and its neighboring pixels, as we proposed in [6]. The more similar the color of a pixel, the larger its support-weight; in addition, the closer the pixel is, the larger its support-weight. The support-weight of a neighboring pixel is defined using the Laplacian kernel as

$w_r(p, q) = \exp\left(-\left(\frac{\Delta c_{pq}}{\gamma_c} + \frac{\Delta g_{pq}}{\gamma_p}\right)\right)$,   (4)

where Δc_pq and Δg_pq represent the color difference and the spatial distance between p and q in an image, respectively, and γ_c and γ_p are control parameters. In the same manner, we can compute w_t(p̄_d, q̄_d) as

$w_t(\bar{p}_d, \bar{q}_d) = \exp\left(-\left(\frac{\Delta c_{\bar{p}_d\bar{q}_d}}{\gamma_c} + \frac{\Delta g_{\bar{p}_d\bar{q}_d}}{\gamma_p}\right)\right)$.   (5)

These support-weights favor pixels that have a high probability of being at the same depth as the pixel of interest. Therefore, w_r(p, q) and w_t(p̄_d, q̄_d) are related to Pr(O^d_q | O^d_p) and Pr(O^d_{q̄_d} | O^d_{p̄_d}), respectively. Note that these support-weights are entirely based on the contextual information within the given support areas and do not depend on the initial disparity estimation at all.
3.2 Disparity-Dependent Support-Weights for Occlusion
Although the support-weights computed in Eq. (4) and Eq. (5) can prevent false supports coming from different depths, it is not possible, as in previous works, to
Fig. 4. Disparity-dependent adaptive support-weights: (a) reference image, (b) support-weights of (a), (c) target image, (d) support-weights of (c), (e) disparity-dependent support-weights. The brighter pixels have larger support-weights in (b), (d), and (e).
prevent false supports coming from the half-occluded pixels shown in Fig. 2, because these support-weights are computed in each image separately. To reduce the effect of half-occluded pixels (e.g., pixels in B in Fig. 2), the support-weights are combined as a function of the disparity under consideration. The support-weight of a neighboring pixel q in N_p for the disparity d is computed by combining the two support-weights from both images as

$w(p, q, d) = w_r(p, q) \times w_t(\bar{p}_d, \bar{q}_d) = \exp\left(-\left(\frac{\Delta c_{pq}}{\gamma_c} + \frac{\Delta g_{pq}}{\gamma_p}\right)\right) \times \exp\left(-\left(\frac{\Delta c_{\bar{p}_d\bar{q}_d}}{\gamma_c} + \frac{\Delta g_{\bar{p}_d\bar{q}_d}}{\gamma_p}\right)\right)$.   (6)
Here, w(p, q, d) is equal to w(q, p, d), and these support-weights do not change during the whole process. The blocking of supports coming from occluded pixels is clearly shown in Fig. 4. Figure 4(a) and Fig. 4(c) show two local support windows, and Fig. 4(b) and Fig. 4(d) show the support-weights corresponding to Fig. 4(a) and Fig. 4(c), computed by Eq. (4) and Eq. (5), respectively. Here, the pixel A in Fig. 4(a) corresponds to the pixel A in Fig. 4(c) with the disparity d_A. On the other hand, the pixel B in Fig. 4(a) does not correspond to the pixel B in Fig. 4(c) for d_A because of occlusion. Therefore, the support coming from B to A should be suppressed during the aggregation step when considering the disparity d_A. However, if we consider the reference image only and use the adaptive support-weights shown in Fig. 4(b), the pixel A may get incorrect support from B (because the two pixels have almost the same color), and this may result in a false match. If we instead adjust the support-weights according to the considered disparity, as shown in Fig. 4(e), we can effectively block the support coming from B — the pixel B indeed has a very small support-weight in Fig. 4(e).
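The following sketch illustrates how the weights of Eqs. (4)–(6) can be computed for a single support window. The function names and γ values are our own illustrative choices, and we assume the reference and target windows have already been extracted around p and p̄_d.

```python
import numpy as np

def support_weights(window, center, gamma_c=7.0, gamma_p=36.0):
    """Adaptive support-weights of Eq. (4)/(5) for one square window.
    window: (k, k, 3) color data; center: (row, col) of the pixel of interest."""
    k = window.shape[0]
    yy, xx = np.mgrid[0:k, 0:k]
    delta_g = np.sqrt((yy - center[0]) ** 2 + (xx - center[1]) ** 2)          # spatial distance
    delta_c = np.linalg.norm(window - window[center[0], center[1]], axis=2)   # color difference
    return np.exp(-(delta_c / gamma_c + delta_g / gamma_p))

def combined_weights(ref_window, tgt_window, center, **kwargs):
    """Disparity-dependent support-weights of Eq. (6): product of the
    weights computed in the reference and target windows."""
    return support_weights(ref_window, center, **kwargs) * \
           support_weights(tgt_window, center, **kwargs)
```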
3.3 Support-Weights for Occluded Pixels
From Fig. 4, we can see that the disparity-dependent support-weights do block the supports coming from occluded pixels. However, it is impossible to block incorrect supports perfectly as the iteration goes on, simply because the model we use is not perfect. To compensate for this, we detect occluded pixels
at each iteration. Occluded pixels can be detected by thresholding the lowest aggregated cost as in [11], or by a left-right consistency check after applying the winner-take-all method to both images. Once a pixel is detected as occluded at the (n−1)-th iteration, its support-weight at the n-th iteration is given, using the binary function v(·) defined in Eq. (2), as

$w_v(q) = \exp\left(-\frac{1 - v(q)}{\gamma_v}\right)$,   (7)

where γ_v is a control parameter.
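As a small illustration, Eq. (7) can be evaluated directly from a binary visibility map; the value of γ_v below is only a placeholder.

```python
import numpy as np

def occlusion_weight(visible, gamma_v=1.0):
    """Eq. (7): down-weight pixels flagged as occluded at the previous iteration.
    `visible` is the binary map v(q), e.g. from a left-right consistency check."""
    return np.exp(-(1.0 - visible.astype(np.float64)) / gamma_v)
```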
4 Support Aggregation via Diffusion

4.1 Non-Linear Diffusion of Supports
The non-linear diffusion for support aggregation can be achieved simply by using the disparity-dependent support-weights and the visibility function. Supports are iteratively aggregated as

$S_n(p, d) = \frac{1}{z} \sum_{q \in N_p} w(p, q, d)\, w^v_{n-1}(q)\, S_{n-1}(q, d)$,   (8)

where $z = \sum_{q \in N_p} w(p, q, d)\, w^v_{n-1}(q)$ is the normalization constant and n denotes the iteration number. Initially, $w^v_0(\cdot) = 1$ for all pixels. S_0(p, d) is the initial matching cost of p for disparity d, which can be computed using any function of image colors.
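A minimal sketch of the diffusion step in Eq. (8) is given below. It is written per pixel with dictionary-based neighborhoods purely for readability (a practical implementation would be vectorized), the occlusion-weight update of Eq. (7) between iterations is omitted, and the data layout is our assumption.

```python
def diffuse_supports(cost0, weights, visibility, n_iter=300):
    """Non-linear diffusion of supports, Eq. (8).
    cost0[p][d]      : initial matching cost S_0(p, d)
    weights[p][q][d] : disparity-dependent support-weight w(p, q, d), q in N_p
    visibility[q]    : w^v(q); fixed here for brevity, updated via Eq. (7) in practice"""
    S = {p: dict(costs) for p, costs in cost0.items()}
    for _ in range(n_iter):
        S_new = {}
        for p, costs in S.items():
            S_new[p] = {}
            for d in costs:
                num, z = 0.0, 0.0
                for q, w_qd in weights[p].items():
                    w = w_qd[d] * visibility[q]
                    num += w * S[q][d]
                    z += w
                S_new[p][d] = num / z if z > 0 else S[p][d]   # normalized aggregation
        S = S_new
    return S
```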
4.2 Stereo Matching Using the Proposed Diffusion
In this section, we propose a simple stereo method based on the proposed non-linear diffusion of supports. First, the initial per-pixel raw matching costs are computed using the truncated absolute difference (AD) as

$S_0(p, d) = \sum_{c \in \{r,g,b\}} \min\{\, |I_c(p) - I_c(\bar{p}_d)|,\ T \,\}$,   (9)
where I_c is the intensity of the color band c and T is the truncation value that controls the upper limit of the matching cost. Using the initial per-pixel raw matching costs, the support aggregation is performed with the proposed method. For a fair verification of the proposed support-aggregation method, we do not perform any complicated optimization; instead, we adopt the simplest winner-take-all method, which selects the disparity of each pixel locally. The disparity with the lowest aggregated cost is selected as

$d_p = \arg\min_{d \in C_d} S(p, d)$,   (10)
where C_d = {d_min, ..., d_max} is the set of all possible disparities. This winner-take-all step is followed by a left-right consistency check for invalidating occlusions
and mismatches. Because the disparity is determined without any complicated optimization, it is easy to apply the left-right consistency check. Invalid disparity areas are then filled by propagating neighboring small (i.e., background) disparity values, as in [20], where cost functions for stereo matching are evaluated and compared. We also apply a median filter to the resulting disparity maps to remove salt-and-pepper errors, because the support-weight computation using a single pixel color is sensitive to image sampling and image noise. The reason we perform these post-processing steps, as opposed to comparing raw results, is to reduce overall errors, as done in [20].
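The disparity selection of Eq. (10) and the left-right consistency check can be sketched as follows; the cost-volume layout, the disparity convention, and the one-pixel consistency tolerance are our assumptions.

```python
import numpy as np

def winner_take_all(agg_left, agg_right, d_min=0):
    """WTA disparity selection plus a left-right consistency check.
    agg_left/agg_right: aggregated cost volumes of shape (n_disp, h, w)."""
    disp_l = agg_left.argmin(axis=0) + d_min
    disp_r = agg_right.argmin(axis=0) + d_min
    h, w = disp_l.shape
    valid = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disp_l[y, x]
            xr = x - d                      # matching pixel in the right image
            if 0 <= xr < w and abs(disp_r[y, xr] - d) <= 1:
                valid[y, x] = True
    return disp_l, valid                    # invalid pixels are occlusions/mismatches to be filled
```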
5 Experiments
We verify the effectiveness of the proposed method using real images with ground truth, which are widely used for performance comparison of various methods, as in [3]. The proposed method is run with a constant parameter setting across all images — the size of a local support window is fixed at (5 × 5), and the number of iterations is set to 300, which is sufficient for convergence on all images.
Fig. 5. Performance with respect to the number of iterations
Fig. 6. Results according to the number of iterations: (a) n = 1, (b) n = 50, (c) n = 200
Table 1. Performance of the proposed method compared with other diffusion-based methods

                        Tsukuba               Sawtooth               Venus
                   nonocc untex  disc    nonocc untex  disc    nonocc untex  disc
Adapt. diff.         1.14  0.26  6.12      1.15  0.02  5.90      0.73  0.46  3.81
Bay. diff. [9]       6.49 11.62 12.29      1.45  0.72  9.29      4.00  7.21 18.39
Stoch. diff. [21]    3.95  4.08 15.49      2.45  0.90 10.58      2.45  2.41 21.84
Table 2. Performance of the proposed method for the new testbed images

      Tsukuba                Venus                 Teddy                 Cone
 nonocc  all   disc     nonocc  all   disc    nonocc  all    disc    nonocc  all    disc
  1.14   1.93  6.12      0.73   1.21  3.81     8.02   14.2   21.97    4.12   10.58  7.75
Fig. 7. Dense disparity maps for the ‘Tsukuba’, ‘Venus’, ‘Teddy’ and ‘Cone’ images: (a) left image, (b) ground truth, (c) our results, (d) bad pixels (error > 1)
Figure 5 and Fig. 6 show the performance of the proposed method on the ‘Tsukuba’ images with respect to the iteration number. Unlike other support aggregation methods, where the errors in depth discontinuity areas increase drastically as the iteration goes on (or as the size of the support area increases), as shown in Fig. 1, the performance of the proposed method in textureless areas and in depth discontinuity areas remains almost constant, without critical degradation, after a small number of iterations. Therefore, the number of iterations is not critical and no local stopping criterion is needed in the proposed method. This is one of the main advantages of the proposed method. The results for the testbed images are given in Fig. 7. As shown in Fig. 7, the proposed method yields accurate results in all image areas (including depth discontinuity areas) for all testbed images. The performance is summarized in Table 1 and Table 2 for comparison with other state-of-the-art local methods. The numbers in Table 1 and Table 2 represent the percentage of bad pixels (i.e., pixels whose absolute disparity error is greater than 1). We can see that the proposed method yields comparable results even though we use only the simple winner-take-all method without any complicated processes (see Footnote 4). In particular, the performance in depth discontinuity areas is much better than that of the other diffusion-based methods. This shows the effectiveness of the proposed support aggregation method.
6 Conclusion
In this work, we have proposed a new non-linear diffusion method for support aggregation in stereo matching. The core of the support aggregation is to block incorrect supports coming from different depths and from occluded areas. To this end, we first compute adaptive support-weights based on the color similarity and spatial distance between pixels in each image. Supports are then iteratively aggregated with the support-weights, while the support-weights are adjusted according to disparities. Despite its simplicity, the proposed method produces accurate disparity maps. Experimental results show that the proposed method yields good results not only in homogeneous areas but also in depth discontinuity areas as the iteration goes on, without critical degradation of performance. This demonstrates that considering both images at the same time when computing matching costs helps to reduce the effect of outliers such as pixels from different depths or half-occluded pixels.
Footnote 4: Because of the page limit and to preserve anonymity, we do not include the results of other local methods in Table 2. See http://cat.middlebury.edu/stereo/ to examine and compare the results.

References

1. Baker, S., Sim, T., Kanade, T.: A characterization of inherent stereo ambiguities. In: IEEE International Conference on Computer Vision, July 2001, pp. 428–437 (2001)
2. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
3. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47(1-3), 7–42 (2002)
4. Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(9), 920–932 (1994)
5. Veksler, O.: Stereo correspondence with compact windows via minimum ratio cycle. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12), 1654–1660 (2002)
6. Yoon, K.J., Kweon, I.S.: Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 650–656 (2006)
7. Bobick, A.F., Intille, S.S.: Large occlusion stereo. International Journal of Computer Vision 33(3), 181–200 (1999)
8. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001)
9. Scharstein, D., Szeliski, R.: Stereo matching with nonlinear diffusion. International Journal of Computer Vision 28(2), 155–174 (1998)
10. Sun, J., Zheng, N.N., Shum, H.Y.: Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(7), 787–800 (2003)
11. Zitnick, C., Kanade, T.: A cooperative algorithm for stereo matching and occlusion detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(7), 675–684 (2000)
12. Tappen, M.F., Freeman, W.T.: Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In: IEEE International Conference on Computer Vision, vol. 2, pp. 900–906 (2003)
13. Bhat, D.N., Nayar, S.K.: Ordinal measures for image correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(4), 415–423 (1998)
14. Zabih, R., Woodfill, J.: Non-parametric local transform for computing visual correspondence. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 151–158. Springer, Heidelberg (1994)
15. Kang, S.B., Szeliski, R., Chai, J.: Handling occlusions in dense multi-view stereo. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 103–110 (2001)
16. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions via graph cuts. In: IEEE International Conference on Computer Vision, vol. 2, pp. 508–515 (2001)
17. Yoon, K.J., Kweon, I.S.: Stereo matching with symmetric cost functions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2371–2377 (2006)
18. Yoon, K.J., Kweon, I.S.: Stereo matching with the distinctive similarity measure. In: IEEE International Conference on Computer Vision, pp. 1–7 (2007)
19. Prazdny, K.: Detection of binocular disparities. Biological Cybernetics 52, 93–99 (1985)
20. Hirschmuller, H., Scharstein, D.: Evaluation of cost functions for stereo matching. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
21. Lee, S.H., Kanatsugu, Y., Park, J.I.: Hierarchical stochastic diffusion for disparity estimation. In: IEEE Workshop on Stereo and Multi-Baseline Vision, pp. 111–120 (2001)
Manifold Estimation in View-Based Feature Space for Face Synthesis across Poses

Xinyu Huang, Jizhou Gao, Sen-ching S. Cheung, and Ruigang Yang
Center for Visualization and Virtual Environments, University of Kentucky, USA
Abstract. This paper presents a new approach to synthesizing face images under different pose changes given a single input image. The approach is based on two observations: (1) a series of face images of a single person under different poses can be mapped to a smooth manifold in a unified feature space; (2) the manifolds of different faces are separated from each other according to their dissimilarities. The manifold estimation is formulated as an energy minimization problem with smoothness constraints. The experiments show that face images under different poses can be robustly synthesized from one input image, even with large pose variations.
1 Introduction
Face synthesis has been an active research topic in computer vision since the 1990s. Synthesizing an unseen view of a face accurately and efficiently can definitely improve the quality of face recognition and face reconstruction. However, it remains a challenging problem due to pose, illumination, and facial expression variations, as well as occlusions. It is also known that changes caused by pose and illumination are much larger than the changes of personal appearance [1]. In this paper, we propose a novel approach to synthesize unseen views across various poses given one input image. We assume that face relighting, pose estimation, and face alignment are solved using existing technologies. For example, in order to change the illumination condition of a face image (i.e., face re-lighting), many approaches have been developed recently (e.g., [2,3]); most of them can even deal with cast shadows and saturated areas. Another important problem, face alignment, which usually fits a 2D/3D face model to one input image, can be solved across poses, illumination, and even partial occlusions (e.g., [4,5,6,7]). However, fitting a model to one input image accurately does not guarantee that this model is suitable for all other conditions, since face fitting and synthesis from one 2D input is itself an ill-posed problem. For example, a face model generated from one image could fail to represent the same face at other poses, especially at large pose changes. In addition, though face models can be used to estimate poses, it turns out that appearance-based methods using non-linear dimensionality reduction techniques can obtain more accurate pose estimates, e.g., errors of less than 2 degrees in [8,9].
In our approach, we first build multiple view-based Active Appearance Models (AAMs) [10] for a database of face images from different persons under various poses. Each AAM extracts shape and appearance parameters corresponding to a certain pose. After transforming the parameters from all AAMs into the same feature space, a face image can be mapped to a high-dimensional point in the unified feature space. Furthermore, we observe that a series of face images of a single person under different poses forms a smooth manifold, and that the shapes of the manifolds of similar faces resemble each other. Given a new input face image, i.e., a new point in the unified feature space, the ultimate goal is therefore to estimate a new manifold corresponding to the input face at different poses. We test our method on images obtained from the 3D Morphable Model (3DMM) [11] and the CMU-PIE database [12]. Experiments show that this approach is able to synthesize faces effectively even with large pose changes. The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 presents our manifold estimation algorithm. Section 4 shows our experimental results for face synthesis. Section 5 gives the conclusion and future work.
2 Related Works
There are many existing methods for face synthesis across poses. They can be divided into two categories: model-based methods [13,4,5,6] and appearance-based methods [14,15,16,17]. In [13], Vetter et al. propose that a 2D view of an object can be represented by a weighted sum of 2D views of other objects; the weights remain the same on the transformed views when the object belongs to a linear class. Similar to this idea, the morphable model [6] is formulated as a vector space of 3D shapes and surface textures spanned by a training set that contains 200 laser-scanned 3D face models. 3D shape reconstruction is a fitting problem that minimizes the difference between the rendered model image and the input image. This method generates face models of good quality and is widely used in many face applications; however, it is computationally expensive and needs careful manual initialization and segmentation. Cootes et al. [10] first propose AAMs to solve the face fitting problem. Romdhani et al. [5] extend it to a nonlinear model using kernel PCA to align faces across poses. A more general method using a Gaussian mixture model of shape-and-texture appearance is proposed in [18], where fitting is solved by the EM algorithm. In [19], a combined 2D+3D AAM is proposed for real-time face fitting. In [4], three separate AAMs are trained at different face poses. Assuming all the modeled features are visible, a linear model is used to represent the correlations between appearance in two views, and the appearance from a new pose can then be predicted given an input face image. In [14], Gross et al. estimate the Fisher Light-Field of the subject's head and then perform matching between the probe and gallery using the Fisher Light-Fields. In [16], Tenenbaum et al. first propose the bilinear model to separate
the content (intra-class) and style (inter-class). This model is extended to a multilinear model in [17] to separate face poses, expressions, and appearances. In [20], Li et al. further extend the model to the nonlinear case, in which the input space is transformed to a feature space using the Gaussian kernel. In [15], Turk et al. first compare the parametric eigenspace and the view-based eigenspace. Inspired by the view-based eigenspace, [21,22] show that the appearance of one object forms a smoothly varying manifold in the eigenspace. Pose estimation and object recognition can then be performed by computing distances between the projection of a new input and the manifolds in an object database. This appearance model is further applied to visual tracking and recognition in video sequences [23].
3 Algorithm

3.1 Manifolds of Parameters in Feature Space
In an AAM, a shape s, defined by a set of 2D facial markers, can be represented by a mean shape s̄ and a set of shape bases s_i:

$s = \bar{s} + \sum_{i=1}^{d} \alpha_i s_i$,   (1)

where α = (α_1, ..., α_d)^T is the shape parameter. Similarly, the appearance t of a face can be represented by a mean appearance t̄ and a set of appearance bases t_i:

$t = \bar{t} + \sum_{i=1}^{d} \beta_i t_i$,   (2)

where β = (β_1, ..., β_d)^T is the appearance parameter. The mean shape s̄, the mean appearance t̄, the shape bases s_i, and the appearance bases t_i are obtained by applying Principal Component Analysis (PCA) to a face database. Moreover, the appearances defined by t in Equation (2) are shape-free textures, obtained by applying piece-wise affine transformations to the mean shape s̄. When new parameters α_new and β_new are given, we are able to compute s_new and t_new separately and then warp t_new back to its shape s_new to obtain a new face image. Given one AAM, a face image can be uniquely defined by two high-dimensional points, α ∈ R^d in the shape parameter space and β ∈ R^d in the appearance parameter space, where d is the number of PCA bases. However, due to the nonlinearity of pose changes, shape and appearance variations cannot be modeled by a single AAM. Similar to the approach in [4], we build a mixture of AAMs that covers the whole range of pose changes. Furthermore, after the n view-based feature spaces are transformed to the reference feature space (e.g., 90◦ at the frontal view) by multiplying by n projection matrices, we observe that the transformed parameters form smooth manifolds. Each manifold corresponds to a series of face images of a single person under different poses.
Fig. 1. Manifolds of shape parameters of 25 faces. Each color represents one face horizontally varying from 0◦ to 180◦ . (a) First three principal components. (b) First and second principal components. (c) First and third principal components. (d) Second and third principal components.
Figure 1 shows the distribution of the shape parameters of 25 faces. Only the first three principal components are displayed, for visualization purposes. The manifolds of the appearance parameters have distributions similar to those in Figure 1. The manifolds of parameters have two important properties: smoothness and separateness. Smoothness means that the shape of a manifold changes smoothly as the pose varies. Separateness means that the manifolds are separated from each other in the feature space; the more similar two faces are, the closer the two corresponding manifolds are to each other. The manifolds used in our approach are quite different from those estimated directly in the input space using nonlinear dimensionality reduction techniques [24,25]. In those works, the manifolds of all faces are grouped together, so that pose changes are significantly larger than personal facial features. This is useful for pose estimation; however, it also makes the synthesis of new faces intractable.
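As an illustration of Eqs. (1)–(2), the sketch below reconstructs a shape and a shape-free texture from given AAM parameters. The array layout (bases stored as rows) is our assumption, and the piece-wise affine warp back to the shape is not shown.

```python
import numpy as np

def synthesize_face(alpha, beta, s_mean, S_basis, t_mean, T_basis):
    """Reconstruct a shape and a shape-free texture from AAM parameters.
    S_basis: (d, 2*n_landmarks) shape bases; T_basis: (d, n_texture) appearance bases."""
    shape = s_mean + alpha @ S_basis      # s = s_bar + sum_i alpha_i * s_i
    texture = t_mean + beta @ T_basis     # t = t_bar + sum_i beta_i * t_i
    return shape, texture

# Illustrative usage with d = 9 principal components and 92 landmarks:
# alpha = np.zeros(9); beta = np.zeros(9)
# shape, texture = synthesize_face(alpha, beta, s_mean, S_basis, t_mean, T_basis)
```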
3.2 Estimation of a New Manifold with One Input
Assuming that only one d-dimensional shape parameter u^p ∈ R^d at the p-th pose is given, the first step of our approach is to find a good approximation of this point from the database points x^p = (x^p_1, ..., x^p_k) at the p-th pose in the shape parameter
space. A similar approximation can also be applied in the appearance space. This can be formulated as an energy minimization problem,

$\arg\min_{\pi_i} \left\| u^p - \sum_{i=1}^{k} \pi_i x^p_i \right\|^2, \quad \text{s.t.} \quad \sum_{i=1}^{k} \pi_i = 1$,   (3)

where π = (π_1, ..., π_k) are the reconstruction weights. Let Z = [..., x^p_i − u^p, ...] be a matrix with k column vectors; the linear system Z^T Z \pi = 1 is solved for the weights π. However, this system needs to be regularized if Z^T Z is singular or nearly singular, which happens when k > d, and this makes the solution for the weights π unstable [26]. When the number of different faces in the database k is much larger than the number of principal components d, we adopt Matching Pursuit (MP) [27] to compute the reconstruction weights, for two reasons. First, MP is a good choice for computing weights on an over-complete basis. Second, a better face synthesis can be obtained from a few nearest neighbors; for example, a young female face could be better synthesized by similar faces instead of by all the faces, including ones with beards. At each iteration of MP, the component x^p_γ is found such that it has the maximal inner product with the residual R^i u^p:

$R^i u^p = \langle R^i u^p, x^p_\gamma \rangle x^p_\gamma + R^{i+1} u^p, \qquad x^p_\gamma = \arg\max_{x^p_\gamma,\ 1 \le \gamma \le k} \left| \langle R^i u^p, x^p_\gamma \rangle \right|$.   (4)

After convergence, u^p is approximated by m bases,

$u^p \approx \sum_{i=1}^{m} \pi_i x^p_{\gamma_i}, \qquad \pi_i = \langle R^i u^p, x^p_{\gamma_i} \rangle$.   (5)
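The matching pursuit step of Eqs. (4)–(5) can be sketched as follows. We stop after a fixed number of atoms m instead of a convergence test, and we assume the database vectors have been normalized to unit length, which the paper does not state explicitly.

```python
import numpy as np

def matching_pursuit(u_p, X_p, m=5):
    """Greedy matching pursuit over the pose-p training parameters.
    u_p: (d,) input parameter vector at pose p.
    X_p: (k, d) rows are the k database parameter vectors at pose p (unit norm)."""
    residual = u_p.copy()
    indices, weights = [], []
    for _ in range(m):
        scores = X_p @ residual                 # inner products <R^i u^p, x_gamma^p>
        gamma = int(np.argmax(np.abs(scores)))  # best-matching atom
        pi = scores[gamma]
        indices.append(gamma)
        weights.append(pi)
        residual = residual - pi * X_p[gamma]   # R^{i+1} u^p = R^i u^p - <.,.> x_gamma^p
    return indices, np.array(weights)
```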
Our second step is to estimate u at other poses. Since the principal components are orthogonal to each other, the synthesis can be done along each direction l. Given the weights π and the m face bases that best reconstruct u^p, we assume, for simplicity, that the weights and face bases remain the same at other poses. Thus, the estimation is formulated as the minimization of the following energy function E:

$E = E_1 + \lambda E_2$,
$E_1 = \sum_{j=1}^{n} \left\| \sum_{i=1}^{m} \pi_i x^j_{i,l} - u^j_l \right\|^2$,
$E_2 = \sum_{j=2}^{n-1} w_j \left\| 2 y^j_l - y^{j+1}_l - y^{j-1}_l \right\|^2 + w_1 \left\| y^2_l - y^1_l \right\|^2 + w_n \left\| y^n_l - y^{n-1}_l \right\|^2$,
$w_j = \exp\left(-\frac{1}{m} \sum_{i=1}^{m} \left\| 2 y^j_{i,l} - y^{j+1}_{i,l} - y^{j-1}_{i,l} \right\|^2\right)$,   (6)
Fig. 2. Estimation of a new manifold for the i-th principal component, given one point at 20 degrees. The light green manifolds are the ones in the database, the dark blue manifold is the initial guess, and the red manifold is estimated by minimizing the energy function E.
where E_1 encodes the prior knowledge of u^j_l at the j-th pose and l-th principal component, E_2 is the first-order smoothness constraint on the estimated manifold, y is the projection of u in the reference feature space, and w_j is a weight associated with the average smoothness of the face database at the j-th pose. Figure 2 illustrates the estimation framework. The initial guess is obtained by translating the manifold whose point at the p-th pose is the nearest neighbor of the input point. Our approach is summarized in Algorithm 1.
Algorithm 1. Manifold Estimation
1: Compute shape and appearance parameters in the reference feature spaces using Equations (1)–(2).
2: Compute the reconstruction weights π at the p-th pose using Equation (3) or Matching Pursuit.
3: For each principal component, minimize the energy function E in Equation (6) to obtain a new manifold.
A further intuitive attempt to obtain a more accurate manifold estimate is to model the changes of the reconstruction weights π instead of using fixed values. However, this turns out to be difficult and, in our experiments, not very useful. The reason is that the variance of the weights π is small and their distribution is arbitrary and highly person-dependent, although the changes of the weights π are smooth. Therefore, applying the same weights at other poses is a reasonable approximation.
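Because E is quadratic, it can be minimized in closed form for each principal component. The sketch below makes the simplifying assumption that the projection y^j_l can be identified with u^j_l; the paper only specifies the energy, so the variable layout and the linear solve are our own.

```python
import numpy as np

def estimate_manifold_component(t, w, lam=1.0):
    """Minimize E = E1 + lam*E2 for one principal component, assuming y = u.
    t: (n,) targets sum_i pi_i * x_{i,l}^j for each pose j (the E1 term).
    w: (n,) smoothness weights w_j precomputed from the database."""
    n = len(t)
    rows, wts = [], []
    for j in range(1, n - 1):                # interior terms 2u_j - u_{j+1} - u_{j-1}
        r = np.zeros(n)
        r[j - 1], r[j], r[j + 1] = -1.0, 2.0, -1.0
        rows.append(r)
        wts.append(w[j])
    first = np.zeros(n); first[0], first[1] = -1.0, 1.0     # w_1 * ||u_2 - u_1||^2
    last = np.zeros(n);  last[-2], last[-1] = -1.0, 1.0     # w_n * ||u_n - u_{n-1}||^2
    rows += [first, last]; wts += [w[0], w[-1]]
    D = np.vstack(rows)
    W = np.diag(wts)
    # E is quadratic in u, so its minimizer solves (I + lam * D^T W D) u = t
    return np.linalg.solve(np.eye(n) + lam * D.T @ W @ D, t)
```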
4 Experiments

4.1 Database
Face images in the database are generated semi-automatically based on the 25 3D face models in [11]. These 25 3D models include persons of different ages, races, and genders. On each 3D model, 92 feature points are labeled. The 2D locations of the feature points are computed automatically while faces with rotations are rendered using 3ds Max. Poses are changed horizontally from 0◦ to 180◦ in 2◦ increments. The 2D locations of occluded feature points and of feature points on face outlines are further refined by pushing them to the face boundaries. A set of examples is shown in Figure 3.
Fig. 3. Examples in face database. Each face has 92 feature points.
4.2 Results of Manifold Estimation
Figure 4 shows a typical manifold estimation given an input at a pose of 20 degrees. Only 20 faces are currently used in the face database for training; the other 5 faces are used as test cases. Nine principal components are used for the shape and appearance parameters. Our estimation is close to the ground truth. In [13], the parameters α and β are fixed, assuming a face belongs to a linear object class; it turns out that this is valid only within small pose changes (around ±10 degrees). The estimation using MP is slightly better than the estimation using Equation (3), and the improvement from MP should increase when more faces are used in the database. Notice that a manifold becomes less smooth as the variance of the principal component gets smaller, but such components also contribute less to the face synthesis.
Fig. 4. Manifold estimation results from 0◦ to 180◦ with an input image at 20◦. Each panel plots the initial guess, Estimation 1, Estimation 2, the ground truth, and the linear object class result against the horizontal pose. Estimation 1 is generated using Equation (3); Estimation 2 is generated using Matching Pursuit. (a) 1st principal component of the shape parameters. (b) 2nd principal component of the shape parameters. (c) 9th principal component of the shape parameters. (d) 1st principal component of the appearance parameters (notice that the initial guess, Estimation 1, and Estimation 2 overlap). (e) 2nd principal component of the appearance parameters. (f) 9th principal component of the appearance parameters.
4.3 Results of Face Synthesis
Figure 5 shows our face synthesis results. Our approach robustly synthesizes faces from 0◦ to 180◦ given an input at any pose. Furthermore, the appearances of all the synthesized faces are consistent even with inputs at different poses (as shown in Figure 5, faces are synthesized from inputs at 20◦, 90◦, and 170◦, respectively). Figure 6 shows the results for one input from the CMU-PIE database. The illumination of the input image is roughly normalized so that it is similar to the illumination condition in our database; this can also be done with existing face re-lighting algorithms. Notice that the input images in Figure 5 and Figure 6 (with white rectangles) are not close to the ground truth, since the shape and appearance parameters cannot be computed very accurately using only 20 faces in the database. We plan to use a larger database in the future.
Fig. 5. Synthesis results. The images with white rectangles are the input images. The first row is the ground truth generated from the 3D models in [11]. The second row shows the synthesis results given an input at 20◦, the third row given an input at 170◦, and the fourth row given an input at 90◦.
Fig. 6. Synthesis results from one input in CMU-PIE database [12]. The first row is generated by our approach with an input at 90 degrees. The second row is the ground truth.
5 Conclusion
In this paper, we propose a novel approach for face synthesis across poses. By applying manifold estimation in the unified view-based feature space, this approach is able to synthesize unseen views, even for large pose changes. Moreover, it is straightforward to extend the approach to handle multiple inputs, which can improve the estimation accuracy. In the future, we will test and refine our approach on a larger face database, which should improve the synthesis quality, and apply the method to different areas, such as face recognition and 3D face reconstruction.
References

1. Romdhani, S., Ho, J., Vetter, T., Kriegmann, D.: Face Recognition Using 3D Models: Pose and Illumination. Proceedings of the IEEE 94(11) (2006)
2. Wang, Y., Liu, Z., Hua, G., Wen, Z., Zhang, Z., Samaras, D.: Face Re-Lighting from a Single Image under Harsh Lighting Conditions. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2007)
3. Luong, Q., Fua, P., Leclerc, Y.: Recovery of reflectances and varying illuminants from multiple views. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 163–179. Springer, Heidelberg (2002)
4. Cootes, T.F., Walker, K., Taylor, C.J.: View-Based Active Appearance Models. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 227–232 (2000)
5. Romdhani, S., Psarrou, A., Gong, S.: On Utilising Template and Feature-Based Correspondence in Multi-view Appearance Models. In: European Conference on Computer Vision, pp. 799–813 (2000)
6. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1063–1074 (2003)
7. Gu, L., Kanade, T.: 3D Alignment of Face in a Single Image. In: IEEE Conference on Computer Vision and Pattern Recognition (2006)
8. Balasubramanian, V.N., Ye, J., Panchanathan, S.: Biased Manifold Embedding: A Framework for Person-Independent Head Pose Estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7 (2007)
9. Fu, Y., Huang, T.S.: Graph Embedded Analysis for Head Pose Estimation. In: International Conference on Automatic Face and Gesture Recognition, pp. 3–8 (2006)
10. Cootes, T.F., Edwards, G., Taylor, C.J.: Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
11. Blanz, V., Vetter, T.: A Morphable Model for the Synthesis of 3D Faces. In: SIGGRAPH, pp. 187–194 (1999)
12. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression (PIE) database of human faces. Technical Report CMU-RI-TR-01-02, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (January 2001)
13. Vetter, T., Poggio, T.: Linear Object Classes and Image Synthesis From a Single Example Image. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 733–742 (1997)
14. Gross, R., Matthews, I., Baker, S.: Fisher Light-Fields for Face Recognition Across Pose and Illumination (2002)
15. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
16. Tenenbaum, J.B., Freeman, W.T.: Separating Style and Content with Bilinear Models. Neural Computation 12(6), 1247–1283 (2000)
17. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Analysis of Image Ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002)
18. Christoudias, C., Darrell, T.: On modelling nonlinear shape-and-texture appearance manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition (2005)
19. Xiao, J., Baker, S., Matthews, I., Kanade, T.: Real-Time Combined 2D+3D Active Appearance Models. In: IEEE Conference on Computer Vision and Pattern Recognition (2004)
20. Li, Y., Du, Y., Lin, X.: Kernel-Based Multifactor Analysis for Image Synthesis and Recognition. In: IEEE International Conference on Computer Vision, pp. 114–119 (2005)
21. Murase, H., Nayar, S.K.: Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision 14(1), 5–24 (1995)
22. Graham, D., Allinson, N.: Face recognition from unfamiliar views: subspace methods and pose dependency. In: IEEE International Conference on Automatic Face and Gesture Recognition (1998)
23. Lee, K.C., Ho, J., Yang, M.H., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition (2003)
24. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(5500), 2319–2323 (2000)
25. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500), 2323–2326 (2000)
26. Zhang, Z., Wang, J.: MLLE: Modified Locally Linear Embedding Using Multiple Weights. In: Neural Information Processing Systems, NIPS (2006)
27. Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Transactions on Acoustics, Speech, and Signal Processing, 3397–3415 (1993)
Estimating Human Pose from Occluded Images

Jia-Bin Huang and Ming-Hsuan Yang
Electrical Engineering and Computer Science, University of California at Merced
{jbhuang,mhyang}@ieee.org
Abstract. We address the problem of recovering 3D human pose from single 2D images, in which the pose estimation problem is formulated as a direct nonlinear regression from image observations to 3D joint positions. One key issue that has not been addressed in the literature is how to estimate 3D pose when humans in the scene are partially or heavily occluded. When occlusions occur, features extracted from the image observations (e.g., silhouette-based shape features, histograms of oriented gradients, etc.) are seriously corrupted, and consequently the regressor (trained on un-occluded images) is unable to estimate pose states correctly. In this paper, we present a method that is capable of handling occlusions using sparse signal representations, in which each test sample is represented as a compact linear combination of training samples. The sparsest solution can then be obtained efficiently by solving a convex optimization problem with certain norms (such as the l1-norm). The corrupted test image can be recovered with a sparse linear combination of un-occluded training images, which can then be used for estimating the human pose correctly (as if no occlusions existed). We also show that the proposed approach implicitly performs relevant feature selection with un-occluded test images. Experimental results on synthetic and real data sets bear out our theory that, with sparse representation, 3D human pose can be robustly estimated when humans are partially or heavily occluded in the scenes.
1 Introduction

Estimating 3D articulated human pose from a single view is of great interest to numerous vision applications, including human-computer interaction, visual surveillance, activity recognition from images, and video indexing as well as retrieval. Notwithstanding some demonstrated success in the literature, this problem remains very challenging for several reasons. First, recovering 3D human poses directly from 2D images is inherently ambiguous due to the loss of depth information. This problem is alleviated with additional information such as temporal correlation obtained from tracking, dynamics of human motion and prior knowledge, or multiple interpretations conditioned on partial image observations. In addition, the shape and appearance of the articulated human body vary significantly due to factors such as clothing, lighting conditions, viewpoints, and poses. The variation of background scenes also makes pose estimation more difficult. Therefore, designing image representations that are invariant to these factors is critical for effective and robust pose estimation.
Human pose estimation algorithms can be categorized as generative (model-based) and discriminative (model-free). Generative methods employ a known model (e.g., a tree structure) based on prior knowledge [1]. The pose estimation process includes two parts: 1) modeling: constructing the likelihood function, and 2) estimation: predicting the most likely hidden poses based on the image observations and the likelihood function. However, it is difficult to account for factors such as camera viewpoint, image representation, and occlusion in the likelihood functions. Furthermore, these functions are computationally expensive to evaluate, which makes them unsuitable for inferring the hidden poses. In contrast, discriminative methods do not assume a particular human body model, and they can be further categorized as example-based [2] and learning-based [3,4,5]. Example-based approaches store a set of training samples along with their corresponding pose descriptors. For a given test image, a similarity search is performed to find similar candidates in the training set, and the estimated pose is then obtained by interpolating from their poses [2]. On the other hand, learning-based approaches learn a direct mapping from image observations to the pose space using training samples [3,4,5]. While generative methods can infer poses with better precision than discriminative ones, discriminative approaches have the advantage in execution time. Several image representations have been proposed for discriminative pose estimation, such as shape context of silhouettes [6], signed-distance functions on silhouettes [7], binary principal component analysis of appearance [8], and mixtures of probabilistic principal component analysis on multi-view silhouettes [2]. However, silhouettes are inherently ambiguous, as different 3D poses can have very similar silhouettes. In addition, clean silhouettes can be extracted only with robust background subtraction, which is not applicable in many real-world scenarios (e.g., videos with camera motion, dynamic backgrounds, sudden illumination changes, etc.). To cope with this problem, appearance features such as block SIFT descriptors [9], Haar-like features [10], histograms of oriented gradients (HOG) [6,11,12], and bag-of-visual-words representations [13] have been proposed for pose estimation. These descriptors contain richer information than silhouette-based features, but they inevitably encode irrelevant background clutter into the feature vector. These unrelated feature dimensions may have cumulative negative effects on learning the image-to-pose mapping and thereby increase errors in pose estimation. Agarwal et al. [6] deal with this problem by using non-negative matrix factorization to suppress irrelevant background features, thereby obtaining the most relevant HOG features. In [10], relevant features are selected from a predefined set of Haar-like features through multi-dimensional boosting regression. Okada and Soatto [12] observed that the components related to human pose in a feature vector are pose dependent. Thus, they first extract pose clusters using a kernel support vector machine and then train one local linear regressor for each cluster with features selected from the cluster. Another important issue that has not been explicitly addressed in the literature is how to robustly estimate 3D pose when humans in the scenes are partially or heavily occluded.
When parts of a human body are occluded, the extracted descriptors from image observation (e.g., shape features from silhouettes, block SIFT, HOG, or part-based features, etc.) are seriously corrupted. The learned regressor, induced from
un-occluded images, is not able to estimate pose parameters correctly when a human is occluded in an image. While using a tracking algorithm or a human motion prior may alleviate this problem, an effective approach is needed to handle occlusion explicitly. In this paper, we show that we are able to deal with such problems using sparse image representations, in which each test sample can be represented as a compact linear combination of training samples, and the sparsest solution can be obtained by solving a convex optimization problem with certain norms (such as the l1-norm). Within this formulation, the corrupted test image can be recovered with a linear combination of un-occluded training images, which can then be used for estimating the human pose correctly (as if no occlusions existed). The proposed algorithm exploits the advantages of both example-based and learning-based algorithms for pose estimation. In our algorithm, when we represent a given image as a linear combination of training samples and obtain a sparse solution, we are actually searching for a small number of candidates in the training data set that best synthesize the test sample. This is similar to the idea of example-based approaches, which perform efficient nearest neighbor search, but we use a more compact representation that has proven effective in dealing with noise. We then learn a mapping between the compact representation and the corresponding pose space using regression functions. The major difference between the sparse image representation and the example-based approach (nearest neighbor search) is that we consider all possible supports and adaptively select the minimal number of training samples required for representing each test sample. Hence, with the recovered test sample, we can estimate 3D human pose when humans in the scenes are partially or heavily occluded. Moreover, by using sparse representations we can implicitly perform relevant feature selection. When representing each test sample as a compact linear combination of training samples, the mismatched components are treated as part of the reconstruction error and discarded directly. Intuitively, we are replacing the background clutter in the test samples with backgrounds in the training images. In this way, we achieve pose-dependent feature selection without making any approximation (like clustering poses in [12] or bag-of-visual-words in [13]) and without increasing the complexity of the learning-based algorithms. The contributions of this paper can be summarized in two main aspects. First, we propose an algorithm to handle occlusion in estimating 3D human pose by representing each test sample as a sparse linear combination of training samples. The prediction errors are significantly reduced by using the reconstructed test samples instead of the original ones when humans in the images are occluded. Second, we achieve pose-dependent feature selection by solving for the sparse solution with a reconstruction error term. Our approach improves over learning-based algorithms without feature selection. The remainder of this paper is organized as follows. Section 2 describes related works on human pose estimation. In Section 3, we introduce the proposed image representation scheme. We test our approach on both a synthesized data set (INRIA) and a real data set (HumanEva I) to demonstrate the ability to handle occlusion and feature selection in Section 4. We conclude this paper with comments on future work in Section 5.
2 Related Work

Due to its scope and potential applications, there has been a substantial amount of work on the general problem of human motion capture and understanding. As such, we find it useful to place the focus of our work within the taxonomy proposed by Moeslund and Granum [14], whereby the field is presented in the categories of person detection, tracking, pose estimation, and recognition. Our approach fits best into the category of pose estimation, where the goal is to accurately estimate the positions of the body parts. More specifically, our approach is to estimate 3D pose from a single image without the use of temporal information. We will focus on previous work with a similar goal and leave interested readers to consult one of the surveys for a more complete listing of work in this general area [14,15]. Previous approaches to human pose estimation from a single image can be broadly categorized as model-based or model-free. In model-based approaches, a parametric model that captures the kinematics of the human body is explicitly defined. This model can be used in a predict-match-update paradigm in which maximal agreement between the model and the image measurements is sought. One method for this is to simultaneously detect body parts and assemble them in a bottom-up manner. Pictorial structures [16] provide a convenient discrete graphical form for this that can be adapted to people using an efficient dynamic programming minimization proposed by Felzenszwalb and Huttenlocher [17] and later used in various forms by a number of researchers [18,19,20]. Mori et al. followed a similar line of thought, but employed “superpixels” for the task of segmenting and detecting body parts [21]. Sigal et al. presented a bottom-up approach in a continuous parameter space using a modified particle filter for the minimization [1]. In contrast, Taylor developed a method to invert a kinematic model given an accurate labeling of joint coordinates, which provides reconstruction up to a scale ambiguity [22]. This method was combined with shape-context matching in a fully automatic system by Mori and Malik [23]. Model-free approaches, which include regression- and example-based methods, take a top-down approach to this problem and attempt to recover a mapping from image feature space to pose parameter space directly. An early approach of this type represented the 3D pose space as a manifold that could be approximated by hidden Markov models [24]. Agarwal and Triggs advocated the relevance vector machine (RVM) [25] to learn this mapping, where silhouette boundary points were used as features [26]. Rosales and Sclaroff used specialized maps in addition to an inverse rendering process to learn this mapping [27]. Along a different line, Shakhnarovich et al. do not learn a regression function, but instead directly make use of training examples in a lookup table using efficient hashing [28]. The feature space used by these types of methods, with few exceptions, is global in the sense that the features carry no information about the body region they describe. This provides a clean top-down approach that circumvents any need to implement part detectors. One exception is recent work by Agarwal and Triggs, where the goal is pose estimation in cluttered environments, which localizes features with respect to the window of interest [6].
Our approach uses a regression model to learn the mapping from image feature space to pose space, but differs from previous work in that sparse representations learned from examples give it a demonstrated ability to handle occlusions.
3 Image Representation

We represent each input image observation as x ∈ IR^m and the output 3D human pose vector as y ∈ IR^k. Given a training set of N labeled examples {(xi, yi) | i = 1, 2, ..., N}, the goal of a typical learning-based approach to human pose estimation is to learn a smooth mapping function that generalizes well to an unseen image observation b in the test set. As mentioned in Section 1, straightforward appearance features inevitably encode unwanted background information in x, which may introduce significant errors when estimating pose from the test samples, since the background clutter may be quite different. The performance of the learned mapping function will also be seriously degraded if humans are occluded in the images, because part of the feature dimensions are corrupted. To address these two problems, we present a formulation in which the occluded or irrelevant parts of a test sample can be recovered by solving convex optimization problems.

3.1 Test Image as a Linear Combination of Training Images

Given a sufficient number of training samples, we model a test sample b as a linear combination of the N training samples:

    b = ω1 x1 + ω2 x2 + · · · + ωN xN,    (1)

where ωi, i ∈ {1, 2, . . . , N}, are scalar coefficients denoting the weight of the i-th training sample in synthesizing the test sample b. By arranging the N training samples as columns of a matrix A = [x1, x2, · · · , xN] ∈ IR^{m×N}, the linear representation of b can be written compactly as

    b = Aω,    (2)

where ω = [ω1, ω2, . . . , ωN]^T is the coefficient vector. With this formulation, each test sample b can be represented by the coefficient vector ω obtained by solving the linear system of equations b = Aω. If the dimension of the image observation m is larger than the number of training samples N, a unique solution for ω can usually be obtained by solving the overdetermined system. However, with data noise or if N > m, the solution is not unique. Conventionally, an approximate solution can be found by seeking the minimum l2-norm solution:

    min ||ω||_2  subject to  Aω = b.    (3)
For the system Aω = b, the minimum l2-norm solution is given by ω̂2 = A^T (A A^T)^{−1} b. However, this minimum-energy solution ω̂2 is usually dense (with many nonzero entries), thereby losing the discriminative ability to select the most relevant training samples to represent the test one. As the pose parameters of an articulated human body reside in a high-dimensional space, the resulting pose variations are large and diverse. It is therefore reasonable to assume that only a very small portion of the training samples is needed to synthesize a test sample (i.e., only a few nonzero terms in the solution ω̂ of Aω = b). This is especially true when the training set contains a large number of examples that densely cover the pose space.
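To make this concrete, here is a minimal sketch (our own illustration, not code from the paper; the dimensions, random data, and variable names are invented) showing that the minimum l2-norm solution of an underdetermined system is typically dense even when the test sample is an exact combination of only a few training samples:

```python
import numpy as np

# Illustrative only: dimensions, data, and names are invented.
m, N = 50, 200                            # feature dimension < number of training samples
rng = np.random.default_rng(0)
A = rng.standard_normal((m, N))           # columns play the role of training samples
w_true = np.zeros(N)
w_true[rng.choice(N, size=5, replace=False)] = rng.standard_normal(5)
b = A @ w_true                            # "test sample" built from only 5 training samples

# Minimum l2-norm solution of the underdetermined system A w = b
w_l2 = np.linalg.pinv(A) @ b
print("nonzero entries in the true coefficients:", np.count_nonzero(w_true))
print("entries of the l2 solution above 1e-6  :", np.count_nonzero(np.abs(w_l2) > 1e-6))
```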
3.2 Finding Sparse Solutions via l1-Norm Minimization

To find the sparsest solution to Aω = b, we can solve the optimization problem in (2) with the l0-norm:

    min ||ω||_0  subject to  Aω = b,    (4)

where the l0-norm counts the nonzero entries in the vector ω. However, solving the l0-norm minimization problem is both numerically unstable and NP-hard (no polynomial-time solution exists). Recent theories from compressive sensing [29,30,31,32] suggest that if the solution ω is sparse enough, then the sparsest solution can be exactly recovered via l1-norm optimization:

    min ||ω||_1  subject to  Aω = b,    (5)

where the l1-norm sums up the absolute values of all entries in ω (i.e., ||ω||_1 := Σ_i |ωi|, where ωi stands for the i-th entry of the vector). This is a convex optimization problem that can be solved by linear programming methods (e.g., a generic path-following primal-dual algorithm) [33]; it is also known as basis pursuit [34].

3.3 Coping with Background Clutter and Occlusion

Although a sparse solution for the coefficients ω can be obtained by solving the l1 optimization in (5), in the context of human pose estimation we may not find the sparsest solution ω̂1 that well explains the similarity between the test sample b and the training samples A. Several factors explain this. First, the background clutter may be quite different between training and testing samples, so there are inevitable reconstruction errors when representing the test sample by training samples. For example, even if the test sample contains exactly the same pose as one of the training samples, the background could be quite different, causing a reconstruction error. Second, when humans in the test images are occluded, a linear combination of training samples may not be able to synthesize the occluded parts. Third, if we use dense holistic appearance features such as HOG or block SIFT, there may be misalignments within the detected image regions. To account for these errors, we introduce an error term e and modify (2) as

    b = Aω + e = [A I] [ω; e] = Bv,    (6)

where B = [A I] ∈ IR^{m×(N+m)} and v = [ω; e] ∈ IR^{N+m} stacks the coefficients and the error. If the vector v is sparse enough, the sparsest representation can be obtained by solving the extended l1-norm minimization problem:

    min ||v||_1  subject to  Bv = b.    (7)

In this way, the first N entries of the vector v obtained from solving (7) correspond to the coefficients of the training samples that best represent the test sample with the minimum number of nonzero entries, while the remaining m entries account for those factors (occlusion, misalignment, and background clutter) that cannot be well explained by the training samples.
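As an illustration of how (7) might be solved in practice, the following sketch uses the cvxpy modeling library as a stand-in for the linear-programming/basis-pursuit solvers cited above; the function name and the choice of solver are our assumptions, not the authors' implementation:

```python
import numpy as np
import cvxpy as cp

def recover_sparse(A, b):
    """Sketch of the extended l1 problem (7): min ||v||_1 s.t. [A I] v = b.
    Returns the recovered feature A w, the coefficients w, and the error e.
    This is an illustrative implementation, not the authors' solver."""
    m, N = A.shape
    B = np.hstack([A, np.eye(m)])                 # augmented dictionary B = [A I]
    v = cp.Variable(N + m)
    problem = cp.Problem(cp.Minimize(cp.norm(v, 1)), [B @ v == b])
    problem.solve()
    w, e = v.value[:N], v.value[N:]
    return A @ w, w, e
```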
Fig. 1. Occlusion recovery on a synthetic data set. (a)(b) The original input image and its feature vector. (c) Corrupted feature obtained by adding a random block. (d) Recovered feature obtained by finding the sparsest solution of (7). (e) Reconstruction error.
Fig. 2. Feature selection example. (a) Original test image. (b) The HOG feature descriptor computed from (a). (c) Recovered feature vector by our algorithm. (d) The reconstruction error.
We validate the recovery ability of our approach using a synthetic data set [26] in which 1927 silhouette images are used for training and 418 images for testing. These images are first manually cropped and aligned to 128 × 64 pixels. For efficiency, we further downsample these images by a factor of 4 and add random blocks to simulate occluded silhouettes. Fig. 1 shows that we can recover the corrupted test feature (c) as (d). The reconstructed feature vector (d) can then be used for regressing the output 3D joint angle vector. We also demonstrate in Fig. 2 that our algorithm, as a result of using sparse representations, is able to perform feature selection implicitly by discarding irrelevant background information in the feature vectors. Fig. 2 shows the original test image, the corresponding HOG feature vector, the recovered feature vector, and the reconstruction errors of our sparse representation (from (a) to (d)). Note that most of the reconstruction errors appear at locations corresponding to background clutter, validating our claim that the proposed sparse representation is able to filter out irrelevant noise.
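A rough sketch of this synthetic corruption test is given below; it assumes the recover_sparse helper from the previous sketch, and the block size, fill value, and 32×16 feature resolution (128×64 silhouettes downsampled by 4) are illustrative choices rather than the exact protocol used in the paper:

```python
import numpy as np

def corrupt_with_block(feat_img, block_frac=0.3, rng=None):
    """Add one random occluding block to a downsampled silhouette (e.g. 32x16,
    i.e. 128x64 downsampled by 4). Block size and fill value are illustrative."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = feat_img.shape
    bh, bw = max(1, int(h * block_frac)), max(1, int(w * block_frac))
    y = rng.integers(0, h - bh + 1)
    x = rng.integers(0, w - bw + 1)
    corrupted = feat_img.copy()
    corrupted[y:y + bh, x:x + bw] = 1.0               # occluding block
    return corrupted

# Usage with the recover_sparse sketch above (A: training silhouettes as columns):
# b = corrupt_with_block(test_silhouette).ravel()
# recovered, w, e = recover_sparse(A, b)             # recovered ~ un-occluded feature
```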
4 Experimental Results

We test the proposed algorithm on synthetic [26] and real [4] data sets for empirical validation. In all experiments, we use a Gaussian process regressor [35] to learn the mapping between image features and the corresponding 3D pose parameters. We first
demonstrate that the proposed method is able to estimate human pose from images with occlusions. We then show that, even without occlusions, our algorithm still outperforms the baseline methods as a result of the implicit feature selection within our formulation.

4.1 Robustness to Occlusion

We use the synthetic data set in [26] to show that the proposed algorithm is able to recover un-occluded silhouettes from occluded ones. We add random blocks (with widths corresponding to the corruption level (CL)) to all test samples to synthesize occluded image silhouettes (see Fig. 3 for sample test images under various corruption levels). We use two different feature representations in our experiments. The first is principal component analysis (PCA), where each test image is represented by its first 20 principal component coefficients. The second is based on the image appearance (i.e., pixel values) of the downsampled images. Fig. 4 shows the average errors in joint angles (degrees) for three experimental settings: 1) features extracted from the original test images (baseline), 2) features computed from the corrupted images (see Fig. 3), and 3) features recovered using the proposed algorithm. First, we note that in both the PCA and appearance settings, the proposed algorithm improves the accuracy of pose estimation under occlusion. We also observe that our method with appearance features (i.e., downsampled images) performs better than with holistic features (i.e., PCA). This can be explained by the fact that holistic PCA is known to be sensitive to outliers: when a silhouette is occluded, the PCA coefficients computed from the occluded image are likely to be very different from those of the un-occluded one. In contrast, only a small number of pixels of the occluded images are changed or corrupted, thereby facilitating the recovery of the un-occluded images. These results suggest that sparse and localized feature representations are suitable for pose estimation from occluded images.

To further gauge the performance of the proposed method, we use the synchronized image and motion capture data from the HumanEva data sets [4]. The HumanEva I data set consists of 4 views of 4 subjects performing a set of 6 predefined actions (walking, jogging, gesturing, throwing/catching, boxing, combo) 3 times. For efficiency and performance analysis, we chose the common walking sequences of subjects S1, S2, and S3 for our experiments. Since we are dealing with pose estimation from
Fig. 3. Sample test images under various corruption levels (CL = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6) in the synthetic data set. The occlusions seriously corrupt the shape of the silhouette images.
Fig. 4. Average error of pose estimation on the synthetic data set using different features: (a) principal component analysis with 20 coefficients; (b) downsampled (20×20) images.
a single view, we use the images (2950 frames) taken from the first camera (C1). The original HumanEva data set is partitioned into training, validation, and test subsets (where the test subset is held out by [4]). For each subject, we use a subset of the training set to train a Gaussian process regressor [35] and test on a subset of the original validation set where both the images and motion capture data are available. As there are no occluded cases in the original HumanEva data set, we randomly generate two occluding blocks in the test images at various corruption levels to synthesize images with occlusions. The center locations of these blocks are chosen uniformly at random within the image, and the block widths are correlated with the corruption level. The aspect ratio of each block is sampled from a uniform distribution between 0 and 1. In Fig. 5, we show sample images taken from the walking sequences of the three subjects with corruption levels ranging from 0.1 to 0.6. Although human vision can still infer the underlying poses under such occlusion, these test images are difficult for pose estimation algorithms to handle due to the heavy occlusions.
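For concreteness, a Gaussian process regressor of the kind used here could be trained with scikit-learn as sketched below; the library, kernel choice, and hyperparameters are our assumptions and are not specified in the paper:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def train_pose_regressor(X_train, Y_train):
    """X_train: (n_frames, n_feature_dims); Y_train: (n_frames, n_pose_dims).
    Kernel and hyperparameters are illustrative assumptions."""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_train, Y_train)          # multi-output regression over pose parameters
    return gp

# pose_pred = train_pose_regressor(X_train, Y_train).predict(X_test)
```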
Fig. 5. Synthesized images with occlusions from the HumanEva I data set (walking sequences), for corruption levels CL = 0.1 to 0.6. Top row: Subject 1; second row: Subject 2; third row: Subject 3. Each corrupted test image contains two randomly generated blocks whose widths equal the corruption level (CL) times the original image width and whose centers are sampled uniformly at random within the image. Each column shows samples at one corruption level.
Fig. 6. Results of pose estimation on the HumanEva I data set (walking sequences). (a) Subject 1. (b) Subject 2. (c) Subject 3. The plots show the 3D mean errors of relative joint positions in millimeters (mm) under various corruption levels (from 0.06 to 0.6). The blue lines indicate the results on the original test samples, whose errors are therefore independent of the corruption level. The green curves show the results on the corrupted test samples at different corruption levels, and the red curves show the results on test samples recovered using the sparse representation.
We use histograms of oriented gradients (HOG) as feature vectors to represent the training and test images. In our experiments, we compute gradient orientations in [0, π] (unsigned) and construct histograms with 6 bins per cell. We use 10×10 pixels per cell, 3×3 cells per block, and uniformly place 3×3 blocks that overlap neighboring blocks by one cell. Thus, for each image window we obtain a 486-dimensional feature vector. We then learn the mapping function between the feature vectors and their corresponding pose parameters. We carry out a series of experiments with three different settings: 1) HOG feature vectors computed from the original test images without synthetically generated occluding blocks, 2) corrupted HOG feature vectors computed from the occluded images (see Fig. 5), and 3) test feature vectors recovered by solving the extended l1-norm minimization problem (7). In the third setting, after solving (7), we discard the reconstruction error vector e and use Aω as the recovered feature vector. The feature vectors obtained in all three settings are used to regress the pose vector with a Gaussian process regressor. Fig. 6 presents the mean errors of relative joint positions on the test subset of the HumanEva data set under various corruption levels (from 0.06 to 0.6) for the three settings, in terms of joint position error in millimeters. For all three subjects, our approach is able to recover the un-occluded images from the occluded ones and then the pose parameters. It is also worth noting that our algorithm often outperforms the baseline (trained and tested on un-occluded images). This can be explained by the fact that our algorithm implicitly performs feature selection, whereas the performance of the baseline is inevitably affected by noise contained in the training data.

4.2 Robustness to Background Clutter

In this section, we show that the proposed method is able to select relevant features. We use the same 486-dimensional HOG feature vectors to describe the image observations. We compare two settings: 1) HOG features computed from the original test
Fig. 7. Mean 3D error plots for the walking sequences (S2). The blue line indicates the errors obtained using the original test samples. The green line represents the errors predicted from the feature vectors recovered by the proposed algorithm. The results are comparable to or better than those from the original test samples, thanks to the ability to select relevant feature entries.
image sequences, and 2) features extracted from our sparse representation. The mean 3D joint position errors (mm) for each frame are plotted in Fig. 7 for the whole test set. The blue and green error curves correspond to the results using the original HOG feature vectors and the ones extracted by our method, respectively. The reductions in mean position error (mm) achieved by our method are 4.89, 10.84, and 7.87 for S1, S2, and S3, respectively.
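The HOG configuration described in Section 4.1 (6 unsigned orientation bins, 10×10-pixel cells, 3×3 cells per block, 3×3 blocks) can be approximated with scikit-image as sketched below; note that scikit-image slides blocks with a one-cell stride, so a 50×50 window yields a 486-dimensional vector, but the exact block placement may differ from the authors' implementation:

```python
from skimage.feature import hog

def hog_feature(window):
    """window: a 50x50 grayscale image patch; returns a 486-dimensional vector
    under the settings below (3x3 blocks of 3x3 cells, 6 bins each)."""
    return hog(window,
               orientations=6,
               pixels_per_cell=(10, 10),
               cells_per_block=(3, 3),
               block_norm='L2-Hys',
               feature_vector=True)
```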
5 Conclusion

In this paper, we have presented a method capable of recovering 3D human pose from monocular images when a person is partially or heavily occluded in the scene. By representing a test image as a sparse linear combination of training images, the proposed method recovers the coefficient vector from the corrupted test image with minimum error by solving an l1-norm minimization problem, and therefore obtains robust pose estimation results. In addition, our algorithm improves pose estimation accuracy even on images without occlusions by implicitly selecting relevant features and discarding unwanted noise from background clutter. Our future work includes more experiments with real image data where synchronized ground-truth pose parameters and occluded images are available. We also plan to extend our sparse representation algorithm to the temporal domain, making use of motion dynamics to further help disambiguate different poses with similar image observations.
References

1. Sigal, L., Isard, M., Sigelman, B., Black, M.: Attractive people: Assembling loose-limbed models using non-parametric belief propagation. In: NIPS, pp. 1539–1546 (2004) 2. Grauman, K., Shakhnarovich, G., Darrell, T.: Inferring 3d structure with a statistical imagebased shape model. In: ICCV, pp. 641–647 (2003)
3. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3d human motion estimation. In: CVPR, pp. 390–397 (2005) 4. Sigal, L., Black, M.: Predicting 3d people from 2d pictures. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2006. LNCS, vol. 4069, pp. 185–195. Springer, Heidelberg (2006) 5. Bo, L., Sminchisescu, C., Kanaujia, A., Metaxas, D.: Fast algorithms for large scale conditional 3d prediction. In: CVPR (2008) 6. Agarwal, A., Triggs, B.: A local basis representation for estimating human pose from cluttered images. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 50–59. Springer, Heidelberg (2006) 7. Elgammal, A., Lee, C.: Inferring 3d body pose from silhouettes using activity manifold learning. In: CVPR, vol. 2, pp. 681–688 (2004) 8. Jaeggli, T., Koller-Meier, E., Gool, L.V.: Learning generative models for multi-activity body pose estimation. IJCV 83(2), 121–134 (2009) 9. Sminchisescu, C., Kanaujia, A., Metaxas, D.: Bm3 e: Discriminative density propagation for visual tracking. PAMI 29(11), 2030–2044 (2007) 10. Bissacco, A., Yang, M.H., Soatto, S.: Fast human pose estimation using appearance and motion via multi-dimensional boosting regression. In: CVPR, pp. 1–8 (2007) 11. Poppe, R.: Evaluating example-based pose estimation: experiments on the HumanEva sets. In: IEEE Workshop on Evaluation of Articulated Human Motion and Pose Estimation (2007) 12. Okada, R., Soatto, S.: Relevant Feature Selection for Human Pose Estimation and Localization in Cluttered Images. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 434–445. Springer, Heidelberg (2008) 13. Ning, H., Xu, W., Gong, Y., Huang, T.: Discriminative learning of visual words for 3d human pose estimation. In: CVPR (2008) 14. Moeslund, T., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81(3), 231–268 (2001) 15. Gavrila, D.: The visual analysis of human movement: A survey. Computer Vision and Image Understanding 73(1), 82–98 (1999) 16. Fischler, M.A., Elschlager, R.A.: The representation and matching of pictorial structures. IEEE Transactions on Computers 22(1), 67–92 (1973) 17. Felzenszwalb, P., Huttenlocher, D.: Efficient matching of pictorial structures. In: CVPR, vol. 2, pp. 2066–2073 (2000) 18. Ronfard, R., Schmid, C., Triggs, B.: Learning to parse pictures of people. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353, pp. 700– 714. Springer, Heidelberg (2002) 19. Ioffe, S., Forsyth, D.: Probabilistic methods for finding people. IJCV 43(1), 45–68 (2001) 20. Ramanan, D., Forsyth, D.: Finding and tracking people from the bottom up. In: CVPR, vol. 2, pp. 467–474 (2003) 21. Mori, G., Ren, X., Efros, A., Malik, J.: Recovering human body configurations: Combining segmentation and recognition. In: CVPR, vol. 2, pp. 326–333 (2004) 22. Taylor, C.J.: Reconstruction of articulated objects from point correspondence using a single uncalibrated image. In: CVPR, vol. 1, pp. 667–684 (2000) 23. Mori, G., Malik, J.: Estimating human body configurations using shape context matching. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part III. LNCS, vol. 2352, pp. 666–680. Springer, Heidelberg (2002) 24. Brand, M.: Shadow puppetry. In: ICCV, pp. 1237–1244 (1999) 25. Tipping, M.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244 (2004) 26. 
Agarwal, A., Triggs, B.: Recovering 3d human pose from monocular images. PAMI 28(1), 44–58 (2006)
27. Rosales, R., Sclaroff, S.: Learning body pose via specialized maps. In: NIPS, pp. 1263–1270 (2001) 28. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parameter-sensitive hashing. In: ICCV, pp. 750–757 (2003) 29. Candes, E., Romberg, J., Tao, T.: Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52(2), 489–509 (2006) 30. Candes, E., Tao, T.: Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory 52(12), 5406–5425 (2006) 31. Donoho, D.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289– 1306 (2006) 32. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. PAMI 31(2), 210–227 (2009) 33. Boyd, S.P., Vandenberghe, L.: Convex optimization. Cambridge University Press, Cambridge (2004) 34. Chen, S., Donoho, D., Saunders, M.: Automatic decomposition by basis pursuit. SIAM Journal of Scientific Computation 20(1), 33–61 (1998) 35. Rasmussen, C.E., Williams, C.K.I.: Gaussian processes for machine learning. MIT Press, Cambridge (2006)
Head Pose Estimation Based on Manifold Embedding and Distance Metric Learning

Xiangyang Liu1,2, Hongtao Lu1, and Daqiang Zhang1

1 MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China {liuxy,htlu,zhangdq}@sjtu.edu.cn
2 College of Science, Hohai University, Nanjing, 210098, China
Abstract. In this paper, we propose an embedding method that seeks an optimal low-dimensional manifold describing the intrinsic pose variations and provides an identity-independent head pose estimator. To handle the appearance variations caused by identity, we use a learned Mahalanobis distance to select subjects with similar manifolds to construct the embedding. We then propose a new smooth and discriminative embedding method supervised by both pose and identity information. To estimate the pose of a new head image, we first find its k-nearest neighbors among different subjects, and then embed it into the manifold of those subjects to estimate the pose angle. An empirical study on standard databases demonstrates that the proposed method achieves high pose estimation accuracy.
1 Introduction
Head Pose Estimation (HPE) from face images or videos is a classical problem in computer vision [1]. It is an integral component of multi-view face recognition systems, human computer interfaces, and other human-centered computing applications [2]. Robust and identity-independent head pose estimation remains a challenging computer vision problem. Face images with varying pose angles are considered to lie on a smooth low-dimensional manifold in a high-dimensional feature space [3,4]; furthermore, the dimension of this manifold equals the degrees of freedom of head pose variation [5]. Dimensionality-reduction-based methods for head pose estimation seek a low-dimensional continuous manifold, and new images can then be embedded into this manifold to estimate the pose [1]. The fundamental issue is to model the geometric structure of the manifold and produce a faithful embedding for data projection [6]. The classical technique, Principal Component Analysis (PCA), has been used to find the subspace constructed by the primary components of training head images for head pose estimation [7,8]. Nevertheless, there is no guarantee that PCA obtains a subspace related to pose variations rather than to appearance variations [1]. The embedding can be directly
learned by manifold learning approaches, such as Locality Preserving Projections (LPP) [5] and Locally Embedded Analysis (LEA) [4]. To incorporate the pose labels that are usually available during the training phase, Balasubramanian et al. [9] presented a framework based on pose information to compute a biased neighborhood for each point in the feature space. Based on this framework, Wang et al. [10] used Local Fisher Discriminant Analysis (LFDA) [11] to eliminate the variations of illumination in the embedding space. Yan et al. [12] proposed a synchronized manifold embedding method that minimizes the distances between each sample and its nearest reconstructed neighbors while maximizing the distances between different sample pairs. These methods have all demonstrated their effectiveness for head pose estimation. However, extracting effective pose features for the low-dimensional manifold while ignoring appearance variations such as changes in identity, scale, and illumination [10] remains challenging due to the nonlinearity and high dimensionality of the data. The focus of this paper is to seek an optimal low-dimensional manifold embedding describing the intrinsic pose variations and to provide an identity-independent pose estimator. The changes in pose images due to identity are usually larger than those caused by similar but different poses, as shown in Fig. 1. Thus, it is difficult to obtain an identity-independent manifold embedding that preserves the pose differences [13]. In this paper, we propose a manifold embedding method based on Distance Metric Learning (DML) for head pose estimation. We first learn a Mahalanobis distance metric to bring the images of subjects with similar manifolds closer together. Then, we use the learned Mahalanobis distance to select subjects for constructing an optimal embedding, and we propose a new smooth and discriminative embedding method supervised by both pose and identity information. To seek the optimal projection, we minimize the distances between different subjects with the same pose and maximize the distances between different poses of the same subject. The learned manifold, with a unique geometric structure, is smooth and discriminative. The proposed method aims to provide better intra-class compactness and inter-class separability in the low-dimensional pose space than traditional methods: the embeddings of different poses are kept apart, while the embeddings of different subjects with the same pose are close to each other. For a new image, we first find its k-nearest
Fig. 1. Head pose images with pose angles +30° and +35° from the FacePix database [14]. (Note the large appearance variations caused by identity and the small variations caused by pose.)
neighbors among different subjects, and then embed it into the manifold of those subjects to estimate the pose angle. The remainder of this paper is organized as follows. Section 2 introduces our framework and details the mathematical formulation of the proposed manifold embedding method. Section 3 presents the experimental results and discussion. We conclude our work in Section 4.
2 Pose Estimation Using DML and Manifold Embedding
Assume that the training data are X = [x_1^1, x_2^1, · · · , x_P^1, · · · , x_1^S, x_2^S, · · · , x_P^S] ∈ IR^{M×N}, where x_p^s ∈ IR^M, s = 1, 2, · · · , S, p = 1, 2, · · · , P, S is the number of subjects, P is the number of poses for a subject α_s, and there are N = S × P samples in total. The pose angle of the sample x_p^s is denoted by β_p. We aim to seek a discriminative embedding that maps the original M-dimensional image space into an m-dimensional feature space with m ≪ M.

2.1 Motivations
The change in pose images due to identity is usually larger than that caused by different poses of the same subject. Thus, for head pose estimation, it is crucial to obtain an identity-independent manifold embedding that preserves the pose differences. Head images can be preprocessed by a Laplacian of Gaussian (LoG) filter to capture the edge map that is directly related to pose variations [9]. In addition, identity and pose information can be used to remove individual redundancy from the pose data in the embedding process [12]. Our proposed method is motivated by two observations: (1) The appearance variations caused by identity lead to translation, rotation, and warp changes of a subject's embedding [13]. Two subjects with similar individual appearance lie almost on the same continuous manifold under Locally Linear Embedding (LLE) [15], as shown in Fig. 2-(a); otherwise, as Fig. 2-(b) shows, the embeddings of two subjects with dissimilar individual appearance may not be close. (2) The changes in pose images due to identity affect the smoothness and discriminability of the learned embedding manifold. We embed 30 subjects in a
Fig. 2. The 3-dimensional embedding of the pose data using LLE (k = 5, pose variations between [−90°, +90°] at 1° intervals). (a) Two subjects having similar manifolds. (b) Two subjects having dissimilar manifolds.
Fig. 3. The 3-dimensional embeddings of the pose data from 30 subjects with pose variations between [−75°, +75°] at 2° intervals. (a) Using ISOMAP without identity information (k = 200). (b) Using LPP with identity information.
3-dimensional manifold using ISOMAP [3] without identity information and using LPP [5] with identity information; the results are shown in Fig. 3-(a) and (b), respectively. We can see that the pose data do not lie on a smooth continuous manifold, which results in non-robust pose estimation. To take into account the effect of the appearance variations between different subjects, we first learn a Mahalanobis distance metric that brings the images of subjects with similar manifolds closer together. Then, we use the learned Mahalanobis distance to select optimal subjects for constructing the embedding, and we propose a new smooth and discriminative embedding method supervised by both pose and identity information. To estimate the pose of a new head image, we first find its k-nearest neighbors among different subjects, and then embed it into the corresponding manifold of those subjects to estimate the pose angle.

2.2 Distance Metric Learning Using RCA
Relevant Component Analysis (RCA) [16] learns a full-rank Mahalanobis metric using side information in the form of chunklets (subsets of points that are known to belong to the same, although unknown, class). Compared with other distance learning methods, RCA maximizes the mutual information and is robust and efficient, since it only uses a closed-form expression of the data. Therefore, in our scheme, RCA is used to learn a Mahalanobis metric that captures the local structure of subjects with similar embeddings in the low-dimensional pose space. For each pair of subjects, we first compute the distance between the two subjects in the low-dimensional embedding obtained by ISOMAP (k = 100). Then, we construct the chunklets of pose images with the k-means algorithm. RCA computes the covariance matrix of all the centered data points in the chunklets as follows:

    Ĉ = (1/p) Σ_{j=1}^{k} Σ_{i=1}^{n_j} (x_{ji} − m̂_j)(x_{ji} − m̂_j)^T,    (1)

where chunklet j consists of images {x_{ji}}_{i=1}^{n_j} with mean m̂_j, and there are a total of p points in k chunklets. The inverse of Ĉ (M = Ĉ^{−1}) is used as a Mahalanobis distance D(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j).
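A minimal sketch of the RCA metric in (1) is given below; it is our own illustration (the use of a pseudo-inverse instead of a plain inverse is a robustness choice on our part, and chunklet construction via k-means is assumed to happen elsewhere):

```python
import numpy as np

def rca_metric(chunklets):
    """Sketch of Eq. (1): chunklets is a list of (n_j, M) arrays whose rows are
    images known to belong to the same (unknown) class. Returns the Mahalanobis
    matrix M = C^{-1} (pseudo-inverse used for numerical robustness)."""
    p = sum(len(c) for c in chunklets)                  # total number of points
    dim = chunklets[0].shape[1]
    C = np.zeros((dim, dim))
    for c in chunklets:
        centered = c - c.mean(axis=0)                   # subtract chunklet mean m_j
        C += centered.T @ centered
    C /= p
    return np.linalg.pinv(C)

def mahalanobis_dist(xi, xj, M):
    d = xi - xj
    return float(d @ M @ d)
```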
Later experiments show that the learned Mahalanobis distance can select optimal subjects to construct a smooth and discriminative embedding.

2.3 Manifold Embedding
An efficient low-dimensional embedding for pose estimation should have good intra-class compactness and inter-class separability in the low-dimensional pose subspace: the embeddings of different poses are kept apart, while the embeddings of different subjects with the same pose are close to each other. We therefore learn a manifold that attempts to satisfy the following property: the distance between images from different subjects with the same pose should be less than the distance between images from the same subject with similar poses. Thus, an ideal embedding method should satisfy

    ||y_p^s − y_p^{s′}|| < ||y_p^s − y_{p′}^s||,    (2)

where y_p^s is the low-dimensional embedding of x_p^s, s′ is an arbitrary subject, and p′ is the nearest neighboring pose of the pose p for the subject s.

Embedding Method. We seek a low-dimensional embedding that provides intra-class compactness and inter-class separability [13]. The projection is optimized to satisfy the following two criteria simultaneously:

(a) Intra-class compactness: for each pose, the projection minimizes the distances between the embeddings of different subjects, namely

    min Σ_p Σ_{i,j} ||y_p^i − y_p^j||^2,    (3)

where y_p^i and y_p^j are the embeddings of the head images x_p^i and x_p^j with pose angle β_p.

(b) Inter-class separability: for each subject, the projection maximizes the distances between the embeddings of different poses, namely

    max Σ_s Σ_{i,j} ||y_i^s − y_j^s||^2 T_{ij},    (4)

where T_{ij} is a penalty for poses i and j. We use a heavy penalty when poses i and j are close to each other, given by T_{ij} = exp(−||β_i − β_j||^2) / Σ_i exp(−||β_i − β_j||^2).

To combine (3) and (4), we minimize the following objective:

    J = [Σ_p Σ_{i,j} ||y_p^i − y_p^j||^2] / [Σ_s Σ_{i,j} ||y_i^s − y_j^s||^2 T_{ij}],    (5)

where J is the objective used to seek the embedding y_p^s of the head pose image x_p^s. The embedding process is illustrated in Fig. 4: the intrinsic embeddings from three subjects (denoted by circles) are shown in Fig. 4 (a). The optimization
Fig. 4. Illustration of the embedding method. (a) The intrinsic embeddings of 3 subjects (denoted by circles). (b) The optimized embeddings, which minimize the distances denoted by the solid lines and maximize the distances denoted by the dashed lines.
for the projection is to minimize the distances between different subjects with the same pose and to maximize the distances between different poses of the same subject. Fig. 4 (b) shows the corresponding embeddings, which minimize the distances denoted by the solid lines and maximize the distances denoted by the dashed lines. The objective of the embedding is to generate many pose clusters, each corresponding to a specific pose angle; within each pose cluster there are three embedded pose samples from the three subjects. In this paper, we employ a linear projection approach: the embedding is achieved by seeking a projection matrix W ∈ IR^{M×m} (m ≪ M) such that y_p^s = W^T x_p^s, where y_p^s ∈ IR^m is the low-dimensional embedding of x_p^s ∈ IR^M. Then, W is obtained by the following optimization:

    W = arg min_W [Σ_p Σ_{i,j} ||W^T x_p^i − W^T x_p^j||^2] / [Σ_s Σ_{i,j} ||W^T x_i^s − W^T x_j^s||^2 T_{ij}] = arg max_W Tr(W^T S_2 W) / Tr(W^T S_1 W),    (6)

where Tr(·) denotes the trace of a square matrix, and

    S_1 = Σ_p Σ_{i,j} (x_p^i − x_p^j)(x_p^i − x_p^j)^T,    S_2 = Σ_s Σ_{i,j} (x_i^s − x_j^s)(x_i^s − x_j^s)^T T_{ij}.    (7)
The objective function in (6) can be solved with the generalized eigenvalue decomposition method as

    S_2 W_i = λ_i S_1 W_i,    (8)

where the vector W_i is the eigenvector corresponding to the i-th largest eigenvalue λ_i, and it constitutes the i-th column vector of the matrix W [13].
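The following sketch illustrates one way to assemble S1 and S2 and solve the generalized eigenvalue problem (8) with SciPy; the data layout (samples as columns), the small regularizer added to S1, and all variable names are our assumptions rather than the authors' code:

```python
import numpy as np
from scipy.linalg import eigh

def learn_projection(X, subject_ids, pose_ids, pose_angles, m_dims, eps=1e-6):
    """Sketch of Eqs. (6)-(8). X: (M, N) data matrix with one image per column;
    subject_ids, pose_ids, pose_angles: length-N arrays. Returns W (M, m_dims)."""
    M = X.shape[0]
    S1 = np.zeros((M, M))
    S2 = np.zeros((M, M))
    for p in np.unique(pose_ids):                       # same pose, different subjects
        idx = np.where(pose_ids == p)[0]
        for i in idx:
            for j in idx:
                d = X[:, i] - X[:, j]
                S1 += np.outer(d, d)
    for s in np.unique(subject_ids):                    # same subject, different poses
        idx = np.where(subject_ids == s)[0]
        w = np.exp(-(pose_angles[idx, None] - pose_angles[None, idx]) ** 2)
        T = w / w.sum(axis=0, keepdims=True)            # penalty T_ij of Eq. (4)
        for a, i in enumerate(idx):
            for c, j in enumerate(idx):
                d = X[:, i] - X[:, j]
                S2 += T[a, c] * np.outer(d, d)
    # Generalized eigenproblem S2 w = lambda S1 w; eps keeps S1 positive definite
    evals, evecs = eigh(S2, S1 + eps * np.eye(M))
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:m_dims]]
```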
2.4 Pose Estimation
To estimate the pose of a new head image, we first find its k-nearest neighbors among different subjects, and then embed it into the corresponding manifold of those subjects. Finally, the new head image's pose β is estimated from the poses of its k-nearest neighbors in the low-dimensional manifold.
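A minimal sketch of this estimation step is shown below; the distance-weighted averaging of the neighbors' angles is our choice, since the paper does not specify how the k neighboring poses are combined:

```python
import numpy as np

def estimate_pose(x_new, W, Y_train, train_angles, k=5):
    """Project a new image with W, find its k nearest training embeddings,
    and combine their pose angles (distance-weighted average, our choice)."""
    y_new = W.T @ x_new                                  # low-dimensional embedding
    dists = np.linalg.norm(Y_train - y_new[None, :], axis=1)
    nn = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nn] + 1e-8)
    return float(np.sum(weights * train_angles[nn]) / np.sum(weights))
```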
3 Experiments and Results
We conducted several experiments to examine the performance of the proposed method and to address the following questions: how smooth and discriminative is the embedding, and what is the estimation accuracy under different dimensionalities and pose angles?

3.1 Data Sets and Experimental Setup
The proposed method was validated using the FacePix database [14], which contains 5430 head images spanning −90° to +90° in yaw at 1° intervals. We also collected 390 head pose images from the Pointing'04 database [17] for testing. The images were equalized and sub-sampled to 32×32 resolution, and the LoG image feature space was used in our experiments, as shown in Fig. 5.
Fig. 5. Head pose images and the corresponding LoG images
To evaluate the performance of our system, we use the Mean Absolute Error (MAE) [1], computed by averaging the absolute difference between the expected pose and the estimated pose over all images. We use half of the pose data to learn the Mahalanobis distance and the remaining data for embedding learning and testing. To test the generalization ability, we use a leave-one-out strategy [18] for identity-independent estimation, i.e., we use each subject in turn as the testing data and all the remaining subjects for embedding learning.
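A sketch of this leave-one-out protocol, reusing the learn_projection and estimate_pose helpers from the earlier sketches (both our own illustrations), might look as follows:

```python
import numpy as np

def leave_one_out_mae(X, subject_ids, pose_ids, pose_angles, m_dims=10, k=5):
    """Hold out one subject at a time, learn the embedding from the rest,
    estimate poses of the held-out subject, and average the absolute errors."""
    errors = []
    for s in np.unique(subject_ids):
        train = subject_ids != s
        W = learn_projection(X[:, train], subject_ids[train],
                             pose_ids[train], pose_angles[train], m_dims)
        Y_train = (W.T @ X[:, train]).T                 # training embeddings
        for idx in np.where(~train)[0]:
            est = estimate_pose(X[:, idx], W, Y_train, pose_angles[train], k)
            errors.append(abs(est - pose_angles[idx]))
    return float(np.mean(errors))
```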
3.2 Embedding Space
We apply the proposed method to the data sets mentioned above to show the embedding manifold of the pose data. Fig. 6 shows the 3-dimensional manifold embeddings. The manifold is constructed from the 4 nearest subjects with pose variations from [−75°, +75°] at 4° intervals. Compared with the results illustrated in
Fig. 6. Illustration of the smoothness of the embedding. The 3-dimensional embeddings are constructed from 4 subjects with pose variations between [−75°, +75°] at 4° intervals.
Fig. 3 by ISOMAP and LPP, our result is much smoother and more discriminative, providing intra-class compactness and inter-class separability in the low-dimensional pose space. Fig. 7 shows the distance differences between the image space and the embedding space for similar poses of the same subject and for different subjects with the same pose (we fix Subject 1 at pose 30° and locate the other points by their distances from it). We can see that, in the low-dimensional embedding space, the distance between images from different subjects with the same pose becomes smaller than the distance between images from the same subject with similar poses. This indicates that our proposed method provides better discriminability in the embedding space for pose estimation.
[Figure 7 values: distances from Subject 1 at pose 30° to Subject 1 at pose 35° and to Subject 2 at pose 30° are 5.71 and 14.99 in the image space, versus 1.30 and 13.6 in the embedding space.]
Fig. 7. Illustration of the discriminability of the embedding. It shows the distance difference for similar poses of the same subject and different subjects with the same pose.
3.3 Comparison of Our Method with Other Methods

We compare our method with other pose estimation methods: the global PCA-based method [7] and the local manifold-learning LPP method [5]. We implement our embedding method for pose estimation both with Distance Metric Learning (DML), i.e., using the learned Mahalanobis distance to select the k-nearest subjects, and without DML, i.e., using all the training subjects. Fig. 8 (a) shows the
Fig. 8. Comparison of our method against other methods. (a) The MAE at different embedding dimensionalities. (b) The MAE under different pose angles.
pose estimation results at different dimensionalities. Our proposed method improves the estimation performance compared to the other methods, and the MAE of pose estimation is further reduced by using DML. Fig. 8 (b) shows the MAE with pose variations from [−90°, +90°] at 1° intervals. The accuracy of our proposed method is still in general better than that of the other methods. We notice that the MAE curve of our method with DML is much flatter than those of the other methods within a relatively wide range around the frontal view, [−50°, +50°], which implies that our method is more robust in estimating head poses in [−50°, +50°]. Our method with DML achieves an average MAE of 4.18°, as shown in Table 1.

Table 1. The MAE of all subjects
Methods   PCA    LPP    Method without DML   Method with DML
MAE       5.32   4.96   4.45                 4.18

4 Conclusions
In this paper, we have presented an embedding method based on distance metric learning for head pose estimation. The method provides better intra-class compactness and inter-class separability in the low-dimensional pose subspace than traditional methods. For identity-independent head pose estimation, we achieved an MAE of about 4.18° on standard databases, and even lower MAE can be achieved on larger data sets. In future work, we will test other distance metric learning methods and plan to evaluate the proposed method in terms of feasibility for more complex real-world scenarios.
Acknowledgments This work was supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR), 863 Program of China (No. 2008AA02Z310), NSFC (No. 60873133) and 973 Program of China (No. 2009CB320900).
References

1. Murphy-Chutorian, E., Trivedi, M.: Head pose estimation in computer vision: a survey. IEEE Transactions on PAMI, 442–449 (2008) 2. Chai, X., Shan, S., Chen, X., Gao, W.: Locally linear regression for pose-invariant face recognition. IEEE Transactions on Image Processing 16(7), 1716–1725 (2007) 3. Tenenbaum, J., Silva, V., Langford, J.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(5500), 2319–2323 (2000) 4. Fu, Y., Huang, T.: Graph embedded analysis for head pose estimation. In: Proc. of International Conference on Automatic Face and Gesture Recognition (2006) 5. Raytchev, B., Yoda, I., Sakaue, K.: Head pose estimation by nonlinear manifold learning. In: ICPR (2004) 6. Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., Lin, S.: Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on PAMI, 40–51 (2007) 7. Srinivasan, S., Boyer, K.: Head pose estimation using view based eigenspaces. In: ICPR, vol. 16, pp. 302–305 (2002) 8. Wu, J., Trivedi, M.: A two-stage head pose estimation framework and evaluation. PR 41(3), 1138–1158 (2008) 9. Balasubramanian, V., Ye, J., Panchanathan, S.: Biased manifold embedding: a framework for person-independent head pose estimation. In: CVPR (2007) 10. Wang, X., Huang, X., Gao, J., Yang, R.: Illumination and person-insensitive head pose estimation using distance metric learning. In: ECCV, vol. 2, pp. 624–637. IEEE Computer Society, Los Alamitos (2008) 11. Sugiyama, M.: Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research 8, 1027–1061 (2007) 12. Yan, S., Wang, H., Fu, Y., Yan, J., Tang, X., Huang, T.S.: Synchronized submanifold embedding for person-independent pose estimation and beyond. IEEE Transactions on Image Processing (2008) 13. Liu, X., Lu, H., Luo, H.: Smooth Multi-Manifold Embedding for Robust Identity-Independent Head Pose Estimation. In: Jiang, X., Petkov, N. (eds.) Computer Analysis of Images and Patterns. LNCS, vol. 5702, pp. 66–73. Springer, Heidelberg (2009) 14. Little, D., Krishna, S., Black, J., Panchanathan, S.: A methodology for evaluating robustness of face recognition algorithms with respect to variations in pose angle and illumination angle. In: ICASSP, vol. 2 (2005) 15. Roweis, S., Saul, L.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500), 2323–2326 (2000) 16. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning distance functions using equivalence relations. In: Proc. International Conference on Machine Learning, vol. 20, p. 11 (2003) 17. Gourier, N., Hall, D., Crowley, J.: Estimating Face orientation from Robust Detection of Salient Facial Structures. In: Proc. International Workshop on Visual Observation of Deictic Gestures (2004) 18. Hu, N., Huang, W., Ranganath, S.: Head pose estimation by non-linear embedding and mapping. In: ICIP, vol. 2, pp. 342–345 (2005)
3D Reconstruction of Human Motion and Skeleton from Uncalibrated Monocular Video

Yen-Lin Chen and Jinxiang Chai

Department of Computer Science and Engineering, Texas A&M University, College Station, Texas 77843-3112, USA {ylchen,jchai}@cse.tamu.edu
Abstract. This paper introduces a new model-based approach for simultaneously reconstructing 3D human motion and full-body skeletal size from a small set of 2D image features tracked from uncalibrated monocular video sequences. The key idea of our approach is to construct a generative human motion model from a large set of preprocessed human motion examples to constrain the solution space of monocular human motion tracking. In addition, we learn a generative skeleton model from prerecorded human skeleton data to reduce ambiguity of the human skeleton reconstruction. We formulate the reconstruction process in a nonlinear optimization framework by continuously deforming the generative models to best match a small set of 2D image features tracked from a monocular video sequence. We evaluate the performance of our system by testing the algorithm on a variety of uncalibrated monocular video sequences.
1 Introduction
Accurate reconstruction of 3D human motion and skeleton from uncalibrated monocular video sequences is one of the most important problems in computer vision and graphics. Its applications include human motion capture and synthesis, gait analysis, human action recognition, person identification, and video-based human motion retrieval. Building an accurate video-based motion capture system, however, is challenging because the problem is often ill-posed. Image measurements from monocular video sequences are often noisy and not sufficient to determine high-dimensional human movement. Occlusions, cloth deformation, image noise, and unknown camera parameters might further deteriorate the performance of the system. This paper introduces a new model-based approach that simultaneously estimates 3D human motion, full-body skeletal lengths, and camera parameters using a small set of 2D image features extracted from monocular video sequences. The key idea of our approach is to construct a generative motion model from a large set of prerecorded human motion data to eliminate the reconstruction ambiguity of video-based motion capture. Our generative motion model is constructed from a large set of structurally similar but distinctive motion examples.
We have found that a small set of parameters is often sufficient to model motion variations for particular human actions, e.g., walking, running, or jumping. We also construct a generative skeleton model from a large set of prerecorded human skeleton data to resolve the ambiguity of skeleton reconstruction. With such models, the video-based motion reconstruction process can be formulated as a matching problem: given an input image sequence, both human motion and skeletal lengths can be estimated by adjusting the model's parameters in such a way that it generates an "imagined image sequence" that is as similar as possible to the real measurement. Mathematically, we formulate the problem in a nonlinear optimization framework by minimizing an objective function that measures the residual difference between "imagined images" and real images. We develop an efficient gradient-based multi-resolution optimization process to compute 3D human motion, full-body skeletal lengths, and unknown camera parameters from a small set of 2D image features extracted from monocular video sequences. We run our optimization in a batch mode and simultaneously compute human motion across an entire sequence. We evaluate the performance of our approach by capturing a variety of human actions, including walking, running, and jumping, from uncalibrated monocular video sequences.
2 Background
In this section, we briefly discuss previous work on using human motion data for video-based motion reconstruction. We focus our discussion on reconstructing 3D human motion from monocular video sequences.

One solution to modeling human motion priors is to construct statistical motion models from prerecorded human motion data [1,2,3]. Howe and colleagues [1] learned human motion priors with a Mixture-of-Gaussians density model and then applied the learned density model to constrain the 3D motion search within a Bayesian tracking framework. Brand [2] applied Hidden Markov Models (HMMs) to statistically model dynamic full-body movements and used them to transform a sequence of 2D silhouette images into 3D human motion. Pavlovic and his colleagues [3] introduced switching linear dynamic systems (SLDS) for human motion modeling and presented impressive results for 3D human motion synthesis, classification, and visual tracking. A number of researchers have also constructed various statistical models for human poses and used them to sequentially transform 2D silhouette images into 3D full-body human poses [4,5,6].

Another solution is to use subspace methods to model the dynamic behavior of human motion [7,8,9]. For example, Ormoneit and his colleagues [7] estimated a subspace model of typical activities from a large set of pre-registered 3D human motion data. They integrated the learned priors into a Bayesian tracking framework and utilized them to sequentially track 3D human motion from monocular video sequences. To avoid the computational expense of Monte Carlo methods, Urtasun and his colleagues [9] recently introduced a gradient-based optimization approach for human motion tracking. They initialized the tracker in the first
Fig. 1. The first five modes of motion variations for walking
frame and then performed the tracking recursively forward with a small finite-size window (typically a three-frame window). Troje [8] also explored a similar subspace model for walking motions, in which the temporal variations in poses are expressed as a linear combination of sinusoidal basis functions, and used it to recognize gender from optical motion capture data.

Our work builds on the success of previous subspace methods for human motion modeling and tracking. However, we significantly extend subspace methods in several ways. First, we propose to use dynamic time warping functions to model speed variations of human movement. This allows us to reconstruct 3D human motion with different speed variations. Second, our system constructs a generative skeleton model from prerecorded human skeleton data and estimates human skeletal lengths directly from input video. This not only avoids manual specification/adjustment of human skeletal lengths but also removes tracking errors caused by inaccurate human skeletons. Third, unlike previous approaches, our approach simultaneously computes human motion, skeletal lengths, and camera parameters based on an entire video sequence. The batch-based optimization approach [10,11] significantly reduces the reconstruction ambiguity by utilizing image measurements from both the past and the future to estimate the current frame.
3 Overview
The key idea of our approach is to construct generative models for both human motion and skeleton from prerecorded data and then deform the generative models to best match 2D image measurements across an entire video sequence. The whole system contains three major components:

Motion preprocessing. We pre-register a set of structurally similar but distinctive motion examples with a semi-automatic method. All motion examples are then warped into a canonical timeline specified by a reference motion.

Human motion and skeleton parameterizations. We apply statistical analysis techniques to the pre-registered motion data and construct a parameterized human motion model p(λ) for particular actions, where λ is the vector of model parameters. A motion instance p(λ, z) can be generated by warping an instance of the parameterized motion model p(λ) with an appropriate time warping function z. We also construct a parameterized skeleton model s(β) from a large set of prerecorded human skeleton data to constrain the solution space of skeletal lengths.
Human motion and skeleton reconstruction. We deform the generative models to generate an optimal motion sequence p(λ, z) as well as skeletal model s(β) that best matches 2D image measurements across an entire image sequence. We formulate this as a gradient-based optimization problem and automatically compute motion parameters (λ, z) and skeleton parameters β as well as camera parameters ρ from a small set of 2D joint trajectories extracted from input video. We conduct continuous optimization in a coarse-to-fine manner. We describe each of these components in more detail in the next two sections.
4 Motion Data Preprocessing and Analysis
This section discusses how to preprocess a set of example motions, how to build a generative model for human motion from the preprocessed data, and how to construct a generative model for human skeleton from prerecorded human skeleton data.

4.1 Human Motion Data Preprocessing
We construct generative motion models from a database of prerecorded 3D human motion sequences. The generative motion models are action-specific; we require all examples must be structurally similar for the same action. A set of walking examples, for instance, must all start out on the same foot, take the same number of steps, have the same arm swing phase, and have no spurious motions such as a head-scratch. To build a parameterized motion model for walking, we start with a walking database which contains a wide range of variations (e.g. speeds, step sizes, directions, and styles). We assume the database motions are already segmented into walking cycles. If a database motion contains multiple walking cycles, we manually segment the motion into multiple cycles. We denote the set of motion examples in the database as {xn (t)|n = 1, ..., N ; t = 1, ..., Tn }, where xn (t) is the measurement of a character pose of the n-th example motion at the t-th frame, and Tn is the number of frames for the n-th example. We preprocess all motion examples in the database by registering them against each other. More specifically, we pick one example motion (e.g. normal walking) as a reference motion and use it to register the rest of database examples with appropriate time warping functions. We assume the reference motion x0 (t) includes T frames. We register example motions in a translation- and rotation-invariant way by decoupling each pose from its translation in the ground plane and the rotation of its hips about the up axis [12]. To ensure the quality of training data, we choose to use a semi-automatic process to register all motion examples in the database. To align each database example with the reference motion, we first manually select a small set of “key” frames, instants when important structural elements such as a foot-down occurs. We use the “key” frames to divide example motions into multiple subsequences and apply dynamic time warping techniques [12] to
automatically register each subsequence. Finally, we warp each motion example x_n(t), n = 1, ..., N, into a canonical timeline x̄_n(t), n = 1, ..., N, using the estimated time warping functions. For simplicity, we choose the reference motion x_0(t), t = 1, ..., T, to define the canonical timeline.
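As a rough stand-in for the registration step based on [12], the following sketch implements a generic dynamic-time-warping alignment of one motion subsequence to the reference timeline; the Euclidean pose distance and the frame-per-frame warp extraction are our simplifications, not the authors' method:

```python
import numpy as np

def dtw_align(ref, seq):
    """Align a motion subsequence `seq` (Tn, D) to the reference `ref` (T, D).
    Returns, for each reference frame, the index of the matched frame in `seq`."""
    T, Tn = len(ref), len(seq)
    cost = np.linalg.norm(ref[:, None, :] - seq[None, :, :], axis=2)
    acc = np.full((T, Tn), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(T):
        for j in range(Tn):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    # Backtrack; keep the earliest matched seq frame for every ref frame
    i, j = T - 1, Tn - 1
    warp = np.zeros(T, dtype=int)
    warp[i] = j
    while i > 0 or j > 0:
        moves = []
        if i > 0 and j > 0:
            moves.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            moves.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(moves)
        warp[i] = j
    return warp
```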
Human Motion Modeling
One way to parameterize human motions is as a weighted combination of the warped motion examples. This representation, however, is not compact, because the number of parameters grows linearly with the number of motion examples. More importantly, there might be significant redundancy among motion examples due to the spatial-temporal correlation of natural human movement. A better way is to apply statistical analysis to model the variations in the warped motion examples.

We form a high-dimensional vector X̄_n ∈ R^{D×T} by sequentially stacking all poses in a warped motion example x̄_n(t), t = 1, ..., T, where D is the dimensionality of the character configuration space. In our system, we model a human pose with a 62-dimensional vector, which includes the position and orientation of the root (6D) and the relative joint angles of 20 joints (56D).

We apply principal component analysis (PCA) to all warped motion examples X̄_n, n = 1, ..., N. As a result, we can construct a parameterized motion model P using the mean motion P_0 and a weighted combination of eigenvectors P_j, j = 1, ..., m:

P(λ_1, ..., λ_m) = P_0 + Σ_{j=1}^{m} λ_j P_j    (1)

where the weights λ_j are motion model parameters and the vectors P_j are a set of orthogonal modes that model motion variations in the training examples. A motion instance generated by the parameterized model can be represented by

p(t, λ) = p_0(t) + Σ_{j=1}^{m} λ_j p_j(t)    (2)

where λ = [λ_1, ..., λ_m]^T. What remains is to determine how many modes (m) to retain. This leads to a trade-off between the accuracy and the compactness of the motion model. However, it is safe to treat small-scale variation as noise. We therefore automatically determine the number of modes by keeping 99 percent of the original variation. Figure 1 shows the first five modes of variation constructed from our walking database.

We also fit a multivariate normal distribution to the motion parameters λ. The probability density function for the parameters λ is given by

prob(λ) ∼ exp[−(1/2) Σ_{j=1}^{m} (λ_j/σ_{M,j})^2]    (3)

where σ_{M,j}^2 are the eigenvalues of the covariance matrix.
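For concreteness, the construction of Equations (1)–(3) can be sketched in a few lines of Python. The sketch below is illustrative only: the array layout, the SVD-based PCA routine, and all function and variable names are our assumptions rather than the authors' implementation.

```python
import numpy as np

def build_motion_model(warped_examples, keep_variance=0.99):
    # warped_examples: (N, D*T) array, one pre-registered motion per row.
    X = np.asarray(warped_examples, dtype=float)
    p0 = X.mean(axis=0)                          # mean motion P_0
    U, s, Vt = np.linalg.svd(X - p0, full_matrices=False)
    var = (s ** 2) / (X.shape[0] - 1)            # eigenvalues of the covariance
    ratio = np.cumsum(var) / var.sum()
    m = int(np.searchsorted(ratio, keep_variance)) + 1   # keep ~99% of variation
    P = Vt[:m]                                   # orthogonal modes P_j
    sigma_M = np.sqrt(var[:m])                   # sigma_{M,j} in Eq. (3)
    return p0, P, sigma_M

def motion_instance(p0, P, lam):
    # Eq. (2): mean motion plus a weighted combination of the modes.
    return p0 + np.asarray(lam) @ P
```

Under this sketch, the prior of Eq. (3) corresponds to treating each λ_j as an independent zero-mean Gaussian with standard deviation σ_{M,j}.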
Motion Warping. Given motion parameters λ, Equation 2 can be used to generate a motion instance p(t, λ), t = 1, ..., T based on the reference motion p_0(t) and a small set of base motions p_j(t), j = 1, ..., m. However, the parameterized motion model p(t, λ) alone is not sufficient to model motion variation in input video, because it is learned from time-warped motion examples and therefore does not capture speed variations in human actions. One solution to address this limitation is to post-warp the parameterized motion model with an appropriate warping function w(t). The time warping function changes the speed of the motion instance p(t, λ), t = 1, ..., T by mapping its old time t to a new time w(t) frame by frame.

Note that modeling the time warping function w(t) only requires recovering a finite number of values, since the domain t = 1, ..., T is finite. We can therefore represent the time warping function w(t) with T values w_1, ..., w_T. If t is a frame in the reference motion p_0, then the corresponding frame in the model instance is w(t). At frame t, a motion instance generated by the model parameters λ has the pose p(t, λ). A warped parameterized motion instance can thus be defined as follows:

p(w_t, λ) = p_0(w_t) + Σ_{j=1}^{m} λ_j p_j(w_t)    (4)

We can therefore represent a generative motion model with two sets of parameters (λ, w), where w = [w_1, ..., w_T]^T. In practice, a time warping function should satisfy the following properties:
– Positivity: a time warping function should be positive: w(t) > 0.
– Monotonicity: a time warping function should be strictly increasing: w(t) > w(t − 1).

The monotonicity property ensures that the time warping function is invertible, so that the time points of the same event on two different time scales correspond to each other uniquely. Rather than modeling the time warping function w(t) in the original time space, we transform w(t) into a new space z(t):

z(t) = ln(w(t) − w(t − 1)),    t = 1, ..., T    (5)

We choose w(0) to be zero and thus have

w(t) = Σ_{i=1}^{t} exp[z(i)],    t = 1, ..., T    (6)

Equation 6 ensures that the monotonicity and positivity constraints on w_t are automatically satisfied if we model the time warping function in the new space z_t. Our final generative motion model is defined as follows:

p(t, λ, z) = p_0(Σ_{i=1}^{t} exp[z(i)]) + Σ_{j=1}^{m} λ_j p_j(Σ_{i=1}^{t} exp[z(i)]),    t = 1, ..., T    (7)
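The warped generative model of Equations (6) and (7) can be sketched as follows (illustrative Python; the linear interpolation used to evaluate poses at fractional warped times is our choice, not specified by the paper):

```python
import numpy as np

def warp_times(z):
    # Eq. (6): w(t) = sum_{i<=t} exp(z(i)); positivity and monotonicity of
    # the warp hold by construction.
    return np.cumsum(np.exp(np.asarray(z, dtype=float)))

def sample_motion(curve, w):
    # Evaluate a (T, D) pose curve at the (possibly fractional) warped
    # times w by linear interpolation, channel by channel.
    t = np.arange(1, curve.shape[0] + 1, dtype=float)
    return np.stack([np.interp(w, t, curve[:, d]) for d in range(curve.shape[1])], axis=1)

def generative_motion(p0, bases, lam, z):
    # Eq. (7): p(t, lam, z) = p0(w(t)) + sum_j lam_j * p_j(w(t)).
    w = warp_times(z)
    motion = sample_motion(p0, w)
    for l, pj in zip(lam, bases):
        motion += l * sample_motion(pj, w)
    return motion
```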
4.3 Human Skeleton Modeling
Similarly, we can construct a generative human skeleton model from a set of prerecorded human skeleton data. Our human skeletal data is downloaded from the online CMU mocap library¹ and represented in the Acclaim Skeleton File (ASF) format. Each skeleton example records the lengths of individual bones. In our experiments, the human skeletal model contains 24 bones: head, thorax, upper neck, lower neck, upper back, lower back, and left and right clavicle, humerus, radius, wrist, hand, hip, femur, tibia, and metatarsal (see Figure 2).
Fig. 2. The top five modes of human skeleton variation, which keep about 99% of the skeleton variation in a large set of prerecorded human skeletal models. Note that we normalize each bone segment by the left tibia (red).
Since we are dealing with the image stream obtained by a single monocular video camera, the absolute length of a segment cannot be inferred from the image stream. We thus focus on the segment proportion vector s = (s_1, ..., s_23)^T = (l_2/l_1, ..., l_24/l_1)^T, rather than the bone segment lengths themselves, where l_i, i = 1, ..., 24 is the length of a bone segment. Similar to the generative motion models, the generative model for the human skeleton can be expressed as a base skeleton s_0 plus a weighted combination of skeleton models s_j, j = 1, ..., b:

s(β) = s_0 + Σ_{j=1}^{b} β_j s_j    (8)
We also fit a multivariate normal distribution to the skeleton model weights β = [β_1, ..., β_b]^T. The probability density function for the skeleton parameters β is given by

prob(β) ∼ exp[−(1/2) Σ_{j=1}^{b} (β_j/σ_{S,j})^2]    (9)

where σ_{S,j}^2 are the eigenvalues of the covariance matrix.
4.4 Camera Model
To determine the 3D configuration of a human figure from its 2D image projection, we need to model the mathematical relationship between 3D coordinates and 2D image measurements.

¹ http://mocap.cs.cmu.edu
We assume a static uncalibrated camera and adopt a simplified perspective projection model. In general, a camera model with full degrees of freedom is usually parameterized by 11 parameters (six extrinsic and five intrinsic). We assume that the five intrinsic parameters are ideal; that is, zero skew and unit aspect ratio (the retina coordinate axes are orthogonal and pixels are square), and the center of the CCD matrix coincides with the principal point. The only unknown intrinsic parameter is the focal length f. Together with the six extrinsic parameters, the camera model is represented by ρ = [t_x, t_y, t_z, θ_x, θ_y, θ_z, f]^T, where the parameters (t_x, t_y, t_z) and (θ_x, θ_y, θ_z) describe the position and orientation of the camera, respectively:

(u, v, ω)^T = [[f, 0, 0], [0, f, 0], [0, 0, 1]] · [[r_1^T, t_x], [r_2^T, t_y], [r_3^T, t_z]] · (x, y, z, 1)^T    (10)

where the column vector (u, v, ω)^T is the homogeneous coordinate of the projection of a 3D point, and r_i^T is the i-th row vector of the camera rotation matrix. We can therefore model the projection function g with a seven-dimensional vector ρ.
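A minimal sketch of the projection in Eq. (10), with the seven camera parameters packed into ρ, is given below. The Euler-angle convention used to build the rotation matrix is an assumption of the sketch, since the paper does not specify it.

```python
import numpy as np

def project(points_3d, rho):
    # Simplified perspective projection of Eq. (10); rho packs the seven
    # camera parameters [tx, ty, tz, theta_x, theta_y, theta_z, f].
    tx, ty, tz, ax, ay, az, f = rho
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    # Rotations about the x, y, z axes (one common convention; assumed here).
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    K = np.array([[f, 0, 0], [0, f, 0], [0, 0, 1.0]])
    t = np.array([tx, ty, tz])
    cam = points_3d @ R.T + t            # 3D points in camera coordinates
    uvw = cam @ K.T                      # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective division
```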
5 Matching Parameterized Models to Image Streams
This section discusses how to apply the generative models to reconstruct the 3D human motion configuration from uncalibrated monocular video sequences. We use a small set of 2D joint trajectories y_input(t), t = 1, ..., T to simultaneously compute the human motion parameters (λ, z), the human skeleton parameters β, and the camera parameters ρ. We manually select a small set of joint points at key frames and then track their 2D positions y_input(t) across the entire sequence using the spacetime tracking algorithm described in [13]. Here, we focus our discussion on how to transform a small set of 2D joint trajectories into 3D human motion.

In an analysis-by-synthesis loop, the algorithm uses the generative motion model to generate a motion instance p(t, λ, z), t = 1, ..., T in the joint angle space, uses the forward kinematics algorithm to compute the 3D joint positions of a skeleton model instance s(β) throughout the whole motion, projects the computed 3D points into the 2D image space with the camera parameters ρ, and then updates the parameters according to the residual differences between the synthesized 2D joint points and the observed 2D image measurements. From parameter values (λ^i, z^i, β^i, ρ^i), we can synthesize the corresponding 2D joint trajectories in the image space as follows:

y_model(t, λ^i, z^i, β^i, ρ^i) = g(f(p(t, λ^i, z^i), s(β^i)), ρ^i),    t = 1, ..., T    (11)
where the vector-valued function f is a forward kinematics function which maps a joint-angle pose to 3D joint positions, and the vector-valued function g is a projection function from the 3D position space to the 2D image space (see Equation 10). The reconstructed human motion and skeleton should be as close as possible to the input video in terms of Euclidean distance:

E_I = Σ_t ||y_input(t) − y_model(λ, z, β, ρ, t)||^2    (12)

Reconstruction of 3D human motion and skeleton from monocular video sequences is an ill-posed problem. Besides the desired solutions, many non-natural human motions and skeletons might also be consistent with the image measurements from the input video. It is therefore essential to impose constraints on the set of solutions. In our system, we restrict the solution of human motion and skeleton to the vector space spanned by the prerecorded examples. Within the model space of both human motion and skeleton, solutions can be further restricted by a trade-off between matching quality and prior probabilities. More specifically, we consider prob(λ) and prob(β) defined in Equations (3) and (9) as well as priors for the time warping parameters z and the camera parameters ρ. In our experiments, we assume a multivariate Gaussian prior for the warping parameters:

prob(z) ∼ exp[−(1/2) Σ_{j=1}^{T} (z_j − z_mean,j)^2/σ_{Z,j}^2]    (13)

where the scalar z_mean,j is the mean of the example warping functions and σ_{Z,j} is their standard deviation. We also assume a constant or uniform prior for the camera parameters ρ.

In terms of Bayesian decision theory, the problem is to find the set of parameters that maximizes the posterior probability, given the image measurements y_input(t) from the input video. While the motion parameters λ, warping parameters z, skeleton parameters β, and camera parameters ρ completely determine the predicted image measurements y_model, the observed image measurements y_input may vary due to noise. For Gaussian noise with a standard deviation σ_I, the likelihood of observing y_input is prob(y_input | λ, z, β, ρ) ∼ exp[−E_I/(2σ_I^2)]. Maximum a posteriori (MAP) estimation is then achieved by minimizing the following cost function:

E = E_I/σ_I^2 + Σ_j λ_j^2/σ_{M,j}^2 + Σ_j β_j^2/σ_{S,j}^2 + Σ_j (z_j − z_mean,j)^2/σ_{Z,j}^2    (14)
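The MAP objective of Eq. (14) fits naturally into a nonlinear least-squares framework: it is the squared norm of a stacked residual vector, which is the form expected by solvers such as the Levenberg–Marquardt method used below. The following hedged sketch illustrates this; synthesize_2d stands for the composition g(f(p(t, λ, z), s(β)), ρ) of Eq. (11), and the parameter packing and names are assumptions of the sketch.

```python
import numpy as np

def map_residuals(params, y_input, synthesize_2d, sigmas, z_mean):
    # Residual vector whose sum of squares equals the MAP cost of Eq. (14).
    # `synthesize_2d` is assumed to return the model's 2D joint trajectories.
    lam, z, beta, rho = params
    sigma_I, sigma_M, sigma_S, sigma_Z = sigmas
    y_model = synthesize_2d(lam, z, beta, rho)
    r_image = (y_model - y_input).ravel() / sigma_I        # image term E_I / sigma_I^2
    r_motion = np.asarray(lam) / sigma_M                   # motion prior, Eq. (3)
    r_skeleton = np.asarray(beta) / sigma_S                # skeleton prior, Eq. (9)
    r_warp = (np.asarray(z) - z_mean) / sigma_Z            # warping prior, Eq. (13)
    return np.concatenate([r_image, r_motion, r_skeleton, r_warp])
```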
We analytically evaluate the Jacobian terms of the cost function and then run the optimization with the Levenberg-Marquardt algorithm in the Levmar library [14].

Initialization. We initialize the motion parameters λ and the skeleton parameters β with zeros. The initial guess for the warping parameters z is set to z_mean, which can be obtained from the precomputed warping function examples. We initialize the camera parameters ρ with a rough manual adjustment.

Multi-resolution motion reconstruction. To improve the speed and robustness of our optimization, we develop a multi-resolution optimization procedure
to reconstruct motion in a coarse-to-fine manner. We first form the input 2D motion y_input(t) and the parameterized motion p(t, λ) at coarse levels by temporally downsampling the input 2D motion trajectories y_input(t), the mean motion p_0(t), and the base motions p_j(t). We start the reconstruction process at the coarsest level and run the optimization to reconstruct the coarsest input motion with the coarsest parameterized motion. After the optimization at level 1 converges, we initialize the time-warping function z at level 2 by upsampling the estimated time-warping function from level 1, and initialize the motion, skeleton, and camera parameters (λ, β, ρ) at level 2 with the estimates from level 1. We repeat this process until the algorithm converges at the finest level. In our experiments, we set the downsampling rate to 2. The multi-resolution optimization runs very fast: for all the testing examples reported here, it takes less than two seconds to compute the full solution.

Multiple cycles. Our algorithm can also be applied to the automatic segmentation and reconstruction of long motion sequences that contain multiple cycles of the parameterized motion models, e.g. multiple walking cycles. The system sequentially matches the parameterized motion model with the whole sequence. Once a good match is found, the system removes the matched subsequence and repeats the process until no further matches are found.
6 Results
We demonstrate the performance of our system on a variety of monocular video sequences. Our results are best seen in the accompanying video, although we show sample frames of a few results here. The motion data sets were captured with a Vicon motion capture system of 12 MXF20 cameras with 41 markers for full-body movements at 120 Hz and then downsampled to 30 Hz. The current walking database includes 200 aligned motion examples with variations in speed, step size, direction, and style. The running database includes 100 motion examples with different speeds, step sizes, directions, and styles. The jumping database includes 50 motion examples with different jumping heights, jumping distances, directions, and styles. The numbers of parameters for walking, running, and jumping are 30, 20, and 18, respectively, when keeping 99 percent of the motion variations.

We tested the performance of our algorithm with a variety of uncalibrated video sequences, including normal walking, stylized walking, running, and jumping. We first select a small number of joint points at key frames and then interactively track the 2D positions of the image features throughout the whole sequence. Figure 3 shows sample images from the input video sequences as well as the reconstructed 3D human motion and skeleton seen from two different viewpoints. Our system can reconstruct high-quality human motion and skeleton even with a very small set of 2D joint trajectories. For example, in the jumping example (see Figure 3), we reconstruct the 3D human motion and skeletal lengths from five joint trajectories.
Fig. 3. Video-based human motion reconstruction for stylized walking and jumping: (top) the input video with tracked image features and reconstructed skeleton and motion. (middle) the reconstructed 3D motion and skeletal model from the same viewpoint. (bottom) the reconstructed 3D motion from a new viewpoint.
7 Conclusion
We present a new model-based approach for simultaneous reconstruction of 3D human motion and a skeletal model from uncalibrated monocular video sequences. We construct generative models for both human motion and skeleton from prerecorded data and then deform the models to best match a small set of 2D joint trajectories derived from monocular video sequences. We also develop an efficient multi-resolution optimization algorithm to estimate an entire motion sequence from input video. We demonstrate the power and effectiveness of our system by capturing 3D human motion data and skeletons from a variety of single-camera video streams, including walking, running, and jumping.
References
1. Howe, N., Leventon, M., Freeman, W.: Bayesian reconstruction of 3D human motion from single-camera video. In: Advances in Neural Information Processing Systems, vol. 12, pp. 820–826 (1999)
2. Brand, M.: Shadow puppetry. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1237–1244 (1999)
3. Pavlović, V., Rehg, J.M., MacCormick, J.: Learning switching linear models of human motion. In: Advances in Neural Information Processing Systems, vol. 13, pp. 981–987 (2000)
4. Rosales, R., Sclaroff, S.: Inferring body pose without tracking body parts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 506–511 (2000)
5. Agarwal, A., Triggs, B.: 3D human pose from silhouettes by relevance vector regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 882–888 (2004)
6. Elgammal, A., Lee, C.: Inferring 3D body pose from silhouettes using activity manifold learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 681–688 (2004)
7. Ormoneit, D., Sidenbladh, H., Black, M., Hastie, T.: Learning and tracking cyclic human motion. In: Advances in Neural Information Processing Systems, vol. 13, pp. 894–900 (2001)
8. Troje, N.: Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision 2(5), 371–387 (2002)
9. Urtasun, R., Fleet, D.J., Fua, P.: Temporal motion models for monocular and multiview 3D human body tracking. Computer Vision and Image Understanding (CVIU) 104(2), 157–177 (2006)
10. Fua, P.: Regularized bundle-adjustment to model heads from image sequences without calibration data. International Journal of Computer Vision 38(2), 153–171 (2000)
11. DiFranco, D., Cham, T.J., Rehg, J.M.: Reconstruction of 3D figure motion from 2D correspondences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 307–314 (2001)
12. Kovar, L., Gleicher, M.: Flexible automatic motion blending with registration curves. In: ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 214–224 (2003)
13. Wei, X., Chai, J.: Interactive tracking of 2D generic objects with spacetime optimization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 657–670. Springer, Heidelberg (2008)
14. Lourakis, M.: Levmar: Levenberg–Marquardt nonlinear least squares algorithms in C/C++ (2009), http://www.ics.forth.gr/~lourakis/levmar/
Mean-Shift Object Tracking with a Novel Back-Projection Calculation Method

LingFeng Wang¹, HuaiYu Wu², and ChunHong Pan¹

¹ National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
[email protected], [email protected]
² Peking University, Key Laboratory of Machine Perception (MOE)
[email protected]
Abstract. In this paper, we propose a mean-shift tracking method that uses a novel back-projection calculation. Traditional back-projection calculation methods have two main drawbacks: either they are prone to be disturbed by the background when calculating the histogram of the target-region, or they only consider the importance of a pixel relative to other pixels when calculating the back-projection of the search-region. In order to address these two drawbacks, we carefully consider the background appearance based on two priors, i.e., the texture information of the background, and the appearance difference between the foreground-target and the background. Accordingly, our method consists of two basic steps. First, we present a foreground-target histogram approximation method to effectively reduce the disturbance from the background; the foreground-target histogram is then used for the back-projection calculation instead of the target-region histogram. Second, a novel back-projection calculation method is proposed by emphasizing the probability that a pixel belongs to the foreground-target. Experiments show that our method is suitable for various tracking scenes and is appealing with respect to robustness.
1 Introduction
Tracking is a critical task in various applications such as surveillance [1], perceptual user interfaces [2], vision-based control, and so on. However, the performance of a tracking algorithm is greatly influenced by the variability of the potential target. This variability arises from three major sources: illumination variation, partial occlusion, and target appearance change.

The mean-shift algorithm has recently been employed for object tracking by measuring the similarity between consecutive frames [3,4]. Generally, the mean-shift algorithm [5,6] is a nonparametric statistical method that seeks the nearest mode of a sample distribution based on kernel density estimation. Specifically, the principle of the mean-shift tracking method is to track the centroid of the target-region by combining the sample weights in a local neighborhood with the kernel function.
This research is sponsored by the Natural Science Foundation of China (NSFC No. 60675012).
Accordingly, two fundamental problems need to be addressed: how to produce the sample weights in the search-region (also called the back-projection problem), and how to select the kernel function.

This paper proposes a method for real-time tracking based on a novel back-projection calculation. The back-projection is always calculated from a color-based target appearance model, which is represented by a color histogram. Traditionally, back-projection calculation based on a color histogram replaces each pixel with the probability associated with the histogram [7]. However, these methods have two main drawbacks. First, the calculated histogram of the target-region is prone to be disturbed by the background. In other words, if the target-region contains too much background information, the target-region histogram may concentrate on background pixels. Consequently, when calculating the back-projection of the search-region, the sample weights of background pixels may have higher values than those of foreground-target pixels. In this case, the mean-shift tracking method will no longer track the foreground-target, but the background. Second, it only considers the importance of a pixel relative to other pixels when calculating the back-projection of the search-region; the probability that a pixel belongs to the foreground-target is not taken into account.

Recently, Collins and Liu [8] proposed a method to partially address the above two drawbacks by emphasizing the importance of the background appearance. They claim that the feature space that best distinguishes the object from the background is the best feature space for tracking. However, this method does not distinguish between the two drawbacks, and roughly treats both of them as background disturbance. Moreover, it does not approximate the histogram of the foreground-target, which is very important during tracking.

In this paper, we propose a novel method that takes the background appearance into consideration. The flow chart of our tracking method is summarized in Fig. 2. As shown in Fig. 2, the core of our method consists of two basic steps: approximating the histogram of the foreground-target and calculating the back-projection of the search-region. In the first step, the ratio of the foreground-target in the target-region is approximated by varying the background ratio in the target-region; Alg. 1 describes this step in detail. In the second step, there are two key problems, i.e., how to describe the probability that a pixel belongs to the foreground-target and how to encode the importance of a pixel relative to other pixels in the search-region. Different from the traditional methods, which only consider the latter problem, we take both problems into account; Alg. 3 describes this step in detail. Experimental results show that our method is especially suitable for scenes where the background has obvious texture information.

This paper is organized as follows: Section 2 describes our method in detail; Section 3 presents experimental results; concluding remarks are given at the end of the paper.
2 Approach Implementation
The strategy of tracking is to search for the candidate-region most similar to the target-region between consecutive frames.
Fig. 1. Overview of all regions. The figure describes the target-region, candidate-region, search-region, foreground-target, and the two types of background: the in-target-region background and the out-target-region background.
Usually, color features are used to describe a region by calculating its color histogram q = {q_i}_{i=1}^{N}, where Σ_{i=1}^{N} q_i = 1, and the Bhattacharyya distance [9] is then used to measure the similarity between the histograms of two regions. In order to describe our method clearly, we illustrate all types of regions in Fig. 1. We denote the histogram of the target-region as q_T and the histogram of the foreground-target as q_F. We assume that local regions of the background have texture information. Thus, the two types of background histograms (i.e. the histogram of the in-target-region background and the histogram of the out-target-region background) are similar, and we can treat them as the same background histogram q_B in a unified way.

As illustrated in Fig. 1, the target-region is composed of the foreground-target and the background. Accordingly, the histogram q_T of the target-region is a linear superposition of the two corresponding histograms q_F and q_B. But if the target-region contains much background information, the ratio of the foreground-target is small, and the histogram q_T cannot accurately describe the histogram q_F, which is, however, crucial in tracking. Therefore, we propose an algorithm to approximate the histogram q_F (Alg. 1 describes the procedure in detail) and thereby reduce the disturbance of the background.
Fig. 2. Overview of our tracking algorithm. An input image is first analyzed by calculating q_T and q_B. Then, the novel back-projection calculation is performed in two basic steps: STEP 1 (described in detail by Alg. 1) approximates the histogram q_F of the foreground-target, while STEP 2 (described in detail by Alg. 3) calculates the back-projection b of the search-region. Finally, a mean-shift algorithm is applied to search for the next most probable position.
The flow chart of our tracking method is summarized in Fig. 2. A color input image is first analyzed by calculating the two histograms of the target-region q_T and the background q_B. Then, in the second stage, a novel back-projection calculation is performed with two basic steps: approximating the histogram q_F of the foreground-target (Alg. 1 describes it in detail), and calculating the back-projection b of the search-region (Alg. 3 describes it in detail). Finally, a classical mean-shift algorithm is applied to the back-projection b to search for the next most probable position. Consequently, the core of our method depends entirely on the two basic steps performed in the second stage. In subsections 2.1 and 2.2, we specify the two steps in detail.

2.1 Approximating the Histogram of Foreground-Target
The histogram q_T is a linear superposition of the two histograms q_F and q_B. The histogram q_F can therefore be calculated by:

q_T = α q_F + (1 − α) q_B   ⇒   q_F = (q_T − (1 − α) q_B) / α,    (0 < α < 1)    (1)
where α is the foreground-target ratio in the target-region. As described in Fig. 1, the position of the target-region is known from the previous frame. Thus, the histogram q_T of the target-region and the histogram q_B of the background (i.e. the out-target-region background) can be computed. According to Eqn. 1, we then need to estimate the ratio α in order to calculate q_F. First of all, the Bhattacharyya distance between two histograms q_I and q_J is defined as follows:

d = ρ(q_I, q_J) = Σ_{i=1}^{N} √(q_I,i · q_J,i)    (2)
From Eqn. 2, we can derive ρ(q_F, q_B) < ρ(q_B, q_B) = 1. As the histogram of the target-region is composed of the histogram q_F of the foreground-target and the histogram q_B of the background, the distance ρ(q_T, q_B) falls between ρ(q_F, q_B) and ρ(q_B, q_B), i.e. ρ(q_F, q_B) < ρ(q_T, q_B) < ρ(q_B, q_B). Accordingly, we define a new histogram q_T^ξ parameterized by ξ as follows:

q_T^ξ = |q_T − ξ q_B| / (Θ_ξ (1 − ξ)) = |α/(Θ_ξ(1 − ξ)) · q_F + (1 − α − ξ)/(Θ_ξ(1 − ξ)) · q_B|,    (0 ≤ ξ < 1)    (3)
where ξ denotes the varying ratio of background in the target-region, and Θ_ξ is a normalization coefficient that satisfies Σ_{i=1}^{N} |q_T,i − ξ q_B,i| / (Θ_ξ(1 − ξ)) = 1. Correspondingly, we define the distance between q_T^ξ and q_B as d_ξ, given by

d_ξ = ρ(q_T^ξ, q_B) = ρ(|α/(Θ_ξ(1 − ξ)) · q_F + (1 − α − ξ)/(Θ_ξ(1 − ξ)) · q_B|, q_B)    (4)
In Eqn. 3, the background ratio is approximately |(1 − α − ξ)/(1 − ξ)|. We can therefore draw three conclusions: 1. if ξ < 1 − α, the background ratio in the histogram q_T^ξ decreases as ξ increases; 2. if ξ > 1 − α, the background ratio in the histogram q_T^ξ decreases as ξ decreases; 3. if 1 − α − ξ = 0, the background ratio decreases to zero, i.e. the histogram q_T^ξ = q_F. We assume that when q_T^ξ = q_F, the distance d_ξ reaches its minimum value. In the experiment section, this hypothesis is validated to be reasonable, with small error in practice. Accordingly, if the ratio α = 1 − ξ, the distance d_ξ reaches its minimum value. Therefore, we can calculate the ratio α by the following procedure. First, we discretize the ratio ξ with a small increment and calculate the corresponding distance d_ξ. We then pick the ξ whose distance d_ξ is the minimum. At last, we obtain the ratio α by α = 1 − ξ and the histogram q_F from Eqn. 1. Alg. 1 describes the procedure in detail. Note that α̂ and q̂_F are the approximate values of α and q_F respectively, and d_min = min{d_ξ}.
Algorithm 1. Calculate α̂ and q̂_F

Data: histogram q_T of the target-region, histogram q_B of the background
Result: ratio α̂ and histogram q̂_F of the foreground-target
d_min = +∞; α̂ = 0;
for ξ ← 0 : 0.01 : 1 do
    calculate q_T^ξ by Eqn. 3; calculate d_ξ = ρ(q_T^ξ, q_B) by Eqn. 4;
    if d_min ≥ d_ξ then
        d_min = d_ξ; α̂ = 1 − ξ;
    end
end
calculate q̂_F by Eqn. 1
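A compact Python rendering of Alg. 1 is given below (illustrative only; the histogram arrays are assumed to be normalized, and the small guards against degenerate divisions are our additions):

```python
import numpy as np

def bhattacharyya(p, q):
    # Eq. (2): similarity between two normalized histograms.
    return np.sum(np.sqrt(p * q))

def approximate_foreground_histogram(q_T, q_B, step=0.01):
    # Sketch of Alg. 1: sweep the background ratio xi, keep the xi whose
    # residual histogram is least similar to the background, and recover
    # alpha_hat and q_F_hat from Eq. (1).
    d_min, alpha_hat = np.inf, 0.0
    for xi in np.arange(0.0, 1.0, step):
        q_xi = np.abs(q_T - xi * q_B)
        q_xi /= max(q_xi.sum(), 1e-12)          # plays the role of Theta_xi*(1-xi)
        d = bhattacharyya(q_xi, q_B)            # Eq. (4)
        if d <= d_min:
            d_min, alpha_hat = d, 1.0 - xi
    alpha_hat = min(max(alpha_hat, 1e-3), 1.0)  # guard against degenerate ratios
    q_F = (q_T - (1.0 - alpha_hat) * q_B) / alpha_hat    # Eq. (1)
    q_F = np.clip(q_F, 0.0, None)
    return alpha_hat, q_F / max(q_F.sum(), 1e-12)
```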
2.2 Calculating Back-Projection
In this subsection, a novel method for back-projection calculation is proposed. We first list some definitions. Denote the search-region as s = {s_1,1, s_1,2, ..., s_h,w}, where h and w are the height and width of s respectively, and s_i,j is the pixel value in s at (i, j). Denote the back-projection of the search-region as b = {b_1,1, b_1,2, ..., b_h,w}. As the calculation of b_i,j is the same for every (i, j), we write s and b instead of s_i,j and b_i,j in the following.
Algorithm 2. Calculate Back-Projection b By the Traditional Method
Data: q_F and the search-region s
Result: the back-projection of the search-region b
[h, w] = size(s);
for i ← 1 to h do
    for j ← 1 to w do
        v = s(i, j); b(i, j) = q_F(v);
    end
end
The back-projection b represents the probability that the pixel s belongs to the foreground-target relative to other pixels. Therefore, two aspects should be addressed: 1. encoding the importance of the pixel s by comparing it with other pixels; 2. describing the probability that s belongs to the foreground-target. Traditional back-projection calculation methods (Alg. 2) only consider the first aspect. To overcome the limitation of these methods, we propose a novel method that considers both aspects together. Similar to the traditional methods (Alg. 2), the first aspect is handled by calculating the probability q_F(s). We then focus on the second aspect. Specifically, the second aspect is the probability p(q_F|s) that a pixel belongs to the foreground-target. Analogously, we define p(q_B|s) as the probability that a pixel belongs to the background. For simplicity, we assume that these two aspects are independent. Thus, we can calculate b by multiplying the two probabilities:

b = q_F(s) · p(q_F|s)    (5)
According to the Bayesian rule, we obtain the probability p(q_F|s):

p(q_F|s) = p(s|q_F) p(q_F) / p(s) = p(s|q_F) / (p(s|q_F) + λ p(s|q_B)),    λ = p(q_B)/p(q_F)    (6)
From Eqn. 6, p(s|q_F), p(s|q_B) and λ are required in order to calculate the probability p(q_F|s). As described in subsection 2.1, the two histograms q_F and q_B are calculated at each frame. The two histograms essentially describe the probabilities that a pixel belongs to the foreground-target or to the background. Thus, these probabilities can be calculated by p(s|q_F) = q_F(s) and p(s|q_B) = q_B(s). The factor λ compares the prior probability that a pixel belongs to the background with the prior probability that it belongs to the foreground-target; in other words, λ is determined by the ratio of the foreground-target in the target-region. As the foreground-target ratio α is obtained in subsection 2.1, we get λ = (1 − α)/α. Based on Eqn. 5 and Eqn. 6, a novel back-projection calculation algorithm is presented in Alg. 3.

Algorithm 3. Calculate Back-Projection b By Our Novel Method
Data: q_F, q_B, the ratio λ = (1 − α)/α, and the search-region s
Result: the back-projection of the search-region b
[h, w] = size(s);
for i ← 1 to h do
    for j ← 1 to w do
        v = s(i, j);
        b_F(i, j) = q_F(v); b_B(i, j) = q_B(v);
        b(i, j) = b_F(i, j) · b_F(i, j) / (b_F(i, j) + λ b_B(i, j));
    end
end
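The novel back-projection of Eqs. (5)–(6) and Alg. 3 can be sketched as follows, assuming the search-region has already been quantized into histogram bin indices; λ is taken as p(q_B)/p(q_F) = (1 − α)/α, consistent with Eq. (6). All names are illustrative.

```python
import numpy as np

def back_projection(search_region, q_F, q_B, alpha):
    # Combine the usual histogram lookup with the posterior probability
    # that each pixel belongs to the foreground-target.
    # `search_region` is an integer array of quantized bin indices.
    lam = (1.0 - alpha) / max(alpha, 1e-6)      # lambda = p(q_B)/p(q_F)
    bF = q_F[search_region]                     # p(s | q_F), per pixel
    bB = q_B[search_region]                     # p(s | q_B), per pixel
    posterior = bF / (bF + lam * bB + 1e-12)    # Eq. (6)
    return bF * posterior                       # Eq. (5)
```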
3 Experiments and Results
This section contains two parts. First, we perform experiments to validate the hypothesis proposed in subsection 2.1, namely that when q_T^ξ = q_F, i.e. ξ = 1 − α, the distance d_ξ reaches its minimum value. Second, we perform experiments to test the novel tracking algorithm.

Hypothesis Testing Based on Synthetic Data: We generate test data from two Gaussian distributions, q_F = N(μ_F, σ_F) for the foreground-target and q_B = N(μ_B, σ_B) for the background, with the same variance σ = σ_F = σ_B. The similarity between the foreground-target and the background (the two Gaussian distributions) is defined as:

δ = |μ_F − μ_B| / √((σ_F^2 + σ_B^2)/2)    (7)
and the estimation error η of the foreground-target ratio α is defined as η = |α − α̂|/α, where α̂ is calculated by Alg. 1. Our experiment contains three basic steps. First, we estimate the typical similarity δ in practice. Then, fixing the similarity δ = 1.5 and varying the ratio α, we find the ξ whose corresponding distance d_ξ is the minimum. Finally, we calculate the estimation error η while varying the ratio α and the similarity δ. The three steps are specified below:
Step 1: We select 1000 images I = {I_1, I_2, ..., I_N}. For each image I_i, we label its foreground-target and background, and calculate the means μ_F, μ_B and variances σ_F, σ_B of the two histograms q_i^F and q_i^B. Then, we obtain the similarity δ_i from Eqn. 7. At last, δ is obtained as the mean δ = (1/N) Σ_{i=1}^{N} δ_i. Statistical results show that the similarity δ ≈ 1.8 in practice.

Step 2: We take α = {0.1, 0.3, 0.5, 0.7, 0.9} and δ = 1.5. For each α, we calculate d_ξ by Alg. 1. The result is shown in Fig. 3(a). As illustrated in this figure, d_ξ first decreases and then increases as ξ varies. Moreover, when ξ = 1 − α, the distance d_ξ approximates the minimum value d_min.
Fig. 3. Figure (a) shows the result of d_ξ when varying ξ, while figure (b) shows the result of η when varying α. The relevant parameters are as follows. Figure (a): α = {0.1, 0.3, 0.5, 0.7, 0.9}, ξ = {0.01, 0.02, ..., 0.99} and δ = 1.5. Figure (b): δ = {1.00, 1.25, 1.5, 1.75, 2.00} and α = {0.10, 0.11, ..., 0.90}.
Step 3: We take δ = {1, 1.25, 1.5, 1.75, 2} and α = {0.1, 0.11, ..., 0.9}. For each δ, we calculate the estimation error η while varying α. The result is shown in Fig. 3(b). As shown in this figure, when δ increases, the mean of η decreases. Furthermore, when δ ≥ 1.75, the estimation error η < 0.1.

From Step 2, we can see that when ξ = 1 − α, the distance d_ξ approximates the minimum value. Furthermore, Step 3 shows that when δ ≥ 1.75, the estimation error η of the foreground-target ratio α is less than 0.1. And from Step 1, statistical results show that δ ≈ 1.8 in practice. To sum up, the approximation error of α is no more than 0.1 in practice. Since the error of the ratio α is small compared to other factors in real-time tracking, the hypothesis is reasonable.

Tracking Results: The new back-projection calculation method is applied to mean-shift object tracking. The core of our tracking algorithm contains three principal steps. First, Alg. 1 is used to obtain the ratio α and the histogram q_F of the foreground-target. Then, Alg. 3 is used to calculate the back-projection b of the search-region. At last, the mean-shift algorithm is employed to find the next most probable position.

Further analysis [8] shows that the five features chosen most often are R-G, 2G-R, 2G-B-R, 2G+B-2R, and 2R-G-B. In our experiments, 2G-B-R, 2G+B-2R, and 2R-G-B are combined as the feature. Moreover, the width and height of the out-target-region background are selected to be two times those of the target-region at the previous frame, while the width and height of the search-region are selected to be three times those of the target-region at the current frame. Since the histogram q_F of the foreground-target varies frame by frame, we update it as follows:

q_t^F ← (1 − β) q_{t−1}^F + β q_t^F    (8)
where β is the updating ratio and t is the frame index. In our experiments, the update ratio β is set to 0.01.

Fig. 4 shows some tracking results obtained with our mean-shift tracking method. The test data is taken from the VIVID Tracking Evaluation web site [10]. As shown in the figure, three different scenes are tested. The large images on the left are the tracking scenes, and the small images on the right are the tracking results. The numbers pointing to the target-regions are the foreground-target ratios α in the target-regions.

In Fig. 4(a), although the foreground object's ratio α is no more than 0.5, the target-region can still be tracked well. The main reason is that the foreground-target histogram approximation (Alg. 1) overcomes the disturbance of the background: although there is much background information in the target-region, Alg. 1 can still approximate the histogram of the foreground-target accurately. The foreground-target ratios α in Fig. 4(b, c) vary more strongly than the ratio in Fig. 4(a); in other words, the foreground-target and the background are more similar in Fig. 4(b, c) than in Fig. 4(a). In this challenging case, the target-region can still be tracked correctly to a certain degree. The main reason is that the novel back-projection calculation (Alg. 3), which considers the probability that a pixel belongs to the foreground-target, decreases the disturbance of the background. Therefore, although the ratio α in some frames is calculated imprecisely, the target-regions can still be tracked correctly in Fig. 4(b, c).
Fig. 4. Three tracking results based on our mean-shift tracking method with the novel back-projection calculation. The large images on the left are the tracking scenes (the first frame of each scene). Since all the target-regions are small compared to the whole scene, only image patches that contain target-regions are shown. The numbers pointing to the target-regions are the foreground-target ratios α, which are calculated by Alg. 1.
4 Conclusion
Robustly tracking moving targets is an important topic in many vision applications. This paper performs mean-shift tracking based on a novel back-projection calculation method. The contributions of this paper are two-fold. First, we propose an approach to better approximate the histogram of the foreground-target; thus, the disturbance caused by the background greatly decreases. Second, in the back-projection calculation, we take an additional factor into account, namely the probability that a pixel belongs to the foreground-target. Compared with the traditional back-projection calculation, this decreases the disturbance from similar
pixels in the background. Note that a limitation remains when the background has no texture information: the histogram of the foreground-target may then be approximated imprecisely. Fortunately, the influence of this drawback can be weakened by our new back-projection calculation method. Experimental results show that our method is robust and suitable for many scenes, especially when the background has obvious texture information.
References
1. Greiffenhagen, M., Comaniciu, D., Niemann, H., Ramesh, V.: Design, analysis and engineering of video monitoring systems: An approach and a case study. Proceedings of the IEEE 89, 1498–1517 (2001)
2. Bradski, G.R.: Computer vision face tracking as a component of a perceptual user interface. In: IEEE Workshop on Applications of Computer Vision, pp. 214–219 (1998)
3. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 142–149 (2000)
4. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 564–577 (2003)
5. Fukunaga, K., Hostetler, L.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21, 32–40 (1975)
6. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 603–619 (2002)
7. Lee, J.H., Lee, W.H., Jeong, D.S.: Object-tracking method using back-projection of multiple color histogram models 2, 668–671 (2003)
8. Collins, R., Liu, Y., Leordeanu, M.: On-line selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1631–1643 (2005)
9. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37, 145–151 (1991)
10. Collins, R., Zhou, X., Teh, S.K.: An open source tracking testbed and evaluation web site. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2005) (January 2005)
A Shape Derivative Based Approach for Crowd Flow Segmentation

Si Wu¹, Zhiwen Yu¹,², and Hau-San Wong¹

¹ Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
² School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
[email protected], {zhiwenyu,cshswong}@cityu.edu.hk
Abstract. Crowd movement analysis has many practical applications, especially in video surveillance. The common methods are based on pedestrian detection and tracking. As crowd density increases, however, it becomes difficult for these methods to analyze crowd movement because of the computational cost and complexity. In this paper, a novel approach for crowd flow segmentation is proposed. We employ a Weighting Fuzzy C-Means clustering algorithm (WFCM) to extract the motion region in the optical flow field. In order to further analyze crowd movement, we use translation flow to approximate local crowd movement and design a shape derivative based region growing scheme to segment the crowd flows. In the experiments, the proposed method is tested on a set of crowd video sequences ranging from low density to high density.

Keywords: Crowd Flow Segmentation, Translation Domain, Shape Derivative, Region Growing Scheme.
1 Introduction
There are many scenes in the real world in which people assemble for various activities. In order to detect abnormal events in a crowd, many surveillance systems have been developed and applied to traffic management, human behavior recognition, pedestrian counting, and so on. For low density crowds, researchers have proposed a large number of methods to detect and track pedestrians in video sequences, and many of them, such as mean-shift [1], have achieved good performance. Unfortunately, methods for detecting and tracking pedestrians are not applicable to high density crowds because of serious occlusion and enormous computation. Recently, Ali [2] proposed a method which analyzes crowd movement through the optical flow field. However, there still exist several difficulties in crowd flow analysis based on discrete vector field segmentation. On one hand, it is hard to obtain a smooth vector field from a crowd video because of the incoherence of the motions of the individuals. On the other hand, the crowd may smoothly change its flow direction, so that it is difficult to represent the crowd flow field.
In this paper, we propose a shape derivative based approach to analyze crowd movement. To obtain a robust and smooth optical flow field, we first use a set of key points to represent the image plane. According to the movement information at the key point locations, an interpolation method based on Delaunay Triangulation is employed to estimate the whole vector field. Then, a fuzzy clustering technique is used to extract the motion region. In order to further analyze the crowd movement, we use translation flow to approximate the local crowd movement and design a shape derivative based region growing scheme to segment the motion region. In the experiments, the proposed method is tested on several crowd videos ranging from low density to high density, and it achieves good performance on motion detection and crowd flow segmentation.

The remainder of the paper is organized as follows. Section 2 describes related work. Section 3 provides a brief overview of the shape derivative based translation domain segmentation method. Section 4 gives the details of the proposed approach. Section 5 evaluates the performance of our method through experiments. Finally, the paper is concluded in Section 6.
2 Related Work
Recently, a number of researchers have performed analysis of crowd movement. Most of their work focuses on human segmentation in crowds. In [3], the foreground was reconstructed by a human shape model, and the solution was determined by maximizing the posterior probability. Based on the same human shape model, Rittscher [4] employed an Expectation Maximization formulation to optimize a likelihood function related to the shape and location of humans. Similar to [4], Tu [5] introduced monolithic whole body classifiers for the analysis of partially occluded regions. In [6], a shape indexing strategy was proposed to quickly estimate the number of humans and their positions. To the best of our knowledge, there are only a few works on crowd movement analysis which consider the crowd as a single entity. Among these works, Ali [2] employed Lagrangian Particle Dynamics to segment crowd flow based on an optical flow field representation, and in [7], the optical flow field was also used for motion pattern learning in a crowd scene. For the segmentation of 2D discrete vector fields, Shi [8] proposed the normalized cut criterion and applied it to image segmentation. Li [9] used the Green Function method to implement the Hodge Decomposition, and obtained the curl-free and divergence-free components for vector field segmentation. In [10] and [11], the segmentation of a vector field was implemented by minimizing an energy function, and Roy [12] employed the shape derivative technique to solve the minimization problem. Our method is inspired by [2]. We use the optical flow field to represent the crowd movement in the image plane, and combine the shape derivative technique with a region growing scheme to segment the flow field.
3 Shape Derivative Based Translation Domain Segmentation
In this section, we briefly introduce the shape derivative based translation domain segmentation model of [12]. For a two-dimensional vector field E, a translation domain is a domain Ω in E whose field lines are straight lines. Since any vector in a translation domain should be orthogonal to the normal along a field line, there exists a unique parameter a(Ω) in C+ (C+ is the set {(cos(θ), sin(θ)) | θ ∈ [−π/2, π/2]}) which satisfies Eq. (1):

a(Ω) · E(x) = 0    (1)
Consequently, translation domain segmentation can be cast as finding the largest region in E which satisfies Eq. (1). This problem can be represented by minimizing the following energy function:

J(τ) = ∫_{Ω(τ)} (a(τ) · E(x))^2 dx − μ ∫_{Ω(τ)} dx    (2)

where τ is an evolution parameter and μ is a positive constant. To solve the above optimization problem, we calculate the partial derivative of J with respect to τ [12,10,11] as follows:

dJ/dτ (τ, V) = − ∫_{∂Ω(τ)} [(a(τ) · E(s))^2 − μ] V(s) · N_τ(s) ds    (3)

where ∂Ω(τ) is the oriented boundary of Ω(τ), s is the arclength along ∂Ω(τ), and N_τ is the inward unit normal of Ω(τ). When the deformation V is chosen as (a(τ) · E)^2 − μ, the derivative in Eq. (3) is negative. Therefore, a contour evolution equation for the translation domain, as shown in Eq. (4), is used for the minimization of the energy function J(τ):

∂Γ/∂τ = [(a(τ) · E)^2 − μ] N_τ    (4)

where Γ(τ = 0) is the initial contour of the estimated translation domain. In addition, if Ω is a translation domain, the parameter a(Ω) can be calculated by minimizing the following function:

K_Ω(a) = ∫_Ω (a · E(x))^2 dx    (5)

Therefore, a is the unique solution in C+ of Eq. (6):

Q_τ a = δ_min a,    where Q_τ = ∫_{Ω(τ)} E(x) E(x)^T dx    (6)

and δ_min is the smallest eigenvalue of Q_τ.
For the translation domain segmentation, the iteration consists of the following two steps: first, the parameter a is calculated in the estimated translation domain Ω according to Eq. (6); second, the domain Ω can be updated according to the contour evolution equation as shown in Eq. (4).
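One iteration of this two-step scheme can be sketched as follows (illustrative Python under our own naming; the explicit Euler step used to move contour points is an assumption, and [12] should be consulted for the full level-set formulation):

```python
import numpy as np

def fit_direction(E_patch):
    # Eq. (6): a(Omega) is the eigenvector of Q = sum E E^T with the
    # smallest eigenvalue, restricted to C+ (non-negative first component).
    E = E_patch.reshape(-1, 2)
    w, V = np.linalg.eigh(E.T @ E)
    a = V[:, 0]                                  # eigenvalues sorted ascending
    return a if a[0] >= 0 else -a

def evolve_contour(contour, normals, E_at_contour, a, mu, step=1.0):
    # Eq. (4): move each contour point along its inward normal with speed
    # (a . E)^2 - mu.
    speed = (E_at_contour @ a) ** 2 - mu
    return contour + step * speed[:, None] * normals
```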
4 The Proposed Method
The proposed method consists of three stages: flow field estimation, motion segmentation, and a region growing scheme. Specifically, we first use an interpolation method to estimate the optical flow field according to the movement information at the key point locations. Then, a fuzzy clustering algorithm is employed to extract the motion region. Finally, we design a shape derivative based region growing scheme to segment the crowd flows.

4.1 Flow Field Estimation
The optical flow field, as an example of a discrete vector field, can represent the movement of objects in video sequences. The optical flow components are computed by the Lucas–Kanade method [13] from Eq. (7):

A U = b    (7)

where U = (u, v)^T, and

A = [I_x1, I_y1; I_x2, I_y2; ...; I_xp, I_yp],    b = [−I_t1; −I_t2; ...; −I_tp]

and I_x, I_y and I_t are the partial derivatives of the image I(x, y) with respect to the horizontal and vertical coordinates and time, respectively. According to the least-squares method, the solution of Eq. (7) can be written as Eq. (8):

U = (A^T A)^{−1} A^T b    (8)

To measure the numerical stability of Eq. (8), we introduce the condition number n of A^T A, which is defined in Eq. (9):

n = ||B|| · ||B^{−1}||  if B is non-singular,    n = ∞  if B is singular    (9)

In order to obtain a robust flow field, we compute the optical flow components at the key point locations instead of at all pixel locations. Then, an interpolation method for scattered data is used to estimate the whole flow field.
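The per-keypoint computation of Eqs. (7)–(9) can be sketched as follows (illustrative Python; the window half-size and the gradient images Ix, Iy, It are assumed to be precomputed):

```python
import numpy as np

def lk_flow_at_point(Ix, Iy, It, y, x, half=3):
    # Stack the gradients in a (2*half+1)^2 window around the key point,
    # solve the least-squares system of Eq. (8), and report the condition
    # number of A^T A (Eq. (9)) so unstable estimates can be rejected.
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    AtA = A.T @ A
    cond = np.linalg.cond(AtA)
    if not np.isfinite(cond):
        return None, np.inf
    U = np.linalg.solve(AtA, A.T @ b)
    return U, cond
```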
The Scale Invariant Feature Transform (SIFT) method [14] is employed to detect key points in the image plane, which can be considered as scale-space extrema. Their locations are invariant to scale changes of the image. Therefore, the key points can provide motion information more accurately than the other pixel locations in the plane. For each key point, the optical flow components are calculated in the R, G and B channels respectively. Only the result from the channel in which the condition number n of A^T A is smallest is kept as the valid optical flow component for further processing.

Since the distribution of the key points is scattered, we employ Delaunay Triangulation to partition the image plane. Each pixel in the image plane has a unique closest triangle. Therefore, the optical flow components at each pixel location can be interpolated from the vertices of the closest triangle according to Eq. (10):

U(X) = Σ_{i=1}^{3} ω_i φ(X, X_i) U(X_i)    (10)

where

ω_i = φ(X, X_i) / Σ_{i=1}^{3} φ(X, X_i),    φ(X, X_i) = exp(−(X − X_i) Σ^{−1} (X − X_i)^T)

and X_i is one of the vertices of the triangle and ω_i is the weight of that vertex. φ(X, X_i) in Eq. (10) is the membership measure of X with respect to X_i, which can be interpreted as an attenuation coefficient. Finally, a median filter is used to smooth the pixels on the triangle edges.
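A sketch of the interpolation of Eq. (10) using SciPy's Delaunay triangulation is given below; the handling of pixels outside the triangulation and all names are our assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def interpolate_flow(key_points, key_flows, grid_points, Sigma):
    # Eq. (10): each pixel takes a weighted blend of the flow vectors at
    # the vertices of its Delaunay triangle, with weights
    # phi(X, Xi) = exp(-(X - Xi) Sigma^{-1} (X - Xi)^T).
    tri = Delaunay(key_points)
    simplex = tri.find_simplex(grid_points)            # triangle index per pixel
    Sigma_inv = np.linalg.inv(Sigma)
    flows = np.zeros((len(grid_points), 2))
    for k, (X, s) in enumerate(zip(grid_points, simplex)):
        if s < 0:                                      # outside the triangulation
            continue
        verts = tri.simplices[s]
        d = key_points[verts] - X                      # (3, 2) vertex offsets
        phi = np.exp(-np.einsum('ij,jk,ik->i', d, Sigma_inv, d))
        w = phi / phi.sum()
        flows[k] = w @ key_flows[verts]
    return flows
```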
4.2 Motion Segmentation
Since only the motion regions in the image plane are of interest to us, we employ a Weighting Fuzzy C-Means clustering algorithm (WFCM) [15] to extract the motion region I_M in the optical flow field. We first convert the optical flow field into a scalar field: at each pixel location, the scalar value is defined as the norm of the optical flow vector U. A gray image I_V can be obtained by normalizing the scalar field; the intensity at each pixel location is proportional to the flow vector norm at the corresponding location in the flow field. Let L = {l_1, l_2, ..., l_n} be the set of gray levels in I_V. In order to highlight the contribution of the important gray levels, a set of weights W = {w_1, w_2, ..., w_n} for L is given as in Eq. (11):

w_i = h(l_i) / (M · N),    i = 1, ..., n    (11)

where h(l_i) is the number of pixels with gray level l_i, and the size of the image is M × N. WFCM is employed to search for a partition of the gray levels into two clusters, which correspond to the background and the motion region in the image plane.
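Since the details of WFCM are given in [15], the sketch below only illustrates the idea with a standard weighted fuzzy c-means update on the gray levels, where each level is weighted by its histogram mass as in Eq. (11); the exact update rules of [15] may differ.

```python
import numpy as np

def weighted_fcm_two_clusters(levels, weights, m=2.0, iters=50):
    # levels: 1D array of gray levels; weights: histogram masses w_i of Eq. (11).
    c = np.array([levels.min(), levels.max()], dtype=float)   # two cluster centers
    for _ in range(iters):
        d = np.abs(levels[:, None] - c[None, :]) + 1e-9       # (n, 2) distances
        # Standard fuzzy membership update: u_ik = 1 / sum_j (d_ik/d_ij)^(2/(m-1)).
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        wu = weights[:, None] * u ** m
        c = (wu * levels[:, None]).sum(axis=0) / wu.sum(axis=0)
    # Levels assigned to the higher-intensity cluster form the motion region.
    return c, np.argmax(u, axis=1)
```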
Fig. 1. The flow field estimation for the marathon video. (a) A certain frame of the marathon video. (b) The extracted key points. (c) Delaunay Triangulation. (d) The estimated flow field.
4.3 A Region Growing Scheme
In this stage, the vectors in I_M are normalized to unit vectors, and only the direction information is used in further processing. An important assumption is that local crowd motion can be approximated by a translation domain. Based on this assumption, it is reasonable to select the most coherent region in I_M as the initial seed region. For the i-th vector in I_M, the neighboring vectors in a window of size s × s form a set of unit vectors {(u_1, v_1), (u_2, v_2), ..., (u_{s^2}, v_{s^2})}. The coherence measure for the flow field is given in Eq. (12):

C_i = exp(−(std_u^i + std_v^i))    (12)

where

std_k^i = (1/s^2) Σ_{j=1}^{s^2} (k_j − k̄)^2,    k̄ = (1/s^2) Σ_{j=1}^{s^2} k_j,    (k = u, v)
A Shape Derivative Based Approach for Crowd Flow Segmentation
99
a(Ωi ) determined by Eq. (6) can be used to represent Ωi . To implement region growing, pi will move in the direction described by the function D D(a(Ωi ), pi ) = sgn((a(Ωi ) · U (pi ))2 − μ)N (pi ),
(13)
where U (pi ) and N (pi ) are the unit vector and the unit normal of the boundary at the pi location respectively, and μ is a tolerance coefficient. Eq. (13) implies that the boundary control point will expand outward when its vector is coherent with the local domain in the tolerance range, and will shrink inward when the coherence between its vector and the region is out of the tolerance range. When all the boundary control points in P have updated their positions, the region R can be reassigned by the new boundary through the interpolation of P . Then, P will be reassigned according to equidistant sampling in the new boundary. This iteration will continue until there is no change in the shape of the region. After obtaining the segmented region R, we will remove it from the motion region IM . Then a new window, whose center is at the location with the highest coherence in IM , is considered as the new seed region R. The previous region growing process is repeated for R. When IM becomes an empty set, the original motion region is segmented into a set of subregions with local coherence. When the crowd density is low, the motion region may be oversegmented. Let B = {B1 , B2 , . . . , BL } be the set of subregions. These subregions will be merged if they satisfy the following equations ⎧ ⎪ ⎨ dist(Bi , Bj ) ≤ S, ¯ (Bi ), U ¯ (Bj ) (14) U ⎪ ≤ θ, ⎩ arccos ¯ ¯ U (Bi ) · U (Bj ) where dist(Bi , Bj ) is the minimum distance between the interior points of Bi ¯ i ) and U ¯ (Bi ) are the average vectors of Bi and Bj respectively. and Bj , and U(B
5
Experiments and Discussion
The experiments consist of two parts: the experiments for comparison between the translation domain segmentation model (Section 3) and the proposed method, and the experiments for crowd flow segmentation. 5.1
Shape Derivative Based Region Growing
There are two simulated flow fields for testing the performance of the proposed region growing scheme. We simulate the following two cases: the crowd smoothly changes the motion direction, and the crowd flows around an ellipse. Since the scheme works on the normalized vector field, the vectors in the motion region are assigned to unit vectors and others are assigned to zero. In the two examples, we manually select the initial seed regions labeled by the blue rectangles in Figure 2(a) and (d), then the regions are processed by the translation domain segmentation model and also by the proposed region growing scheme. The parameter,
Fig. 2. The results of the simulated flow field segmentation. (a) and (d) are two different simulated flow fields. The initial regions are marked by the blue rectangles. The results of translation domain segmentation are marked by the blue contours in (b) and (e). The results of the proposed method are shown in (c) and (f).
μ, in Eq. (13) is in the range [0.1, 0.3]. In Figure 2(b) and (e), the translation domain segmentation model fails to identify the flow as an entity, since neither of the simulated flow fields is considered as a single translation domain. On the other hand, the proposed region growing scheme successfully expands the initial region to the whole motion region, and the results are shown in Figure 2(c) and (f). 5.2
Crowd Flow Segmentation
The experiments consist of three crowd videos for testing the performance of the proposed method in the different scenes. For each video, only two consecutive frames are required to compute the optical flow field by the method presented in Section 4.1. The parameter, μ, in Eq. (13) is within the range [0.5, 0.7]. For the parameters, S and θ, in the region merging criteria (Eq. (14)), the user can specify the values according to the video resolution and the density of the crowd. In the segmentation results, the background is marked by the blue color, and the motion regions marked by the same color belong to the same class. In the first example (Figure 3(a)), the pedestrians are walking in the street. The detected motion regions can correctly cover most moving pedestrians. The pedestrians in the red region are moving towards the left, and the pedestrians in the green region are moving towards the right. In the second example (Figure 3(b)), the pedestrians are crossing the intersection. According to the pedestrians’ locations, the crowd can be divided into three parts: left, middle and right. However, there are several pedestrians having different motion directions from others in the right part. Therefore, the crowd should be partitioned into four
Fig. 3. The results of crowd flow segmentation. There are three different crowd scenes in (a)-(c). The top row is a certain frame of the test video, and the bottom row is the segmentation result.
parts. Our method successfully detects most motion regions and segments the crowd into four parts. The pedestrians in the four different color regions belong to the above four parts; in particular, the yellow region represents the pedestrians with abnormal motion in the right group. In the third example (Figure 3(c)), the scene is a marathon race and the density of the crowd is high. In this case, the crowd flows around the center building. Our method can correctly identify the crowd flow represented by the red region. Since the crowd movement in the upper part of the scene is not significant, our method cannot detect the complete motion region.
6
Conclusion
In this paper, we investigate the crowd flow segmentation problem and propose a new framework for crowd movement analysis. Different from the pedestrian detection and tracking systems, we directly implement crowd flow segmentation using optical flow field. To represent the crowd movement, an interpolation method based on Delaunay Triangulation is used to estimate the smooth optical flow field in a robust way. In addition, the motion region can be efficiently extracted by a fuzzy clustering algorithm, while avoiding the difficulty of thresholding. Finally, the shape derivative technique is combined with a region growing scheme to optimize the evolution of the seed region. In the experiments, the proposed method effectively segments the crowd flows in the test videos. Acknowledgments. The work described in this paper was partially supported by a grant from the Research Grants Council of Hong Kong Special Administrative Region, China [Project No. CityU 121607] and a grant from City University of Hong Kong [Project No. 7002374].
References

1. Beleznai, C., Frühstück, B., Bischof, H.: Human Tracking by Fast Mean Shift Mode Seeking. Journal of Multimedia 1, 1–8 (2006)
2. Ali, S., Shah, M.: A Lagrangian Particle Dynamics Approach for Crowd Flow Segmentation and Stability Analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–6 (2007)
3. Zhao, T., Nevatia, R.: Bayesian Human Segmentation in Crowded Situations. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 459–466 (2003)
4. Rittscher, J., Tu, P., Krahnstoever, N.: Simultaneous Estimation of Segmentation and Shape. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 486–493 (2005)
5. Tu, P., Sebastian, T., Doretto, G., Krahnstoever, N., Rittscher, J., Yu, T.: Unified Crowd Segmentation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 691–704. Springer, Heidelberg (2008)
6. Dong, L., Parameswaran, V., Ramesh, V., Zoghlami, I.: Fast Crowd Segmentation Using Shape Indexing. In: IEEE International Conference on Computer Vision (ICCV), pp. 1–8 (2007)
7. Hu, M., Ali, S., Shah, M.: Learning Motion Patterns in Crowded Scenes Using Motion Flow Field. In: IEEE International Conference on Pattern Recognition (ICPR), pp. 1–5 (2008)
8. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
9. Li, H., Chen, W., Shen, I.: Segmentation of Discrete Vector Fields. IEEE Transactions on Visualization and Computer Graphics 12(3), 289–300 (2006)
10. Cremers, D.: Motion Competition: Variational Integration of Motion Segmentation and Shape Regularization. In: Van Gool, L. (ed.) DAGM 2002. LNCS, vol. 2449, pp. 472–480. Springer, Heidelberg (2002)
11. Cremers, D., Soatto, S.: Motion Competition: A Variational Approach to Piecewise Parametric Motion Segmentation. International Journal of Computer Vision 62, 249–265 (2005)
12. Roy, T., Debreuve, É., Barlaud, M., Aubert, G.: Segmentation of a Vector Field: Dominant Parameter and Shape Optimization. Journal of Mathematical Imaging and Vision 24, 259–276 (2006)
13. Lucas, B., Kanade, T.: An Iterative Image Registration Technique with an Application to Stereo Vision. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679 (1981)
14. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004)
15. Pei, J., Yang, X., Gao, X., Xie, W.: Weighting Exponent m in Fuzzy C-Means (FCM) Clustering Algorithm. In: SPIE Multispectral Image Processing and Pattern Recognition, pp. 246–251 (2001)
Learning Group Activity in Soccer Videos from Local Motion Yu Kong1,2, Weiming Hu1 , Xiaoqin Zhang1 , Hanzi Wang3 , and Yunde Jia2 1
National Laboratory of Pattern Recognition, Institute of Automation, P.R. China 2 Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, P.R. China 3 School of Computer Science, The University of Adelaide, SA 5005, Australia {kongyu,jiayunde}@bit.edu.cn, {wmhu,xqzhang}@nlpr.ia.ac.cn,
[email protected]
Abstract. This paper proposes a local motion-based approach for recognizing group activities in soccer videos. Given the SIFT keypoint matches on two successive frames, we propose a simple but effective method to group these keypoints into the background point set and the foreground point set. The former one is used to estimate camera motion and the latter one is applied to represent group actions. After camera motion compensation, we apply a local motion descriptor to characterize relative motion between corresponding keypoints on two consecutive frames. The novel descriptor is effective in representing group activities since it focuses on local motion of individuals and excludes noise such as background motion caused by inaccurate compensation. Experimental results show that our approach achieves high recognition rates in soccer videos and is robust to inaccurate compensation results.
1
Introduction
Group activity recognition is an important research topic in the computer vision community which aims to make machines recognize activities of a group of people. In contrast to an individual activity, a group activity is a behavior of multiple individuals who unite and act for the same goal. In a soccer match, two teams attack, defend or challenge for the ball. In this paper, we aim to recognize their activities from a video sequence captured by a moving camera. There are two main problems in our task. One is ego-motion compensation. In a soccer video, the court view camera is tracking the ball throughout the whole soccer match. Thus, ego-motion compensation is required. In previous work, compensation is usually conducted by finding matches of keypoints (e.g. KLT features [1], [2]) on consecutive frames and then estimating camera motion parameters using the coordinates of the keypoints ([3], [4]). An affine model is usually employed to model camera motion. The other key problem is action representation. A common way for action representation is first to detect spatial-temporal interest points [5] and then
represent an action by a set of quantized local features around the detected interest points [6], [7]. As for group action representation, [4] use the direction of global motion to represent group actions and a novel two-channeled motion descriptor is proposed to characterize global motion. In this paper, we propose a novel approach based on local motion information for group activity recognition. Given keypoint matches on successive frames, our method groups these keypoint matches into two sets, the background point set which is used for estimating the parameters of camera motion model, and the foreground point set which is utilized to represent group actions. After that, relative motion between the corresponding foreground keypoints on two consecutive frames is computed and then quantized to create relative motion words. Group actions in our work are represented by a collection of quantized relative motion features. In contrast to global motion such as the optical flow, our approach concentrates on local motion information and is thus robust to inaccurate compensation results.
2
Related Work
An important issue in action recognition is to detect spatial-temporal local regions of human movement. Laptev and Lindeberg [5] extended spatial interest points into spatial-temporal directions to detect regions with variations in both spatial and temporal dimensions. D´ollar et al.[7] detected interest points by local maxima of a response function calculated by separable linear filters. The SIFT descriptor [8] is a widely used technique in object recognition and is extended to a 3-dimensional descriptor to describe local regions of human actions [6]. The above spatial-temporal interest points (STIPs) are suitable for large scale human figure. However, figure size in soccer videos is much smaller than that in regular human action videos. Thus, salient regions in soccer videos may not be well detected by STIPs. In addition, our task is slightly different from previous work in human action recognition which usually concerns actions of a single human, while we focus on actions of a group of people. Therefore, STIPs are not capable of representing motion of a group of people. Another critical problem in action recognition is action representation. Recently, a common way for representing action is using quantized local features around interest points. In [9], action representation is derived by using 2D quantized local features of motion images word and quantized angles and distances between visual words and the reference point. Liu and Shah [10] investigated the optimal number of video-words by using Maximization of Mutual Information clustering. With action representation, many approaches such as generative models (e.g. pLSA [11], hierarchical graphical models [12], LDA [13]) and discriminative models (e.g. SVM [14]) can be employed for recognition. In recent years, player activity analysis in sport videos has been widely investigated. Efros et al.[15] proposed a motion descriptor based on the optical flow to recognize individual action in medium view on soccer court. Their method split the optical flow vector field into four non-negative channels and constructed a
Fig. 1. The flowchart of our approach
motion-to-motion similarity matrix by summing up the frame-to-frame similarities over a temporal window. A slice-based optical flow histograms (S-OFHs) approach was proposed in [16] to recognize the left-swing or right-swing of a player in tennis videos. Kong et al. [4] proposed a global motion-based approach for recognizing group actions. The global motion in their work is derived by computing the optical flow.
3
Our Approach
The flowchart of our approach is shown in Fig.1. Given a video sequence, the SIFT keypoint matches are first detected on two successive frames. Then these keypoints are grouped into two sets, the background point set and the foreground point set. Points in the background point set are used to compute the parameters of the affine transformation model, and points in the foreground point set are applied to represent group actions. After compensation, we first compute the relative motion between corresponding keypoints in the foreground point set on two successive frames and then create a relative motion codebook by using a bag-of-words strategy to quantize the relative motion. With this codebook, each action is represented as a collection of relative motion words and classified by the SVM. 3.1
Ego-motion Compensation
The camera involved in soccer videos always moves to track the ball during broadcasting. Thus, an ego-motion compensation technique is necessary to transform frames into the same coordinate system. Keypoint Detection and Matching. Given a video, the SIFT [8] keypoints are first detected on each frame. Then matches of the detected keypoints are found on the next frame. Keypoints with no matches (e.g. on objects that are temporarily occluded) will be discarded. However, the detected keypoints are not only associated with background objects but also with foreground objects such as players (see Fig.2). Thus, keypoints on moving objects should be eliminated to make the compensation accurate. In our work, we group these keypoints into two sets based on a distance variation measure.
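As a hypothetical illustration of this step, the sketch below detects and matches SIFT keypoints with OpenCV and discards keypoints without matches; Lowe's ratio test [8] is used here as the matching criterion, which is an assumption of the sketch rather than a detail stated in the paper.

import cv2
import numpy as np

def sift_matches(frame_t, frame_t1, ratio=0.75):
    # frame_t, frame_t1: consecutive grayscale frames (OpenCV >= 4.4 assumed).
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(frame_t, None)
    kp2, des2 = sift.detectAndCompute(frame_t1, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pts_t, pts_t1 = [], []
    for m, n in matcher.knnMatch(des1, des2, k=2):
        if m.distance < ratio * n.distance:      # keep only distinctive matches
            pts_t.append(kp1[m.queryIdx].pt)
            pts_t1.append(kp2[m.trainIdx].pt)
    return np.array(pts_t, dtype=float), np.array(pts_t1, dtype=float)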
Background Point Selection. We aim to group the detected keypoint matches into two sets according to the objects they are associated with. One set is the background point set, which consists of points associated with static objects such as the soccer court; points in this set are used to compute the affine transformation parameters. The other set is the foreground point set, which contains keypoints detected on moving objects such as players; points in this set are utilized in group action representation. Since keypoints may be associated with moving objects (e.g. players) or static objects (e.g. the soccer court), there are three types of relationships for a pair of keypoints: both points on static objects; one point on a moving object and the other on a static object; and both points on moving objects. The distance variation on two successive frames for two points associated with static objects is within a small range, while it is somewhat large for the case where one point is on a static object and the other on a moving object. However, the distance change on two consecutive frames for two points on moving objects is uncertain: it can be within a small range or larger than a threshold. Since many of the keypoints in a soccer video are associated with static objects, we can distinguish points on static objects from points on moving objects by counting, for a given keypoint, the number of keypoints whose distance to it varies only within a small range. The more keypoints meet this requirement, the more probable it is that the given keypoint is associated with a static object. Based on the above discussion, we first compute the n ∗ n Euclidean distance matrix Dt between pairs of keypoints on the frame at time t, where n is the number of detected keypoints on the frame. Then we compute the difference matrix D between the two distance matrices Dt and Dt+1 of two frames: D = Dt − Dt+1. For a keypoint pi, its score is computed as the number of points whose distance variation to pi on two successive frames is within a small range. Points with scores higher than a threshold γ are grouped into the background point set. The remaining points are treated as foreground points used for group action representation. Fig.2 shows that our method has accurately discriminated the foreground points from the background points.
Fig. 2. Results of keypoints grouping. Points in this figure are keypoint matches on two successive frames. The red crosses are associated with moving objects and the white points are associated with static objects. This figure is best viewed in color.
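A minimal NumPy sketch of this grouping step is given below; the function name is illustrative, while the default δ and the default choice of γ follow the settings reported in Section 4.

import numpy as np

def group_keypoints(pts_t, pts_t1, delta=0.3, gamma=None):
    # pts_t, pts_t1: (n,2) coordinates of matched keypoints on frames t and t+1.
    def pdist(p):
        return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=2)
    D = pdist(pts_t) - pdist(pts_t1)              # difference of distance matrices
    score = (np.abs(D) <= delta).sum(axis=1) - 1  # partners with nearly constant distance
    if gamma is None:                             # default threshold of Section 4:
        gamma = 0.5 * (score.max() + score.min())
    bg = score > gamma
    return bg, ~bg                                # background mask, foreground mask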
Ego-motion Compensation. The compensation technique employed in our work is similar to [4]. Since only pan and tilt are considered, a linear affine model is used to model camera motion:

\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} = T_t^{t+1} \begin{pmatrix} x_t \\ y_t \end{pmatrix} + b = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x_t \\ y_t \end{pmatrix} + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}, \qquad (1)

where (x_t, y_t) is the 2D coordinate of a keypoint on the tth frame, and T_t^{t+1} and b are parameters derived by minimizing the sum of squared differences. After the ego-motion compensation phase, two consecutive frames are transformed into the same coordinate system and can then be used in motion descriptor extraction.
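The affine parameters can be estimated from the background matches by ordinary least squares, as in the following sketch (illustrative, not the authors' code):

import numpy as np

def fit_affine(bg_t, bg_t1):
    # bg_t, bg_t1: (n,2) background keypoints on frames t and t+1 (n >= 3).
    A = np.hstack([bg_t, np.ones((len(bg_t), 1))])      # rows [x_t, y_t, 1]
    P, *_ = np.linalg.lstsq(A, bg_t1, rcond=None)       # minimizes the sum of squared differences
    T = P[:2].T                                         # [[a1, a2], [a3, a4]]
    b = P[2]                                            # [a5, a6]
    return T, b

# Warping frame-t coordinates into the frame-(t+1) system: pts_warped = pts @ T.T + b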
3.2 Computing Local Motion Descriptor
A group action is a behavior of a group of individuals who unite and act for the same goal. In soccer videos, the motion of the two groups is correlated: one group attacking leads to the defending activity of the other group. Therefore, we model the motion of the two groups jointly rather than differentiating the two groups and modeling their motion separately. In our work, we utilize local motion information to represent group actions in soccer videos.

Local Motion Representation. In our work, local motion information is obtained by computing the relative motion of the points in the foreground point set. After the ego-motion compensation phase, all points in two consecutive frames are in the same coordinate system. Thus, the relative motion of foreground points can be accurately derived by measuring the distance between corresponding foreground keypoints of the two frames. Assume that f_t^{t+1} is the foreground point set containing keypoint matches between frames t and t+1. In f_t^{t+1}, the two-dimensional coordinate of the ith keypoint at frame t is f_t(i) = (x_i, y_i) and the coordinate of its matched keypoint at frame t+1 is f_{t+1}(i). Then the relative motion of the ith keypoint on the two successive frames is

r_i = f_{t+1}(i) - \big(T_t^{t+1} f_t(i) + b\big), \qquad (2)
where Ttt+1 and b are the affine transformation parameters derived in Eq.(1). In our representation, the polar coordinate system is utilized to capture the angles and the distances between elements in a relative motion vector and the origin (0, 0). We first transform the Cartesian coordinates of a relative motion vector r to the polar coordinates: r = (ρ, θ), where ρ is the distance from the origin to the elements in the relative motion and θ is a counterclockwise angular displacement. The procedure of generating a relative motion word is illustrated in Fig.3. Provided that the radial coordinate is divided into K bins and the angular coordinate is divided into N equal bins, then each element in the relative motion vector r can be put into one of the K ∗ N bins to generate a 2D descriptor. After this phase, the angle words and the distance words are generated for elements
Fig. 3. Procedure of generating a relative motion word. Each point in the figure is an element in a relative motion vector.
in the relative motion vector. We simply concatenate the 2D bins and construct the relative motion codebook with a size of K ∗ N. After relative motion word generation, the relative motion of keypoints in the foreground point set on two successive frames is represented by a collection of relative motion words. To represent a group action, a histogram of the relative motion words is computed on every two consecutive frames. The histograms for all pairs of frames are summed and then normalized to serve as the feature of a group action. Compared with global motion information, our representation excludes background motion caused by inaccurate compensation and focuses on the local motion of players. Thus, the motion information in our representation is more reliable. This can be seen in the robustness test in Section 4.1.
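The computation of Eq. (2) and the quantization into relative motion words can be sketched as follows; the radial and angular bin edges follow the settings listed in Section 4, while the function name and the per-pair normalization are assumptions of the sketch.

import numpy as np

def relative_motion_histogram(fg_t, fg_t1, T, b, rho_edges=(2.25, 4.0), n_angle=4):
    # fg_t, fg_t1: (n,2) foreground matches; T, b: affine parameters of Eq. (1).
    r = fg_t1 - (fg_t @ T.T + b)                         # relative motion, Eq. (2)
    rho = np.linalg.norm(r, axis=1)
    theta = np.mod(np.degrees(np.arctan2(r[:, 1], r[:, 0])) + 45.0, 360.0)
    k_bin = np.digitize(rho, rho_edges)                  # K = 3 distance bins
    n_bin = np.minimum((theta // (360.0 / n_angle)).astype(int), n_angle - 1)
    words = k_bin * n_angle + n_bin                      # concatenated 2-D bin index
    K = len(rho_edges) + 1
    hist = np.bincount(words, minlength=K * n_angle).astype(float)
    return hist / max(hist.sum(), 1.0)                   # one histogram per frame pair

# The per-pair histograms of a video are summed and re-normalized to give the
# final group-action feature fed to the SVM.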
3.3 Action Recognition
Given a new video, the task is to classify it into one of the action classes. In our approach, the linear Support Vector Machine (SVM) [18] is employed to classify group actions. SVM is a powerful technique in classification tasks: it first maps the data to a high-dimensional space and then looks for the separating hyperplane with the largest margin. Suppose the training data are represented as {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i (i = 1, 2, ..., n) denotes the action video descriptor and y_i is the corresponding class label. We apply a one-versus-one approach for multi-class classification, in which k(k − 1)/2 classifiers are constructed, and a voting strategy is used in classification.
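For completeness, a short scikit-learn sketch of this classifier is given below; sklearn's SVC wraps LIBSVM [19] and already performs one-versus-one voting internally, and the variable names are placeholders.

from sklearn.svm import SVC

def train_action_svm(X, y):
    # X: (n_videos, K*N) relative-motion-word histograms; y: action labels.
    clf = SVC(kernel='linear')       # LIBSVM-backed, one-versus-one with voting
    return clf.fit(X, y)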
4
Experiments
We tested our method on the soccer dataset from [4]. The dataset contains 3 group actions: “left side attacking” (LA), “left side defending” (LD) and “stalemate” (ST). There are 40 videos for each action to provide a total of 120 videos. Example frames of the action categories are shown in Fig.4.
Fig. 4. Example frames from video sequences of the dataset. The first row is the frames of “left side attacking” action. The second row shows the “left side defending” action and the third row shows the “ stalemate” action. The dataset contains 120 videos.
The default experiment settings are as follows. The distance variation range δ is set to [−0.3, 0.3] and the score threshold γ is set to the average of maximum and minimum scores of points in the current frame. Empirically, the radial coordinate is divided into 3 bins: [0, 2.25), [2.25, 4) and [4, +∞) and the angular coordinate is divided into 4 bins: [−45◦ , 45◦ ), [45◦ , 135◦ ), [135◦ , 225◦ ), and [225◦ , 315◦ ). The LIBSVM [19] is used as the implementation of SVM. We conducted experiments on the soccer dataset to 1) demonstrate the performance of our approach for group activity recognition; 2) test the robustness of our approach using different compensation results; and 3) compare our approach with previous methods. 4.1
Experimental Results
Performance of our approach. We evaluated our approach on the soccer dataset using both leave-one-out (LOO) and split training strategies. In the

Table 1. Confusion matrix of our approach using (a) the leave-one-out training strategy and (b) the split training strategy. The recognition rates of our approach are respectively 81.7% and 90.0% with the LOO and split strategy.

(a) Leave-one-out training strategy
Action   LA     ST     LD
LA       0.83   0.17   0
ST       0      0.93   0.07
LD       0      0.30   0.70

(b) Split training strategy
Action   LA     ST     LD
LA       0.93   0.07   0
ST       0      0.87   0.13
LD       0      0.10   0.90
LOO strategy, for each type of actions, one video is selected as the test data and the rest are used as the training data. In the split strategy, we randomly selected 10 videos of each action as the training data and the rest as the testing data. Confusion matrices of our approach are shown in Table 1. Our approach achieved 81.7% accuracy with the LOO strategy and 90.0% with the split strategy. Results show that the proposed local motion descriptor is a discriminative feature for group actions. As can be seen from Table 1 (a) and (b), there is no misclassification between “left side attacking” and “left side defending” action. This is because our approach focuses on relative motion of moving objects which is discriminative between the two actions. In Table 1(a), some “left side defending” videos are misclassified as “stalemate” while the misclassifying rate is lower with split training strategy (see Table 1(b)). This is because the contents in some “left side defending” action videos are somewhat similar to the “stalemate” action. For those “left side defending” action videos, the left side try to intercept the ball. Although the significant motion of these videos is from the right side to the left, their local motion information is similar to that of “stalemate” action videos. If videos with these complex situation are trained, the predictive model is inaccurate and thus leads to misclassification. Robustness test. Different parameters δ and γ in the point grouping phase generate different affine transformation models and thus lead to different compensation results. To evaluate the sensitivity of our approach to compensation results, we used several different parameters. Parameter δ is set to five ranges: [−0.1, 0.1], [−0.2, 0.2], [−0.3, 0.3], [−0.4, 0.4], and [−0.5, 0.5], and γ is set to three functions of scores: mean scores, median scores, and average of the maximum and minimum scores to provide a total of 15 possible parameter combinations. Results are illustrated in Fig.5. As shown in the figure, compensation results influence recognition rates but the influence is slight. With different parameter combinations, all the final recognition rates remain above 85%. Although compensation results are different and thus affect relative motion information, thanks to the bag-of-words strategy, the quantized local motion (relative motion words) is slightly affected and it generates similar descriptors for group action representation. Therefore, our approach is able to achieve high accuracy with inaccurate compensation results.
Fig. 5. Recognition rates using different compensation results
Table 2. The performance of the approaches for recognizing group actions

Method           Accuracy
Kong et al. [4]  86.4%
Histograms       58.3%
S-OFHs           52.1%
Ours             90.0%
Comparison experiments. We compared our approach with three global motion-based methods in [4], i.e. the proposed method in that paper, the histogram-based method and the S-OFHs-based method. The last two methods were used as the comparison methods in [4]. The global motion in the three comparative methods is derived by the optical flow but is described by different descriptors. They were tested on the dataset with 126 videos using the split training strategy. Following their training strategy, we tested our approach and show the results in Table 2. From Table 2, we can see that our method outperforms the other three methods. This is because our local motion descriptor is more discriminative for recognizing group actions than the descriptors used by the comparative methods. Thanks to the local motion descriptor, our approach is able to capture player motion and exclude background motion caused by inaccurate compensation results. Thus, the motion information in our approach is more accurate. The three competing methods utilize optical flow to represent global motion. Since optical flow is sensitive to noise, the global motion in their methods is inaccurate and thus results in misclassification. We also compared the computational costs of our approach and the methods in [4]. Our approach (implemented in unoptimized C code) takes about 0.8 seconds to extract features from one frame, while the comparative methods (implemented using OpenCV) take about 0.2 seconds.
5
Conclusion
We have proposed a local motion-based approach for recognizing group activities in soccer videos. Our approach focuses on the local motion of individuals and excludes noise such as background motion caused by inaccurate compensation. A simple but effective method is proposed to group the given keypoint matches into the background point set and the foreground point set. To effectively represent group actions, we propose a novel approach that makes use of the relative motion of individuals and represents group activities using a bag-of-words paradigm. Results show that our approach achieves high recognition rates and is robust to inaccurate compensation results.
Acknowledgement This work is partly supported by NSFC (Grant No. 60825204) and National High-Tech R&D Program of China (Grant No.2009AA01Z323).
References

1. Shi, J., Tomasi, C.: Good Features to Track. In: Proc. CVPR (1994)
2. Tomasi, C., Kanade, T.: Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University (April 1991)
3. Jung, B., Sukhatme, G.S.: Real-time motion tracking from a mobile robot. Technical report, University of Southern California (2005)
4. Kong, Y., Zhang, X., Wei, Q., Hu, W., Jia, Y.: Group action recognition in soccer videos. In: Proc. ICPR (2008)
5. Laptev, I., Lindeberg, T.: Space-time interest points. In: IEEE ICCV (2003)
6. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application to action recognition. In: Proc. ACM Multimedia (2007)
7. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS (2005)
8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
9. Zhang, Z., Hu, Y., Chan, S., Chia, L.T.: Motion context: A new representation for human action recognition. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 817–829. Springer, Heidelberg (2008)
10. Liu, J., Shah, M.: Learning Human Actions via Information Maximization. In: Proc. CVPR (2008)
11. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: BMVC (2006)
12. Niebles, J.C., Fei-Fei, L.: A hierarchical model of shape and appearance for human action classification. In: IEEE CVPR (2007)
13. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision 79(3), 299–318 (2008)
14. Schüldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: IEEE ICPR (2004)
15. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. 9th Int. Conf. Computer Vision, vol. 2, pp. 726–733 (2003)
16. Zhu, G., Xu, C., Huang, Q., Gao, W.: Action Recognition in Broadcast Tennis Video. In: Proc. ICPR (2006)
17. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of Imaging Understanding Workshop, pp. 121–130 (1981)
18. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
19. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
Combining Discriminative and Descriptive Models for Tracking Jing Zhang, Duowen Chen, and Ming Tang National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China {jzhang84,dwchen,tangm}@nlpr.ia.ac.cn
Abstract. In this paper, visual tracking is treated as an object/background classification problem. Multi-scale image patches are sampled to represent object and local background. A pair of binary and one-class support vector classifiers (SVC) are trained in every scale to model the object and background discriminatively and descriptively. Then a cascade structure is designed to combine SVCs in all scales. Incremental and decremental learning schemes for updating SVCs are used to adapt the environment variation, as well as to keep away from the classic problem of model drift. Two criteria are originally proposed to quantitatively evaluate the performance of tracking algorithms against model drift. Experimental results show superior accuracy and stability of our method to several state-of-the-art approaches. Keywords: Tracking, SVM, cascade structure, performance evaluation.
1
Introduction
Visual tracking is an important computer vision problem and has been investigated during the past decades. Most existing algorithms are able to track object in short duration and in well controlled environments. However, tracking remains difficult in cases where the appearances of object and background are partially similar, or the appearance of both object and background undergoes rapid changing or deforming. Recently, discriminative approaches have opened a promising direction in tracking area. Avidan [1] uses an off-line learned support vector machine (SVM) as the classifier, embedding it into an optical-flow based tracker. The SVM is not updated online. Tang et al. [2] develop the above work by introducing an incremental learning scheme. Collins et al. [3] cast tracking as a binary classification problem in pixel level. Avidan’s ensemble tracker [4] online learns a strong classifier through Adaboost to label pixels in the next frame. Weak classifiers are learned through least squares method. A similar work is proposed by Grabner
This work is supported by the National Natural Science Foundation of China, Grant Nos. 60835004 and 60572057. Corresponding author.
and Bischof [5], which applies a boosted feature selection algorithm to construct strong classifiers. Lu and Hager [6] propose to model the object and background with two sets of random image patches, respectively. In their algorithm, patches are randomly sampled around the object region estimated based on the last frame. Then a binary classifier is learned and used to classify patches sampled in the new frame. A confidence map depicting how well a patch belongs to the object is computed according to the classification, and then the new estimated location is obtained. Single-scale patches are used during the entire tracking process, regardless of the actual scale of the object. A classic problem in adaptive tracking is the specter of model drift [3], which decreases the tracking accuracy and robustness. It is caused by the weak ability to eliminate false positive samples from the object bounding box. This phenomenon is much more obvious in discrimination-based tracking algorithms if the object is not rectangular whereas the tracking window is, or if the tracking window is larger than the object. By introducing a bidirectional consistency check, Lu and Hager [6] partially overcome this drawback. Up till now, a great number of algorithms have been proposed to track objects in image sequences, and each has its own strengths and weaknesses. Therefore the performance evaluation of tracking algorithms is crucial not only for the further development of algorithms, but for the commercialization and standardization of techniques as well. Although there has been much work investigating the quantitative evaluation of tracking algorithms (e.g., [7,8,9]), there is still, to the best of our knowledge, no work that strictly evaluates the algorithms' performance against model drift. 1.1
Proposed Algorithm
In this paper, we try to develop an effective method to greatly alleviate the negative effect of above classic problem on tracking. The key idea is to construct pairs of discriminative and descriptive models to represent the object. First, two sets of patches in different scales (neither as small as pixel, nor as large as the entire object) are sampled from the object and its local background, capturing both low and high level visual cues. Then, a pair of binary Support Vector Classifier (2-SVC) and one-class Support Vector Classifier (1-SVC) are trained for each scale to model the object. 2-SVC tries to identify the most discriminative object patches, while 1-SVC tries to maintain a descriptive model for object appearance. The combination of 1-SVC and 2-SVC is to keep both discriminative and descriptive object models, relieving the model drift problem greatly. A cascade-feedback structure is designed to produce a more reliable confidence map. Different strategies are adopted for relearning 2-SVC and incrementally updating 1-SVC. The algorithm flow is briefed in Figure 1. In this paper, we also propose two criteria to evaluate the performance of tracking algorithms against the drift of location and trajectory, and compare our method with other three state-of-the-art algorithms [4,3,6]. We believe that the model drift problem is one of the main reasons to cause the location drift and
Fig. 1. Flow Chart. Two scales of patches are extracted from frame t. For small patches, a combination of 2-SVC and 1-SVC obtained from frame t-1 are used to detect the probable object regions. Then, these regions are transferred to form large patches. Large patches are classified by large scale 2-SVC and 1-SVC to reduce the false positive rate. Meanwhile, the resulting object regions from large scale classifiers are back propagated to the small patches. Then tracking is conducted in the small patch level. After obtaining the object location, both 2-SVC and 1-SVC are updated in different patch scales.
trajectory offset. Therefore, the smaller the location drift and trajectory offset, the smaller the model drift is. Another reason we choose these three algorithms for comparison is that they adopt a similar philosophy to represent objects, which largely avoids performance differences caused by different object representations. The rest of the paper is organized as follows. Section 2 gives a brief review of the basic component of our approach, two types of support vector classifiers. Section 3 explains the proposed algorithm in detail; Section 4 shows experimental results. Section 5 gives the conclusion.
2
Background
2.1
Binary Support Vector Classifier (2-SVC) [10]
Given a training set X = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^d and y_i = ±1, 2-SVC obtains an optimal classification hyperplane, f(x) = ⟨ω, Φ(x)⟩ + b, in the kernel space through the following optimization:

\min_{\omega,\xi,b} \; \frac{1}{2}\|\omega\|^2 + C_+ \sum_{i=1}^{N_+} \xi_i + C_- \sum_{i=1}^{N_-} \xi_i, \quad \text{s.t. } y_i(\langle \omega, \Phi(x_i) \rangle + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \qquad (1)

where Φ(x) is a kernel mapping and the ξ_i are slack variables. To deal with the unbalanced training problem, C_+ and C_- are introduced to give different penalties for false negatives and false positives.
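A hedged scikit-learn sketch of such an unbalanced binary SVC is shown below; the kernel and the weight values are illustrative choices rather than those of the paper, and class_weight plays the role of C+ and C-.

from sklearn.svm import SVC

def train_2svc(X, y, c_pos=2.0, c_neg=1.0):
    # X: patch feature vectors; y: labels in {+1, -1}.
    # class_weight scales the penalty C per class, mirroring C+ and C- in Eq. (1).
    return SVC(kernel='rbf', C=1.0, class_weight={1: c_pos, -1: c_neg}).fit(X, y)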
2.2
One-Class Support Vector Classifier (1-SVC) [11]
1-SVC only uses positive data to obtain an optimal classification hyperplane, f(x) = ⟨ω, Φ(x)⟩ − ρ. The optimization is as follows:

\min_{\omega,\xi,\rho} \; \frac{1}{2}\|\omega\|^2 + \frac{1}{vN} \sum_{i=1}^{N} \xi_i - \rho, \quad \text{s.t. } \langle \omega, \Phi(x_i) \rangle \ge \rho - \xi_i, \ \xi_i \ge 0, \qquad (2)
where v ∈ (0, 1] is a parameter. Essentially, 1-SVC provides an approach to modeling the object descriptively.
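The descriptive model can be sketched with scikit-learn's OneClassSVM, whose nu parameter corresponds to v; reading the score functions f in Alg. 1 (Section 3) as signed decision values is an assumption of this sketch.

from sklearn.svm import OneClassSVM

def train_1svc(X_pos, v=0.1):
    # X_pos: feature vectors of positive (object) patches only; nu plays the role of v.
    return OneClassSVM(kernel='rbf', nu=v).fit(X_pos)

# h^s(p) = f_2svc(p) + f_1svc(p) can then be evaluated as, for example,
# h = clf_2svc.decision_function(feats) + clf_1svc.decision_function(feats)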
3
Hybrid Kernel Machine Tracking
Our algorithm integrates 2-SVC and 1-SVC pairs in different scales through a cascade scheme. We refer to this approach as Hybrid Kernel Machine Tracking. The whole procedure is outlined in Alg.1.
Algorithm 1. Hybrid Kernel Machine Tracking (HKMT)

- Initialization (for frame F0): given initial bounding box R0, with center location X0
  1. Acquire positive and negative samples in each of two patch scales.
  2. Train 2-SVC in each scale using both positive and negative samples.
  3. Train 1-SVC in each scale using samples with positive output of 2-SVC and within R0.
  4. h^s(p) = f^s_{2-SVC}(p) + f^s_{1-SVC}(p), where f^s_{2-SVC}(p) and f^s_{1-SVC}(p) are the scores of 2-SVC and 1-SVC, respectively, in scale s.

- For (each new frame Ft, t = 1, 2, ..., T): given last location X_{t-1}, and h^s(p), s ∈ {s0, s1}, s0 < s1.
  1. Sample patches in scales s0 and s1 from Ft to form patch sets P_t^{s0} and P_t^{s1}.
  2. Build a confidence map c^{s0}(x_p, y_p) based on h^{s0}(p) using all patches in P_t^{s0}.
  3. Filter patches in P_t^{s1} using c^{s0}(x_p, y_p) to eliminate definitely negative samples.
  4. Build a confidence map c^{s1}(x_p, y_p) based on h^{s1}(p) using the filtered patches in P_t^{s1}.
  5. Feed back c^{s1}(x_p, y_p) to c^{s0}(x_p, y_p) to get a refined c^{s0}(x_p, y_p).
  6. Locate the object in the current frame using the refined confidence map c^{s0}(x_p, y_p).
  7. Sample patches in every scale according to the new location.
  8. Train new 2-SVCs and incrementally learn the 1-SVCs.
- end For
3.1
Initialization
We manually label the object to be tracked with a bounding box, R0, centered at X0 in the first frame. Multi-scale patches are uniformly sampled around and within R0. Patches totally inside the box are defined as object patches (positive
samples), and the remaining ones are background patches (negative samples). A feature vector is constructed for each patch. Then a pair of 2-SVC and 1-SVC are trained in each scale. See the Initialization of Alg.1 for details. 3.2
Locating Object
For a newly arrived frame t, we randomly sample patches in two scales with a uniform distribution from the image region around the estimated location X_{t-1}, forming two patch sets {p}_t^s, s ∈ {s_0, s_1}. Feature vectors are constructed for the sampled patches, and the patches are classified by h^s(p), s ∈ {s_0, s_1}. In order to combine the classification results of patches of different scales, a cascade structure is used.

Cascading Different Scale Classifiers. The cascade structure of classifiers was proposed by Viola and Jones [12]. Such a structure is employed in HKMT with an adaptation to our problem. Our cascade scheme includes two steps. In the first step, the patches in {p}_t^{s_0} are classified by h^{s_0}(p), and h^{s_0}(p) is mapped to a confidence measure c^{s_0}(x_p, y_p) as follows:

c^{s_0}(x_p, y_p) = \begin{cases} 1, & h^{s_0}(p) \ge 1; \\ h^{s_0}(p), & -1 < h^{s_0}(p) < 1; \\ -1, & \text{otherwise}, \end{cases} \qquad (3)

where (x_p, y_p) is the center of p. c^{s_0} is called the s_0-confidence map. In the second step, the s_0-confidence map is used to filter the patch set {p}_t^{s_1}: a patch p ∈ {p}_t^{s_1} is fed to h^{s_1}(p) if it satisfies the following inequality:

\sum_{(u,v) \in p} c^{s_0}(u, v) > 0. \qquad (4)

If (u, v) is not the center of some p ∈ {p}_t^{s_0}, then c^{s_0}(u, v) = 0. Then c^{s_1}(x_p, y_p) is evaluated in a way similar to Eq. (3) to generate the s_1-confidence map, c^{s_1}.

Locating Object. In principle, the object can be located in the new frame with c^{s_1}. This confidence map, however, is much sparser than c^{s_0}, resulting in unstable tracking. Usually, c^{s_0} contains more false positives and false negatives than c^{s_1} does, due to the weak discriminativity of small-scale patches. Large-scale patches that cover small ones can tell whether the covered patches are positive or negative. Therefore, c^{s_0} and c^{s_1} are fused to generate the refined c^{s_0}. Because there may be many large patches which cover the same small one, the value of the patch with the largest h^{s_1}(p) is transferred to the small one. The transferred value for the small-scale patch centered at (u, v) is

t(u, v) = \max_{p:\,(u,v) \in p} h^{s_1}(p). \qquad (5)

Combining this transferred value, the refined c^{s_0} is defined as follows:

c^{s_0}(x_p, y_p) = \begin{cases} t(x_p, y_p) + c^{s_0}(x_p, y_p), & t(x_p, y_p) > 0; \\ t(x_p, y_p), & \text{otherwise}. \end{cases} \qquad (6)
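The following sketch illustrates Eqs. (3)-(6) on sparse patch sets; the dictionary-based data structures are illustrative, and keeping the original confidence for small patches not covered by any large patch is a choice of the sketch, not something specified by Eq. (6).

import numpy as np

def confidence(h):
    # Eq. (3): clamp classifier scores to [-1, 1].
    return float(np.clip(h, -1.0, 1.0))

def passes_filter(covered_centers, c_s0):
    # Eq. (4): a large patch is evaluated only if the summed small-scale confidence
    # over the small-patch centers it covers is positive (missing centers count as 0).
    return sum(c_s0.get(uv, 0.0) for uv in covered_centers) > 0

def refine(c_s0, h_s1, cover):
    # c_s0: dict {(x, y): confidence} over small-patch centers;
    # h_s1: list of large-patch scores; cover[j]: small-patch centers inside large patch j.
    t = {}
    for j, score in enumerate(h_s1):                 # Eq. (5): keep the largest score
        for uv in cover[j]:
            t[uv] = max(t.get(uv, -np.inf), score)
    refined = dict(c_s0)
    for uv, tv in t.items():                         # Eq. (6)
        refined[uv] = tv + c_s0.get(uv, 0.0) if tv > 0 else tv
    return refined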
Fig. 2. Three confidence maps for small patches, large patches, and refined ones, respectively, from left to right. It is clear that the refined confidence map contains more true positive patches and fewer true negative ones. Besides, it covers more object regions.
Fig. 2 shows two examples of the refined c^{s_0}. Similar to [13,3,6], the object location, X_t, is determined by mean-shift [14] on the refined c^{s_0}. 3.3
Updating 2-SVCs and 1-SVCs
Preparing New Samples. To update the classifiers, we first sample patches from the current frame in two scales. If the background contains little clutter, the bounding box is a good indication of the sample labels. However, the bounding box usually contains much clutter. In our HKMT, the confidence maps obtained above are used to label samples:

L^{s_i}(p) = \begin{cases} 1, & c^{s_i}(x_p, y_p) > 0 \text{ and } (x_p, y_p) \in R_t, \\ -1, & \text{otherwise}, \end{cases} \qquad (7)

where L^{s_i} is the label of patch p in scale s_i, i ∈ {0, 1}.

Updating through Sample Pruning. For updating the 1-SVCs, samples older than T frames are removed from the training set by decremental learning [15], and samples from the new frame are added in an incremental way. For the 2-SVCs, we only use the samples from the most current frame for re-training, because the objective of the 2-SVCs is to capture the instant change of the tracking environment.
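A one-function NumPy sketch of the labeling rule in Eq. (7); the array layout and names are illustrative.

import numpy as np

def label_samples(centers, conf, box):
    # centers: (n,2) patch centers (x, y); conf: (n,) refined confidences;
    # box: (x0, y0, x1, y1) of the estimated object region R_t.
    x0, y0, x1, y1 = box
    inside = ((centers[:, 0] >= x0) & (centers[:, 0] <= x1) &
              (centers[:, 1] >= y0) & (centers[:, 1] <= y1))
    return np.where((conf > 0) & inside, 1, -1)          # Eq. (7)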
4
Experimental Results
We implemented the proposed approach in C++ and tested it on challenging sequences. The object location is labeled manually with a bounding box in the first frame. Only color/intensity features are employed in our experiments. For each patch, three color histograms, 8 bins each one, are constructed for R, G and B, respectively. Then a total 24 dimensional feature vector is constructed
for each patch. For grey image, an 8-bin histogram is constructed. The patch scales are variable with respect to the object bounding box. Two scales are used in our experiments: 10 ∼ 15 pixels squared as small patches, and 1/4 ∼ 1/2 of bounding box in both height and width as large patches. Sampling rate is normally 1% ∼ 16%. 4.1
Accuracy and Stability
In order to evaluate the performance of algorithms against the model drift problem, accuracy and stability are introduced in this paper. Accuracy evaluates the maximal offset of the trajectory away from the ground truth, and stability evaluates the consistency of the trajectory variation with the ground truth. Formally, given the ground truth center, (G_x^t, G_y^t), and the output center, (O_x^t, O_y^t), of the object in frame t, the accuracy and stability on a clip of T frames are defined, respectively, as

A = \max_{1 \le t \le T} \sqrt{(O_x^t - G_x^t)^2 + (O_y^t - G_y^t)^2}, \qquad S = \frac{1}{2T} \sum_{i \in \{x, y\}} \sum_{1 \le t \le T} \big[(O_i^t - G_i^t) - (O_i^{t-1} - G_i^{t-1})\big]^2.
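Both criteria are straightforward to compute from the output and ground-truth center trajectories, as in the sketch below (illustrative; the trajectories are assumed to be stored as (T+1, 2) arrays including frame 0).

import numpy as np

def accuracy_stability(out_xy, gt_xy):
    # out_xy, gt_xy: arrays of shape (T+1, 2) holding (x, y) centers per frame.
    e = np.asarray(out_xy, float) - np.asarray(gt_xy, float)  # per-frame offset O_t - G_t
    A = np.linalg.norm(e, axis=1).max()                       # accuracy: maximal offset
    d = np.diff(e, axis=0)                                    # (O_t - G_t) - (O_{t-1} - G_{t-1})
    S = (d ** 2).sum() / (2 * len(d))                         # stability
    return A, S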
4.2 Single-Scale Patch vs. Multi-Scale Patch
In this subsection, the effectiveness of multi-scale patch sampling is illustrated. Several frames of a challenging sequence are shown in Fig.1. The hand-held camera is extremely unstable, and the object undergoes scale change continually. When only single-scale patches are used in Alg.1, the accuracy and stability are inferior to those obtained with two-scale patches. See Table 1 and Fig.3. The reason that

Table 1. The accuracy and stability comparison between single and multiple scale patch schemes for the sequence in Fig.3, which includes 88 frames. The stability improves with the scale of patches, but the multi-scale scheme performs superiorly to the single-scale one.

            Single Scale                       Multi-Scale
            11 × 11    15 × 15    21 × 21      11 × 11 + 21 × 21
Accuracy    24         21         18           14
Stability   16.26      13.32      9.45         7.01
Fig. 3. Multi-scale patch sampling (red) vs. single-scale patch sampling (blue). Tracking results on frames 12, 27, 39 and 81 are shown from left to right. Confidence map is also marked on the frames.
multi-scale sampling scheme is superior to the single-scale one is that the multi-scale scheme relies on a refined lower-layer confidence map to locate the object. On the one hand, it efficiently eliminates "false positive" small patches with the help of large patches; on the other hand, it covers the object region more densely than the confidence map produced by the single-scale scheme. 4.3
2-SVC vs. 2-SVC+1-SVC
If the object undergoes drastic scale changes, only employing 2-SVC in Alg.1 may fail to track the object. Two examples (362 frames and 726 frames) are shown in Figs.4 and 5, respectively. In Fig.5, the object undergoes both scale changes and occlusion. The single 2-SVC scheme misses the object at frames 30 and 322, respectively, in the two examples. The comparison is summarized quantitatively in Table 2.
Fig. 4. 2-SVC (blue) vs. 2-SVC+1-SVC (red). Tracking results are shown on frames 544, 566, 570, 644, and 674. The object continually undergoes scale change. The model drift is severe in 2-SVC scheme.
Fig. 5. Tracking results on frames 1, 322, 386, 424, 524 and 726 with 2-SVC+1-SVC scheme. The object undergoes continual scale change and occlusion. Single 2-SVC fails to learn a model at frame 322 because the object is similar to background.

Table 2. The accuracy and stability comparison between 2-SVC and 2-SVC+1-SVC schemes for two sequences, Fig.4 (362 frames) and Fig.5 (726 frames)

             2-SVC                2-SVC+1-SVC
             Fig.4     Fig.5      Fig.4     Fig.5
Accuracy     miss      miss       14        7
Stability    miss      miss       7.21      4.16
4.4
Comparison with State-of-the-Art Trackers
We also compare our algorithm with several state-of-the-art algorithms, including Collins et al.'s algorithm [3], Avidan's ensemble tracking [4] and Lu and Hager's algorithm [6] (more experimental results are in the supplemental materials). Clips 1 to 4 include 88, 118, 118, and 118 frames, respectively. Observing that the performance of tracking algorithms is often initialization-dependent, we set two different kinds of initialization. The first is that the bounding box just fits the ground truth outline of the object. The second is that a larger bounding box is used. All sequences shown in Fig.6 are initialized with the larger bounding box. Each of the four algorithms is run ten times on each clip. The average performance is reported in Table 3. It is noted from Table 3 that the accuracy and stability of our algorithm are almost always superior to those of the other three. There are, however, two exceptions. For clips 2 and 3, Collins' tracker performs better than ours in stability. The reasons are (1) the backgrounds contain little clutter, so it is relatively easy to learn a classifier, and (2) the much lower sampling rate of our algorithm accounts for the small difference. When the bounding box is enlarged, our algorithm outperforms Collins' by successfully eliminating false positive portions inside the object box. For the other clips 5 and 6 (several frames are shown in Fig. 6), the backgrounds are extremely cluttered and the object undergoes drastic deformation. Due to their weak ability to eliminate outliers in the bounding box, all the algorithms [3,4,6] fail to track the whole sequences, while our algorithm tracks objects accurately and stably.

Table 3. The accuracy and stability comparison among three state-of-the-art algorithms and ours on clips 1 to 4. The top part is for accuracy, and the bottom one for stability. The accuracy and stability of our algorithm are almost always superior to those of the other three. The text analyzes two exceptions on clips 2 and 3 with the fitted bounding box.

Accuracy
              Fitted Bounding Box                    Enlarged Bounding Box
              clip 1   clip 2   clip 3   clip 4      clip 1   clip 2   clip 3   clip 4
              65×35    50×25    27×27    28×36       80×50    66×37    38×38    38×51
Collins [3]   miss     17       7        8           miss     48       23       miss
Avidan [4]    19       26       17       12          64       miss     miss     miss
Lu [6]        24       20       13       9           51       30       15       14
Ours          14       14       5        8           18       24       5        9

Stability
Collins [3]   miss     3.24     1.56     1.81        miss     7.76     2.25     miss
Avidan [4]    17.91    7.98     5.98     7.00        56.45    miss     miss     miss
Lu [6]        16.34    8.2      2.29     2.45        44.56    8.65     4.49     4.70
Ours          7.01     4.76     1.78     1.76        9.23     4.57     2.19     3.23
Fig. 6. Tracking results for clips 5 and 6. Our algorithm tracks objects accurately and stably, while the others fail.
5
Conclusion
We have proposed an algorithm for accurate and stable tracking in complicated scenes. By combining discriminative and descriptive models and multiscale patches into a cascade scheme, our approach greatly alleviates the classic problem of model drift. Extensive quantitative and qualitative comparisons to several state-of-the-art algorithms demonstrate the superior accuracy and stability of HKMT.
References

1. Avidan, S.: Support vector tracking. PAMI 26(8), 1064–1072 (2004)
2. Tang, F., Brennan, S., Zhao, Q., Tao, H.: Co-tracking using semi-supervised support vector machines. In: ICCV (2007)
3. Collins, R., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Trans. on PAMI 27(10), 1631–1643 (2005)
4. Avidan, S.: Ensemble tracking. IEEE Trans. on PAMI 29(2), 261–271 (2007)
5. Grabner, H., Bischof, H.: On-line boosting and vision. In: CVPR (2005)
6. Lu, L., Hager, D.: A nonparametric treatment for location/segmentation based visual tracking. In: CVPR (2007)
7. Bashir, F., Porikli, F.: Performance evaluation of object detection and tracking systems. In: IEEE Int'l Workshop on PETS (2006)
8. Yin, F., Makris, D., Velastin, S.: Performance evaluation of object tracking algorithms. In: IEEE Int'l Workshop on PETS (2007)
9. Pound, M., Naeem, A., French, A., Pridmore, T.: Quantitative and qualitative evaluation of visual tracking algorithms. In: IEEE Int'l Workshop on PETS (2007)
10. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines. Cambridge University Press, Cambridge (2000)
11. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Computation 13(7), 1443–1471 (2001)
12. Viola, P., Jones, M.: Robust real-time object detection. In: CVPR (2001)
13. Avidan, S.: Ensemble tracking. In: CVPR (2005)
14. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. on PAMI 25(5), 564–577 (2003)
15. Cauwenberghs, G., Poggio, T.: Incremental and decremental support vector machine learning. NIPS (2000)
From Ramp Discontinuities to Segmentation Tree Emre Akbas and Narendra Ahuja Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign
Abstract. This paper presents a new algorithm for low-level multiscale segmentation of images. The algorithm is designed to detect image regions regardless of their shapes, sizes, and levels of interior homogeneity, by doing a multiscale analysis without assuming any prior models of region geometry. As in previous work, a region is modeled as a homogeneous set of connected pixels surrounded by ramp discontinuities. A new transform, called the ramp transform, is described, which is used to detect ramp discontinuities and seeds for all regions in an image. Region seeds are grown towards the ramp discontinuity areas by utilizing a relaxation labeling procedure. Segmentation is achieved by analyzing the output of this procedure at multiple photometric scales. Finally, all detected regions are organized into a tree data structure based on their recursive containment relations. Experiments on real and synthetic images verify the desired properties of the proposed algorithm.
1 Introduction

Low-level image segmentation partitions a given image into regions which are characterized by some low-level properties of their constituent pixels, where the term "low-level" refers to local and intrinsic properties of pixels such as gray-level intensity (or color), contrast, gradient, etc. For a set of connected pixels to form a region, they should have a certain degree of interior homogeneity and a discontinuity with their surroundings, where the magnitude of discontinuity is large compared to the interior variation. Our goal is to detect image regions regardless of their shapes, sizes and levels of interior homogeneity. These goals preclude the use of any prior model about region shape or geometry, and the fact that a region can have any level of homogeneity requires us to do multiscale analysis. Furthermore, we want the algorithm to work without requiring any image-dependent parameter tuning. Achieving these goals is challenging because of the nature of the discontinuities that separate regions. A sharp edge in the 3D world might be mapped to a wide ramp discontinuity in the image due to defocussing and penumbral blur in the image acquisition process. Hence, region boundaries in images, which can have arbitrary shapes, are surrounded by ramp discontinuities of various widths and heights. In the context of our study, we can classify the previous work as either not being multiscale, or imposing models on the geometry of edges. The earliest approaches to segmentation such as thresholding [8], region-growing [8] and watersheds [11] ignore the multiscale aspect of the problem. Energy minimization based approaches such as Markov Random Field modeling [7] and active contours [14] enforce constraints on the
local shape of regions; therefore, they are not capable of detecting arbitrarily shaped regions. Graph theoretical methods such as Normalized Cuts [13] and graph cuts [15] require the number of regions to be given as input, which does not guarantee the detection of regions at all scales. Clustering methods attempt to find regions as clusters in the joint space of pixel positions and some features (e.g., intensity, texture) extracted from pixels. For this, they need either the number of regions or some density estimation parameters [3,4] as input. As discussed in [2], mean-shift based segmentation cannot detect steep corners due to averaging and tends to create multiple boundaries, hence many small regions, for the blurred edges in the image.

In recent years, many segmentation algorithms have been developed by aiming to maximize performance on the Berkeley Segmentation Benchmark Dataset [9], which contains images segmented by humans. We note that these are object-level segmentations and many regions in the images are not marked (i.e., many edges are not marked even if there is strong visual evidence for an edge, or some edges are marked where there is too little or no visual evidence). It is not our goal in this study to segment objects out; instead, we aim to detect low-level image structures, i.e. regions, as accurately and completely as possible so as to provide a reliable and usable input to higher level vision algorithms.

To this end, we follow the line of research in [1] and [2], and develop a new algorithm to achieve the aforementioned goals. As in [1], we use gray-level intensity and contrast as the low-level properties to define and detect low-level image structures. We define an image region as a set of connected pixels surrounded by ramp discontinuities, as done in [2]. We model ramp discontinuities with strictly increasing (or decreasing) intensity profiles. Each ramp discontinuity has a magnitude, or contrast, which allows us to associate a photometric scale with each boundary fragment surrounding regions. We achieve a multiscale segmentation over a range of scales by progressively removing boundary fragments whose photometric scales are less than the current scale of analysis. Finally, all regions detected at all photometric scales are organized into a tree data structure according to their recursive containment relations.

In this paper, we propose a new method, called the ramp transform, for the detection of ramps and the estimation of ramp parameters. At a given pixel, we analyze multiple intensity profiles passing through the pixel in different directions, and estimate the magnitude of the ramp discontinuity at that pixel by minimizing a weighted error measure. After applying the ramp transform, seeds for all image regions are detected and these seeds grow to become complete regions.

Our contributions are: (1) A new segmentation algorithm which detects image regions of all shapes, sizes and levels of intensity homogeneity. It arranges regions into a tree data structure which can be used in high-level vision problems. The algorithm gives better results and is less sensitive to image-dependent parameter tuning compared to the existing popular segmentation algorithms. (2) A new transform, called the ramp transform, which takes an image and outputs another image where the value at each pixel gives a measure of the ramp discontinuity magnitude at that pixel in the original image. (3) A new ground-truth dataset for low-level image segmentation.
To the best of our knowledge, this is the first dataset of its kind for such purpose. This dataset could also be used as a benchmark for edge detection algorithms.
The rest of the paper is organized as follows. Section 2 describes the ramp and region models, the ramp transform and the segmentation algorithm. In Section 3, we present and discuss experimental results, and the paper is concluded in Section 4.
2 The Models and The Algorithm

A set of connected pixels, R, is said to form a region if it is surrounded by ramp discontinuities, and the magnitudes of these discontinuities are larger than both the local intensity variation and the magnitudes of discontinuities within R. To elaborate this definition, we first describe the ramp discontinuity model and the ramp transform.

Ramp model. Consider the 1-dimensional image f given in Fig. 1(a). A ramp is characterized by its strictly increasing (or decreasing) profile. So, the part of the curve between e_1 and e_2 is a ramp. The width of the ramp is |e_1 - e_2|, and its magnitude is |f(e_2) - f(e_1)|. Additionally, we define two measures, "ramp quality" and "point-magnitude", which will help us generalize the ramp model to 2D functions.

Fig. 1. (a) Ramp model. (b) Ramp transform of f(x). C_i is equal to |f(i + a) - f(i - a)| where a = min{|i - e_1|, |i - e_2|}.

We define the "ramp quality" as the ratio between the ramp magnitude and its width, namely |f(e_2) - f(e_1)| / |e_1 - e_2|. The "point-magnitude" at location i is defined as:

C_i = |f(i + a) - f(i - a)|     (1)
where a = min{|i - e_1|, |i - e_2|}. Note that the ramp magnitude and the point-magnitude are different measurements. Both are equal only when the point i is located at equal distances from the endpoints e_1 and e_2, and f(x - i) is an odd function.

In 2D images, computing the point-magnitude of the ramp discontinuity at i, i.e. C(i), is not a trivial task as it is in the 1D case. This is because an infinite number of lines, hence ramp intensity profiles, pass through the pixel i. We assume that a pixel i is within a ramp discontinuity if it has at least one strictly increasing (or decreasing) intensity profile passing through it.

2.1 Ramp Transform

The transform converts the input image I to a scalar height field C. The height at location i in C corresponds to the point-magnitude of the ramp discontinuity at i in the original image I. If I is a 1D image, computing the ramp transform amounts to computing C_i for all i by setting f = I in eq. (1). The ramp transform of the ramp of Fig. 1(a) is given in Fig. 1(b). If I is a 2D image, an infinite number of lines passes through i, and each of these lines has its own intensity profile. To parametrize these lines (and their corresponding intensity profiles), let us define an angle θ, which is the angle that the line makes with a
horizontal row of the image. For a finite set of angles in [0, 2π), we analyze the intensity profiles and measure the corresponding ramp parameters. For the ramp discontinuity at angle θ, let q_i^θ be its ramp quality and c_i^θ its point-magnitude at i. It is tempting to set C_i = \max_θ c_i^θ, but this gives noisy estimates since ramp end points are affected by the noise present in images. To be robust to this noise, we estimate C_i in a weighted least-squares setting using the ramp quality measures as weights:

\hat{C}_i = \arg\min_{C_i} \sum_θ q_i^θ (C_i - c_i^θ)^2  \Longrightarrow  \hat{C}_i = \frac{\sum_θ q_i^θ c_i^θ}{\sum_θ q_i^θ}     (2)
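To make the computation concrete, the following Python sketch estimates the point-magnitude at a single pixel from a set of directional intensity profiles via eq. (2). The profile sampling, the number of directions (n_angles), the profile half-length (half_len), and the ramp-endpoint search are simplifying assumptions for illustration, not the authors' exact implementation.

import numpy as np

def ramp_endpoints(profile, center):
    # Grow a strictly monotonic segment of `profile` around index `center`.
    lo = hi = center
    sign = np.sign(profile[center + 1] - profile[center]) or 1.0
    while lo > 0 and sign * (profile[lo] - profile[lo - 1]) > 0:
        lo -= 1
    while hi < len(profile) - 1 and sign * (profile[hi + 1] - profile[hi]) > 0:
        hi += 1
    return lo, hi

def point_magnitude(image, i, j, n_angles=8, half_len=10):
    # Weighted least-squares estimate of C_i at pixel (i, j), cf. eq. (2).
    h, w = image.shape
    num, den = 0.0, 0.0
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        # Sample an intensity profile through (i, j) along direction theta.
        t = np.arange(-half_len, half_len + 1)
        rr = np.clip(np.round(i + t * np.sin(theta)).astype(int), 0, h - 1)
        cc = np.clip(np.round(j + t * np.cos(theta)).astype(int), 0, w - 1)
        prof = image[rr, cc].astype(float)
        c = half_len                       # index of the center pixel in the profile
        e1, e2 = ramp_endpoints(prof, c)
        if e1 == e2:                       # no monotonic ramp through this pixel
            continue
        a = min(c - e1, e2 - c)
        mag = abs(prof[c + a] - prof[c - a])              # point-magnitude, eq. (1)
        quality = abs(prof[e2] - prof[e1]) / (e2 - e1)    # ramp quality
        num += quality * mag
        den += quality
    return num / den if den > 0 else 0.0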
In the following, we drop the hat of \hat{C}_i and use C_i.

2.2 Obtaining Seeds for Regions

The output, C, of the ramp transform contains the point-magnitudes of the ramp discontinuities found in I. In this section, we first describe our region model and then elaborate on how seeds for all image regions are detected from C.

Region model. A set of connected pixels, R, is said to form a region if: 1) it is surrounded by ramp discontinuities, and 2) the magnitudes of these discontinuities are larger than both the local intensity variation and the magnitudes of discontinuities within R.

To obtain regions that conform with the above definition, we look for the basins of the height field C. To find all basins, all photometric scales, i.e. contrast levels, are traversed from lower to higher, and new basins are marked as they occur. Then, the set of basin pixels, S, contains the seeds for image regions. The remaining set of pixels, D, corresponds to the ramp discontinuity areas, and we call these pixels ramp pixels. We find the connected components in the set S and label each component, which corresponds to a distinct basin of C, with a unique label. If there are N connected components, then each pixel in the set S takes a label from the set L = {1, 2, 3, ..., N}.

2.3 Region Growing by Relaxation Labeling

Having obtained the region seeds (S), we want to grow them by propagating labels towards the ramp discontinuity areas (D), which are unlabeled. For this purpose, we use a relaxation labeling [12] procedure. Although the classical watershed transform might be utilized here, we choose not to use it because it does not give good edge-location accuracy at corners and junctions.

Let i ← ℓ denote the event of assigning label ℓ to pixel i, and P^(t)(i ← ℓ) denote the probability of this event at iteration t. Relaxation labeling iteratively updates these probabilities so that the labeling of pixels gets more consistent as the method iterates. Next, we describe how we compute the initial probability values.

Computing priors. For a pixel that is part of any detected region seed, we define the prior probabilities as P^(0)(i ← ℓ) = 1 if i ∈ R_ℓ, and 0 otherwise, for all i ∈ S and ℓ ∈ L. On the other hand, the prior probabilities of the pixels which are within the ramp discontinuities, i.e. those i ∈ D, are not trivial. To compute these priors, we design a cost function for assigning label ℓ to pixel i, i.e. the event i ← ℓ, as follows.
Consider the scenario given in Fig. 2. Pixel i is within the ramp discontinuity area among regions R_1, R_2, and R_3. The point j_1 is the closest point to i in region R_1 (similarly j_2 for R_2 and j_3 for R_3). Let ij_1 denote the line segment connecting i and j_1. We compute the cost of assigning label 1 to ramp pixel i by analyzing the intensity profile along the line segment ij_1. The cost function is designed in such a way that the flatter the profile is, the lower the cost, and vice versa. We achieve this by summing up finite differences of the intensity profile at regular intervals.

Fig. 2. Computing the initial probabilities for relaxation labeling.

Formally, the cost of assigning label ℓ to a ramp pixel i is given by

G(i ← ℓ) = \sum_{n=1}^{\|i - j_ℓ\| / h} | I_{i + n h u} - I_{i + (n-1) h u} |     (3)

where j_ℓ is the closest pixel to i such that j_ℓ ∈ R_ℓ, h is a small step size, \|i - j_ℓ\| is the distance between i and j_ℓ, and u = (j_ℓ - i) / \|j_ℓ - i\| is a unit vector. To compute the prior probabilities for a ramp pixel i, we use

P^{(0)}(i ← ℓ) = \frac{G^{-1}(i ← ℓ)}{\sum_{k ∈ L} G^{-1}(i ← k)},   for all i ∈ D, ℓ ∈ L.     (4)
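A minimal sketch of eqs. (3)-(4) in Python follows; the step size, the nearest-seed lookup, and the small constant guarding a zero cost are illustrative assumptions.

import numpy as np

def label_cost(image, i, j_l, h=1.0):
    # Eq. (3): sum of finite differences along the segment from pixel i to j_l.
    i = np.asarray(i, float); j_l = np.asarray(j_l, float)
    dist = np.linalg.norm(i - j_l)
    if dist == 0:
        return 0.0
    u = (j_l - i) / dist
    n_steps = int(dist / h)
    cost = 0.0
    prev = float(image[tuple(np.round(i).astype(int))])
    for n in range(1, n_steps + 1):
        cur = float(image[tuple(np.round(i + n * h * u).astype(int))])
        cost += abs(cur - prev)
        prev = cur
    return cost

def priors(image, i, closest_seed_pixels):
    # Eq. (4): normalized inverse costs over all labels for a ramp pixel i.
    # `closest_seed_pixels` holds, per label, the closest seed pixel j_l.
    inv = np.array([1.0 / (label_cost(image, i, j) + 1e-9)
                    for j in closest_seed_pixels])
    return inv / inv.sum()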
Relaxation labeling. Once the probabilities are initialized by P^{(0)}(·), we iteratively update them by the following relaxation labeling update rule:

P^{(t+1)}(i ← ℓ) = \frac{P^{(t)}(i ← ℓ) (1 + Q^{(t)}(i ← ℓ))}{\sum_{k ∈ L} P^{(t)}(i ← k) (1 + Q^{(t)}(i ← k))}     (5)

where Q(·) is defined as

Q^{(t)}(i ← ℓ) = \frac{1}{|N_i|} \sum_{j ∈ N_i} \sum_{k ∈ L} R_{ij}(ℓ, k) P^{(t)}(j ← k).     (6)

Here N_i denotes the neighbors of pixel i, and R_{ij}(ℓ, k), called the compatibility function, gives a measure of how compatible the assignments "i ← ℓ" and "j ← k" are. The constraint on R(·) is that it should return values in the interval [-1, +1]: 1 meaning that the two events are highly compatible, -1 meaning just the opposite. We choose the following form:

R_{ij}(ℓ, k) = \begin{cases} e^{-|I_i - I_j| / s}, & ℓ = k \\ e^{-(M - |I_i - I_j|) / s}, & ℓ ≠ k \end{cases}     (7)

where M is the maximum value that |I_i - I_j| can take for any I, i, j; it is 255 for standard 8-bit images. This compatibility function forces neighboring pixels with similar intensities to have the same labels.
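One way to realize the update (5)-(7) on a regular grid is sketched below; the 4-neighborhood, the wrap-around treatment of image borders, and the scale parameter s are assumptions made for brevity.

import numpy as np

def relaxation_step(P, image, s=10.0, M=255.0):
    # One iteration of eqs. (5)-(7) on a 4-connected pixel grid.
    # P has shape (H, W, n_labels) and sums to one over labels at every pixel;
    # image is a float intensity array of shape (H, W). Borders wrap for brevity.
    Q = np.zeros_like(P)
    n_neighbors = 0
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        Ij = np.roll(image, (dy, dx), axis=(0, 1))
        Pj = np.roll(P, (dy, dx), axis=(0, 1))
        d = np.abs(image - Ij)[..., None]        # |I_i - I_j|
        r_same = np.exp(-d / s)                  # R_ij(l, l)
        r_diff = np.exp(-(M - d) / s)            # R_ij(l, k) for l != k
        sum_Pj = Pj.sum(axis=2, keepdims=True)
        # sum_k R_ij(l, k) P(j <- k) = r_same * P(j <- l) + r_diff * (sum_k P(j <- k) - P(j <- l))
        Q += r_same * Pj + r_diff * (sum_Pj - Pj)
        n_neighbors += 1
    Q /= n_neighbors
    new_P = P * (1.0 + Q)
    return new_P / new_P.sum(axis=2, keepdims=True)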
Final labeling of ramp pixels. When the highest change in any P^{(t)}(·) becomes very small, we stop the iterations and label the ramp pixels with the labels having the maximum probabilities: i ← \arg\max_ℓ P^{(t)}(i ← ℓ).
2.4 Multiscale Segmentation

After relaxation labeling, every pixel in the image has a label; hence, every pixel has joined one of the seeds detected after the ramp transform step. Now we analyze these regions at a finite set of photometric scales, i.e. contrast levels, and produce a multiscale segmentation of the image.

Segmentation of I at a given contrast level σ is defined as the partitioning of I into regions where all boundary fragments have a photometric scale larger than or equal to σ. A boundary fragment is defined to be a connected set of boundary pixels separating two neighboring regions. For a given contrast level σ, a fragment f is said to have a photometric scale of at least σ if it satisfies

\frac{1}{|f|} \sum_{p ∈ f} \mathbf{1}\{C(p) < σ\} < α,

where |f| denotes the length of the fragment and α is a small constant (in all our experiments we set it to 5%). This criterion allows boundary fragments to have a small amount of weak edge pixels, i.e. those having contrast less than σ.
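In code, this criterion for a single fragment can be written as a short helper; the representation of a fragment as a list of pixel coordinates is an assumption.

def has_scale_at_least(fragment_pixels, C, sigma, alpha=0.05):
    # A boundary fragment has photometric scale >= sigma when fewer than an
    # alpha fraction of its pixels have ramp magnitude C below sigma.
    weak = sum(1 for p in fragment_pixels if C[p] < sigma)
    return weak < alpha * len(fragment_pixels)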
Fig. 3. Illustration of steps in the algorithm. (a) Input image I. (b) Output of ramp transform, C, applied to I. Here, the darker the pixel, the higher the contrast of the underlying ramp. (c) Basins of C. Each basin is represented with a different color. These basins correspond to region seeds and the remaining pixels are ramp pixels (white color). (d) Final labeling obtained by growing the region seeds towards the ramp pixels using relaxation labeling. (e,f,g) Results of multiscale segmentation. (e) Segmentation result for photometric scale σ = 5. All regions are included. (f) Segmentation for σ = 65. Two regions (head and the body) merged. This means that the photometric scale of the boundary fragment in between the two merging regions is less than 65. (g) Segmentation for σ = 80. More regions have disappeared. The remaining regions are of photometric scale larger than σ = 80, ensured by the region model and the algorithm. (h) Segmentation tree. On the left, each region is labeled by a number. Using the containment relations of regions, our algorithm computes the tree given on the right hand side.
Regions at all photometric scales are obtained by starting from the lowest photometric scale and removing boundary fragments having contrasts less than σ, progressively as σ increases. This process, which is an agglomerative clustering of regions according to the inequality above, ensures that the remaining regions always conform with the region model and that successive merges create a strict hierarchy.

2.5 Constructing the Segmentation Tree

Due to the nature of the multiscale segmentation process described above, regions merge as the scale of analysis, σ, is increased. This allows us to arrange the regions into a tree data structure. Suppose that regions R_1 and R_2 at photometric scale σ_n have merged and become R_3 at scale σ_{n+1}. Then, in the segmentation tree, R_1 and R_2 should be the children of R_3. Applying this rule recursively for all regions, we obtain a tree of regions, called the segmentation tree, where the root node corresponds to the entire image itself. We illustrate all steps of the algorithm on a simple synthetic image in Fig. 3.
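A sketch of how the tree of Section 2.5 could be assembled from the recorded merges; the node bookkeeping and the merge-event format are illustrative assumptions.

class RegionNode:
    def __init__(self, region_id):
        self.region_id = region_id
        self.children = []

def build_segmentation_tree(merge_events, leaf_ids, image_id="root"):
    # merge_events: list of (sigma, merged_ids, new_id), one entry per merge,
    # processed in order of increasing photometric scale sigma.
    nodes = {rid: RegionNode(rid) for rid in leaf_ids}
    for sigma, merged_ids, new_id in sorted(merge_events):
        parent = RegionNode(new_id)
        parent.children = [nodes.pop(rid) for rid in merged_ids]
        nodes[new_id] = parent
    root = RegionNode(image_id)          # the root corresponds to the whole image
    root.children = list(nodes.values())
    return root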
3 Experiments and Results

We first demonstrate the ramp transform. To show that it can correctly measure the contrast of the underlying ramp edge, we created a synthetic image containing two edges of the same intensity contrast (Fig. 4(a)). One of the edges is a step edge, and the other is a ramp edge. Although they have different widths, we expect that the ramp transform gives similar values for these two edges. Fig. 4(a) illustrates this property. Fig. 4(b) illustrates the ramp transform on a real image (taken from [5] with permission). This image contains ramp edges of various widths. A fixed-length gradient filter is incapable of measuring the ramp magnitudes correctly (see Fig. 4(b), bottom). The ramp transform successfully estimates the pointwise ramp magnitudes as expected. Next, we describe how we quantitatively compared our segmentation algorithm with available algorithms.
Fig. 4. Illustration of the ramp transform. (a) Top: A synthetic image containing a sharp step edge, on the left, and a wide ramp edge on the right, which was obtained by blurring a step edge by a Gaussian kernel of σ = 4. The contrasts of both edges are 100. Middle: Ramp transform of the synthetic image. For both edges the peak value of the response of the transform is 100, which is the contrast of the edges. Bottom: Gradient magnitude, obtained by horizontal and vertical [−1 +1] filters. The responses for two edges are not the same. (b) Ramp transform of a real image. Left: An image containing ramp discontinuities of varying widths. Middle: Ramp transform of the image. Right: Gradient magnitude of the image.
3.1 Quantitative Comparison with Other Algorithms

Creating a ground-truth dataset for low-level image segmentation is a challenging task because 1) segmenting images by hand is a laborious process, and 2) humans use their high-level semantic knowledge while segmenting images, and it is difficult to eliminate this bias. This is why we do not use the Berkeley segmentation dataset [9], where human subjects were asked to segment out objects (see paragraph 4 in Section 1). Our goal in this study is not object-level segmentation; instead, we want to detect all regions as accurately and completely as possible.

The dataset. We address the challenges stated above by having human subjects segment small image patches instead of whole images. This makes segmentation by hand a much easier process and removes the high-level knowledge bias to a large extent, because a small image patch is unlikely to contain objects. We also randomly rotate the image patch (at multiples of 90 degrees) to further reduce this bias. We developed a GUI which, given an image, displays a random patch and allows the user to segment it by drawing polygonal lines.
Fig. 5. (a) An image from the dataset. (b) The patch represented by the upper yellow square on (a) and its ground-truth segmentation. Note that the patch is rotated 90 degrees clockwise. (c) The other patch and its ground-truth segmentation.
We created such a dataset using a set of 15 images. We fixed the size of the patches at 50×50 pixels and randomly extracted 50 patches per image, thus obtaining 750 patches in total. Image sizes range approximately from 250×250 to 500×500 pixels. We used 5 human subjects to hand-segment these patches; each patch was segmented by a single human subject. Since patch extraction was random, patches might overlap. Fig. 5 shows an image from our dataset and two example patches randomly extracted from it. Human segmentations for these patches are also given.

Performance measure. We evaluate the performance of a segmentation algorithm by looking at its segmentation accuracy over the patches for which ground-truth segmentations are provided by human subjects. The segmentation accuracy for a single patch is measured by precision-recall values obtained by comparing the ground truth and the machine's output. We describe this process with an example. Consider the image in Fig. 6(a) and its patch in Fig. 6(b) with its ground truth in Fig. 6(c) (call this G). Suppose that our segmentation algorithm produced the result given in Fig. 6(d); then the segmentation corresponding to the patch is Fig. 6(f) (let this be R). Now, we compare G and R and
Fig. 6. Illustration of the performance measure. (a) An image from our dataset. A patch is marked by the red square. (b) Magnified version of the patch. (c) Provided human segmentation for the patch. (d) Segmentation of (a) obtained by our algorithm. (e) Segmentation of (a) obtained by the mean-shift based algorithm. (f) Our result at the location of the patch (red square in (d)). (g) Matching result between the ground truth (c) and our result (f). Red pixels represent the ground truth, blue pixels represent the algorithm's output (our result, in this case). White lines denote matching pixels. (h) The mean-shift based method's result at the location of the patch (red square in (e)). (i) Matching result between (c) and (h). See (g) for explanation.
find correspondences between the boundary pixels in R and G. Each boundary pixel in G either matches with a single boundary pixel in R, or does not match with anything at all. To find the optimal matching between G and R, we use the method described in Appendix B of [10], which casts the problem as a minimum cost bipartite assignment problem. The result of matching is given in Fig. 6(g), where red pixels denote the boundary pixels of G, and blue pixels denote the boundary pixels of R. The white line between a red pixel and a blue pixel indicates a match.

As done in [10], we measure the goodness of the match by precision-recall. If the image patch is considered as a query, G as its relevant (ground-truth) result, and R as the retrieved result (by the algorithm), we can compute precision and recall as

Recall r = |relevant ∩ retrieved| / |relevant|,   Precision p = |relevant ∩ retrieved| / |retrieved|.

We combine precision and recall using the F-measure defined as f = 2pr / (p + r). For the matching result of Fig. 6(g), the precision-recall and F-measure values are p = 0.61, r = 0.66, f = 0.63, whereas for Fig. 6(i) they are p = 0.31, r = 0.95, f = 0.47. Note that for the latter case, the recall is very high since only a few red pixels are unmatched, but the precision is low because there are plenty of blue pixels that are unmatched.
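As a small helper, under the assumption that a bipartite boundary matching (as in [10]) has already been computed and only the pixel counts are passed in:

def boundary_prf(n_matched, n_ground_truth, n_retrieved):
    # Precision, recall and F-measure of matched boundary pixels.
    recall = n_matched / n_ground_truth if n_ground_truth else 0.0
    precision = n_matched / n_retrieved if n_retrieved else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f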
Comparison. We compared our algorithm with two available algorithms which are widely used in the literature: Felzenszwalb's graph-based algorithm [6] and the mean-shift algorithm [3]. We did not include N-Cuts [13] because it needs the number of regions as input, which is an explicit image-dependent parameter and unknown a priori. Both algorithms require 3 input parameters from the user, and we do not know which values to use. So, we sample a large number of input parameters for both algorithms.
The mean-shift based segmentation method [3] requires a spatial bandwidth σ_s, a range bandwidth σ_r, and a minimum region area a. We selected the following input parameter space: {σ_s, σ_r, a} ∈ {5, 7, 9, 11, 15, 20, 25} × {3, 6, 9, 12, 15, 18, 21, 24, 27} × {5, 10}. The graph-based algorithm [6] expects a smoothing scale σ, a threshold k which is the scale of observation (equation (5) in [6]), and a minimum region size a, as input. We used the following parameter space: {σ, k, a} ∈ {0.5, 1, 1.5, 2} × {250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500} × {5, 10}. Our algorithm outputs a hierarchy of regions. However, in order to compute its segmentation accuracy and compare it with other algorithms, we need single-layer segmentation results. For this purpose we used the photometric segmentation outputs at scales σ = {10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38}.

First, we report the best average F-measure results per image (BAFPI). To get this for an image, we compute F-measures for all patches of that image and average these results for each element in the input parameter space. Formally, BAFPI for image I and algorithm A is

BAFPI(I, A) = \max_{p ∈ S} \frac{1}{50} \sum_{i=1}^{50} F(I_i, A(I, p)_i),

where S is the input parameter space of A, I_i is the ground truth for the i-th patch of I, A(I, p) is the segmentation result of A applied on I with parameters p, A(I, p)_i is the i-th patch of the segmentation result, and F(ptch_1, ptch_2) gives the F-measure between the ground truth ptch_1 and machine output ptch_2. BAFPI results are given in Table 1. Our algorithm outperforms the other two algorithms on all except three images.

Table 1. Best average F-measure per image (BAFPI) results

Images →      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  Avg.
Graph-based 0.47 0.68 0.55 0.55 0.56 0.67 0.66 0.51 0.72 0.54 0.51 0.67 0.63 0.67 0.68 0.60
Mean-shift  0.62 0.80 0.60 0.58 0.70 0.78 0.79 0.63 0.72 0.67 0.65 0.69 0.77 0.76 0.72 0.70
Our method  0.74 0.86 0.69 0.67 0.75 0.81 0.82 0.68 0.79 0.65 0.68 0.65 0.81 0.80 0.71 0.74
The results in Table 1 are useful to show us the best these algorithms could do per image. On the other hand, these results are unrealistic because they assume that, for each image, we know the best input parameters to use, which is not true in practice. Therefore, we next look at what happens if we use the same input parameters for all images. We report the best average F-measures (BAF) per algorithm (along with the corresponding precision and recall values) in Table 2. BAF is defined as

BAF(A) = \max_{p ∈ S} \frac{1}{15 \times 50} \sum_{j=1}^{15} \sum_{i=1}^{50} F(I_i^j, A(I^j, p)_i),

where I^j denotes the j-th image in the dataset.
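The two scores can be sketched as follows, assuming a precomputed array of per-patch F-measures indexed by image, patch and parameter setting:

import numpy as np

def bafpi(F):
    # F has shape (n_images, n_patches, n_params).
    # Best average F-measure per image: maximize over parameters separately per image.
    return F.mean(axis=1).max(axis=1)          # shape (n_images,)

def baf(F):
    # Best average F-measure with one parameter setting shared by all images.
    return F.mean(axis=(0, 1)).max()           # scalar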
The results in Table 2 show that our algorithm outperforms the others even when the input parameters are fixed and the same for all images. In fact, the discrepancy between the performance of our method and that of mean-shift is larger in the BAF results (0.70 vs. 0.62) than in the BAFPI results (0.74 vs. 0.70), which suggests that mean-shift's input parameters are more image-dependent. Finally, we give some of the segmentation results (in Fig. 7) obtained with the best parameters found by the BAF measure. (The parameters are: graph-based algorithm: σ = 0.5, k = 1000, a = 10; mean-shift: σ_s = 5, σ_r = 6, a = 10; our method: σ = 18.)
Table 2. Best average F-measure (BAF) results

Method        Precision  Recall  F-measure (BAF)
Graph-based      0.56     0.81        0.57
Mean-shift       0.61     0.87        0.62
Our method       0.67     0.87        0.70
Fig. 7. Segmentation results. First column: input image, second column: output of graph-based method, third column: output of mean-shift based method, fourth column: our method. See text for details on input parameters.
In the supplementary material of the paper, we provide a graphical user interface to browse the segmentation trees of images. Finally, we want to note that our algorithm is not designed for texture segmentation. If run on texture images, it would only segment the locally homogeneous regions, corresponding to various levels of detail of the texture.
4 Conclusions

We presented a new algorithm for low-level multiscale image segmentation. The algorithm is capable of detecting low-level image structures having arbitrary shapes, at
arbitrary homogeneity levels. A low-level image structure, or region, is defined as a connected set of pixels surrounded by ramp discontinuities. To detect regions, the image is converted to a ramp magnitude map; we name this conversion the ramp transform. Then we find the basins of the ramp magnitude map, and consequently obtain region seeds. These seeds are grown by a relaxation labeling procedure to get the final segmentation. After this, we obtain a multi-photometric-scale segmentation by doing a multiscale analysis over a range of scales where, at each scale, boundary fragments having less contrast than the current scale are removed. This process guarantees a strict hierarchy. Using this property, we arrange the regions into a tree data structure. For the empirical study, we created a new low-level segmentation dataset and showed that our algorithm outperforms two widely used segmentation algorithms.

Acknowledgments. The support of the Office of Naval Research under grant N00014-09-1-0017 and the National Science Foundation under grant IIS 08-12188 is gratefully acknowledged.
References

1. Ahuja, N.: A transform for multiscale image segmentation by integrated edge and region detection. IEEE Trans. Pattern Anal. Mach. Intell. 18(12), 1211–1235 (1996)
2. Arora, H., Ahuja, N.: Analysis of ramp discontinuity model for multiscale image segmentation. In: ICPR 2006: Int'l Conf. on Pattern Recog., pp. 99–103 (2006)
3. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
4. Comaniciu, D., Ramesh, V., Meer, P.: The variable bandwidth mean shift and data-driven scale selection. In: ICCV, pp. 438–445 (2001)
5. Elder, J.H., Zucker, S.W.: Local scale control for edge detection and blur estimation. IEEE Trans. Pattern Anal. Mach. Intell. 20(7), 699–716 (1998)
6. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vision 59(2), 167–181 (2004)
7. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(11), 721–741 (1984)
8. Haralick, R.M., Shapiro, L.G.: Survey: Image segmentation techniques. Computer Vision, Graphics and Image Processing 29, 100–132 (1985)
9. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int'l Conf. Computer Vision, July 2001, vol. 2, pp. 416–423 (2001)
10. Martin, D.R., Fowlkes, C.C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26(5), 530–549 (2004)
11. Meyer, F., Beucher, S.: Morphological segmentation. J. Vis. Comm. Image Represent., 21–46 (1990)
12. Rosenfeld, A., Hummel, R.A., Zucker, S.W.: Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics 6(6), 420–433 (1976)
13. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
14. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. IEEE Transactions on Image Processing 7(3), 359–369 (1998)
15. Zabih, R., Kolmogorov, V.: Spatially coherent clustering using graph cuts. In: CVPR 2004: Int'l Conf. on Computer Vision and Pattern Recog., pp. 437–444 (2004)
Natural Image Segmentation with Adaptive Texture and Boundary Encoding

Shankar R. Rao 1, Hossein Mobahi 1, Allen Y. Yang 2, S. Shankar Sastry 2, and Yi Ma 1,3

1 Coordinated Science Laboratory, University of Illinois at Urbana-Champaign {srrao,hmobahi2,yima}@illinois.edu
2 EECS Department, University of California, Berkeley {yang,sastry}@eecs.berkeley.edu
3 Visual Computing Group, Microsoft Research Asia, Beijing, China
Abstract. We present a novel algorithm for unsupervised segmentation of natural images that harnesses the principle of minimum description length (MDL). Our method is based on observations that a homogeneously textured region of a natural image can be well modeled by a Gaussian distribution and the region boundary can be effectively coded by an adaptive chain code. The optimal segmentation of an image is the one that gives the shortest coding length for encoding all textures and boundaries in the image, and is obtained via an agglomerative clustering process applied to a hierarchy of decreasing window sizes. The optimal segmentation also provides an accurate estimate of the overall coding length and hence the true entropy of the image. Our algorithm achieves state-of-the-art results on the Berkeley Segmentation Dataset compared to other popular methods.
1 Introduction
The task of partitioning a natural image into regions with homogeneous texture, commonly referred to as image segmentation, is often a crucial first step for high-level image understanding, significantly reducing the complexity of content analysis of images. Image segmentation and its higher-level applications are largely designed to emulate functionalities of human visual perception (e.g., object recognition and scene understanding), and hence dominant criteria for measuring segmentation performance are based on qualitative and quantitative comparisons with human segmentation results. In the literature, investigators have explored several important models and principles that can lead to good image segmentation: 1. Different texture regions of a natural image admit a mixture model [1]. For example, Normalized Cuts (NC) [2] and F&H [3] formulate the segmentation as a graph-cut problem, while Mean Shift (MS) [4] seeks a partition of a color image based on different modes within the estimated empirical distribution.
This work is partially supported by NSF CAREER IIS-0347456, ONR YIP N00014-05-1-0633, and ARO MURI W911NF-06-1-0076.
2. Region contours/edges convey important information about the saliency of the objects in the image and their shapes [5,6,7]. Several recent methods have been proposed to combine the cues of homogeneous color and texture with the cue of contours in the segmentation process [8,9].
3. The properties of local features (including texture and edges) usually do not share the same level of homogeneity at the same spatial scale. Thus, salient image regions can only be extracted from a hierarchy of image features under multiple resolutions [10,11,12].

Despite much work in this area, good image segmentation remains elusive for practitioners, mainly for the following two reasons:

1. There is little consensus on what criteria should be used to evaluate the quality of image segmentations. It is difficult to strike a good balance between objective measures that depend solely on the intrinsic statistics of imagery data and subjective measures that try to empirically model human perception.
2. In the search for objective measures, there has been a lack of consensus on good models for a unified representation of image segments including both their textures and contours.

Recently, an objective metric based on the notion of lossy minimum description length (MDL) has been proposed for evaluating clustering of general mixed data [13]. The basic idea is that, given a potentially mixed data set, the "optimal segmentation" is the one that, over all possible segmentations, minimizes the coding length of the data, subject to a given quantization error. For data drawn from a mixture of Gaussians, the optimal segmentation can often be found efficiently using an agglomerative clustering approach. The MDL principle and the new clustering method were later applied to the segmentation of natural images, known as compression-based texture merging (CTM) [12]. The preliminary success of this approach leads to the following important question: To what extent is segmentation obtained by image compression consistent with human perception?

However, although the CTM method utilizes the idea of data compression, it does not exactly seek to compress the image per se. First, it "compresses" feature vectors or windows extracted around all pixels by grouping them into clusters as a mixture of Gaussian models. As a result, the final coding length is highly redundant due to severe overlap between windows of adjacent pixels, and has no direct relation to the true entropy of the image. Second, the segmentation result encodes the membership of pixels using a Huffman code that does not take into account the spatial adjacency of pixels or the smoothness of boundaries. Thus, CTM does not give a good estimate of the true entropy of the image, and its success cannot be used to justify a strong connection between image segmentation and image compression.

Contributions. In this paper, we contend that much better segmentation results can be obtained if we follow the principle of image compression more closely, by correctly counting only the necessary bits needed to encode a natural image for both the texture and the boundaries. The proposed algorithm precisely estimates the coding length needed to encode the texture of each region based on the rate distortion of its probabilistic distribution and the number of non-overlapping windows inside. In order to adapt to the different scales and shapes of texture regions in the image, a hierarchy of multiple window sizes is incorporated in the
segmentation process. The algorithm further encodes the boundary information of each homogeneous texture region by carefully counting the number of bits needed to encode the boundary with an adaptive chain code. Based on the MDL principle, the optimal segmentation of an image is defined as the one that minimizes its total coding length, in this case a close approximation to the true entropy of the image. With any fixed quantization, the final coding length gives a purely objective measure of how good the segmentation is in terms of the level of image compression.

We conduct extensive experiments to compare the results with human segmentation, using the Berkeley Segmentation Dataset (BSD) [14]. Although our method is conceptually simple and the measure used is purely objective, the segmentation results match extremely well with those by humans, exceeding or competing with the best segmentation algorithms. The source code of our algorithm, as well as a detailed report of the segmentation results, is available online at http://perception.csl.illinois.edu/coding/image_segmentation/.
2 Adaptive Texture and Boundary Encoding

In this section, we present a unified information-theoretic framework to encode both the texture and boundary information of a natural image. The implementation of the algorithm for adaptive image segmentation and the experiments to validate its performance will be presented in Sections 3 and 4.

2.1 Constructing Texture Vectors
We discuss how to construct texture vectors that represent homogeneous textures in image segments. Given an image in RGB format, we convert it to the L∗a∗b∗ color space. It has been noted in the literature that such a color metric better approximates the perceptually uniform color space. In order to capture the variation of a local texton, one can directly apply a w × w cut-off window around a pixel across the three color channels, and stack the color values inside the window in a vector form as in [12].

Figure 1 (left) illustrates our process for constructing features. Let the w-neighborhood W_w(p) be the set of all pixels in a w × w window centered at pixel p. We construct a set of features X by taking the w-neighborhood around each pixel in I, and then stacking each window as a column vector:

X \doteq \{ x_p ∈ \mathbb{R}^{3w^2} : x_p = W_w(p)^S \text{ for } p ∈ I \}.     (1)
For ease of computation, we reduce the dimensionality of these features by projecting the set of all features X onto their first D principal components. We denote the set of features with reduced dimensionality as \hat{X}. We have observed that for many natural images, the first eight principal components of X contain over 99% of the energy. In this paper, we choose to assign D = 8.

Over the years, there have been many proposed methods to model the representation of image textures in natural images. One model that has been shown
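A sketch of this feature construction follows; the reflective boundary padding and the PCA via SVD are standard choices assumed here, not necessarily the authors' exact implementation.

import numpy as np

def texture_vectors(lab_image, w=7, D=8):
    # Stack w x w windows of a 3-channel L*a*b* image into 3*w*w vectors,
    # then project onto the first D principal components.
    H, W, _ = lab_image.shape
    r = w // 2
    padded = np.pad(lab_image, ((r, r), (r, r), (0, 0)), mode='reflect')
    X = np.empty((H * W, 3 * w * w))
    k = 0
    for i in range(H):
        for j in range(W):
            X[k] = padded[i:i + w, j:j + w, :].ravel()
            k += 1
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # PCA of the centered data
    return Xc @ Vt[:D].T                                # shape (H*W, D)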
Fig. 1. Left: We construct features by stacking w × w windows around all pixels of a L∗ a∗ b∗ image I into a data matrix X and then using PCA. Right: (color) KL divergence of RGB and L∗ a∗ b∗ windows from a true Gaussian distribution.
to be successful in encoding textures both empirically and theoretically is the Gaussian Mesh Markov Model (MMM) [15]. Particularly in texture synthesis, the Gaussian MMM provides consistent estimates of the joint distribution of the pixels in a window, which can then be used to fill in missing texture patches via a simple nonparametric scheme [16]. However, to determine the optimal compression rate for samples from a distribution, one must know the rate-distortion function of that distribution [12]. Unfortunately, the rate-distortion function for MMMs is, to our knowledge, not known in closed form and difficult to estimate empirically. Over all distributions with the same variance, it is known that the Gaussian distribution has the highest rate-distortion and is, in this sense, the worst-case distribution for compression. Thus, by using the rate-distortion function of a Gaussian distribution, we obtain an upper bound on the true coding length of the MMM.

In the following, we provide an empirical experiment to determine in which color space (RGB or L∗a∗b∗) feature windows from a region with homogeneous texture are better fit by a Gaussian distribution. We use as the ground truth training images from the BSD that were manually segmented by humans. Given the feature vectors within each region, we model the distribution both parametrically and non-parametrically. The parametric model Q is a multivariate normal distribution whose parameters are estimated from the samples using maximum likelihood. The non-parametric model P is obtained by kernel density estimation. If the true distribution is indeed normal, then P and Q should be very similar. Thus, the KL divergence D_KL(P ∥ Q) can be used to measure the non-Gaussianity of the distribution. The overall non-Gaussianity of each image is simply the average of the non-Gaussianity over all regions. We repeat the above procedure for the entire manually segmented image dataset and estimate the distribution of the KL divergence by kernel density estimation for the RGB and L∗a∗b∗ spaces. As Figure 1 (right) shows, between the two color metrics, L∗a∗b∗ has lower mean and standard deviation and hence is better modeled by a Gaussian distribution.

2.2 Adaptive Texture Encoding
We now describe encoding the texture vectors based on the lossy MDL principle. First, we consider a single region R with N pixels. Based on [12], for a fixed
quantization error ε, the expected number of bits needed to code the set of N feature windows \hat{X} up to distortion ε² is given by:

L_ε(\hat{X}) \doteq \frac{D}{2}\log_2\det\Big(I + \frac{D}{ε^2}Σ\Big) + \frac{N}{2}\log_2\det\Big(I + \frac{D}{ε^2}Σ\Big) + \frac{D}{2}\log_2\Big(1 + \frac{\|μ\|^2}{ε^2}\Big),     (2)
where μ and Σ are the mean and covariance of the feature windows in \hat{X}. Equation (2) is the sum of three coding lengths: the D Gaussian principal vectors as the codebook, the N windows w.r.t. that codebook, and the mean of the Gaussian distribution.

The coding length function (2) is uniquely determined by the mean and covariance (μ, Σ). To estimate them empirically, we need to exclude the windows that cross the boundary of R (as shown in Figure 2). Such windows contain textures from the adjacent regions, which cannot be modeled by a single Gaussian as well as the interior windows. Hence, the empirical mean \hat{μ}_w and covariance \hat{Σ}_w of R are estimated only from the interior of R:

I_w(R) \doteq \{ p ∈ R : q ∈ R, ∀ q ∈ W_w(p) \}.     (3)
Fig. 2. Only windows from the interior of a region are used to compute \hat{μ}_w and \hat{Σ}_w.
Furthermore, in (2), encoding all texture vectors in \hat{X} to represent region R is highly redundant because the N windows overlap with each other. Thus, to obtain an efficient code of R that closely approximates its true entropy, we only need to code the non-overlapping windows that can tile R as a grid. Ideally, if R is a rectangular region of size mw × nw, where m and n are positive integers, then clearly we can tile R with exactly mn = N/w² windows. So for coding the region R, (2) becomes:

L_{w,ε}(R) \doteq \Big(\frac{D}{2} + \frac{N}{2w^2}\Big) \log_2\det\Big(I + \frac{D}{ε^2}\hat{Σ}_w\Big) + \frac{D}{2}\log_2\Big(1 + \frac{\|\hat{μ}_w\|^2}{ε^2}\Big).     (4)
Real regions in natural images normally do not have such nice rectangular shapes. However, (4) remains a good approximation to the actual coding length of a region R with relatively smooth boundaries.1

1 We briefly explain why (4) is a good approximation for the entropy of a large region with a smooth boundary. In this case, the number of boundary-crossing windows is much smaller than the number of interior windows. The average coding length of the boundary-crossing windows is then roughly proportional to the number of pixels inside the region if the Gaussian distribution is sufficiently isotropic.
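Eq. (4) translates directly into a small helper; the default quantization ε and the window size are placeholders following the notation of the text.

import numpy as np

def region_coding_length(mu_w, Sigma_w, n_pixels, w=7, eps=150.0):
    # Texture coding length L_{w,eps}(R) of eq. (4) for a region with n_pixels pixels,
    # given the empirical mean and covariance of its interior w x w windows.
    D = len(mu_w)
    A = np.eye(D) + (D / eps**2) * Sigma_w
    _, logdet = np.linalg.slogdet(A)                    # natural log of det(A)
    bits_cov = (D / 2.0 + n_pixels / (2.0 * w * w)) * logdet / np.log(2.0)
    bits_mean = (D / 2.0) * np.log2(1.0 + np.dot(mu_w, mu_w) / eps**2)
    return bits_cov + bits_mean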
2.3 Adaptive Boundary Encoding
To code windows from multiple regions in an image, one must know to which region each window belongs, so that each window can be decoded w.r.t. the correct codebook. For generic samples from multiple classes, one can estimate the distribution of each class label and then code the membership of the samples using a scheme that is asymptotically optimal for that class distribution (i.e., the Huffman code used in [12]). Such coding schemes are highly inefficient for natural image segmentation, as they do not leverage the spatial correlation of pixels in the same region. In fact, for our application, pixels from the same region form a connected component. Thus, a more efficient way of coding group membership for regions in images is to code the boundary of the region containing the pixels.

A well-known scheme for representing boundaries of image regions is the Freeman chain code. In this coding scheme, the orientation of an edge is quantized along 8 discrete directions, shown in Figure 3 (left). Let {o_t}_{t=1}^{T} denote the orientations of the T boundary edges of R. Since each chain code can be encoded using three bits, the coding length of the boundary of R is

B(R) = 3 \sum_{i=0}^{7} \#(o_t = i).     (5)
The coding length B(R) can be further improved by using an adaptive Huffman code that leverages the prior distribution of the chain codes. Though the distribution of chain codes is essentially uniform in most images, for regions with smooth boundaries we expect that the orientations of consecutive edges are similar, and so consecutive chain codes will not differ by much. Given an initial orientation (expressed in chain code) o_t, the difference chain code of the following orientation o_{t+1} is Δo_t \doteq \mathrm{mod}(o_t − o_{t+1}, 8). Figure 3 compares the original Freeman chain code with the difference chain code for representing the boundary of a region. Notice that for this region, the difference encoding uses only half of the possible codes, with most being zeroes, while the Freeman encoding uses all eight chain codes.
Fig. 3. Left: The Freeman chain code of an edge orientation along 8 possible directions. Middle: Representation of the boundary of a region in an image w.r.t. the Freeman chain code. Right: Representation w.r.t. the difference chain code.
Table 1. The prior probability of the difference chain codes estimated from the BSD and by Liu and Zalik [17]

Difference Code            0      1      2      3      4      5      6      7
Angle change              0°     45°    90°   135°   180°  −135°   −90°   −45°
Probability (BSD)       0.585  0.190  0.020  0.000  0.002  0.003  0.031  0.169
Probability (Liu-Zalik) 0.453  0.244  0.022  0.006  0.003  0.006  0.022  0.244
Given the prior distribution P[Δo] of difference chain codes, B(R) can be encoded more efficiently using a lossless Huffman coding scheme:

B(R) = − \sum_{i=0}^{7} \#(Δo_t = i) \log_2 P[Δo = i].     (6)
For natural images, we estimate P [Δo] using natural images from the BSD that were manually segmented by humans. We compare our distribution with one estimated by Liu and Zalik [17], who used 1000 images of curves, contour patterns and shapes obtained from the web. As seen in Table 1, over 58% of the difference chain codes obtained from manual segmentation of the BSD are zeroes, corresponding to no change in angle along the boundary. Thus, regions of natural images tend to have smoother boundaries when segmented by humans.
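A small sketch of the boundary coding length of eqs. (5)-(6); the prior table is the BSD row of Table 1, the input is assumed to be a list of Freeman orientations, and the small constant guards the zero-probability code.

import numpy as np

# Prior over difference chain codes estimated from the BSD (Table 1).
P_DIFF = np.array([0.585, 0.190, 0.020, 0.000, 0.002, 0.003, 0.031, 0.169])

def boundary_coding_length(freeman_codes, prior=P_DIFF, eps=1e-12):
    # Bits needed to encode a region boundary given as Freeman chain codes,
    # using difference codes and an (idealized) Huffman code, cf. eq. (6).
    o = np.asarray(freeman_codes, dtype=int)
    diff = np.mod(o[:-1] - o[1:], 8)              # difference chain codes
    counts = np.bincount(diff, minlength=8)
    return float(-(counts * np.log2(prior + eps)).sum())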
3 Image Segmentation Algorithm

In this section, we show how to use the coding length functions we developed in Section 2 to construct a better compression-based image segmentation algorithm. We describe the basic approach below, and then propose a hierarchical scheme to deal with small and/or thin regions.

3.1 Segmentation by Minimizing Coding Length
Suppose an image I can be segmented into non-overlapping regions R = {R_1, ..., R_k}, with ∪_{i=1}^{k} R_i = I. The total coding length of the image I is

L^S_{w,ε}(R) \doteq \sum_{i=1}^{k} \Big[ L_{w,ε}(R_i) + \frac{1}{2} B(R_i) \Big].     (7)
Here, the boundary term is scaled by one half because we only need to represent the boundary between any two regions once. The optimal segmentation of I is the one that minimizes (7). Finding this optimal segmentation is, in general, a combinatorial task, but we can often do so using an agglomerative process. Similar to [12], we initialize the optimization process with an oversegmentation of the image into superpixels.2 A superpixel is a

2 One could assume each pixel (and its windowed texture vector) belongs to a group of its own. However, in this case, texture windows of any size greater than one will (initially) intersect other adjacent regions (i.e., other neighboring pixels).
small region in the image that does not contain strong edges in its interior. Superpixels provide a coarser quantization of an image than the underlying pixels, while respecting strong edges between the adjacent homogeneous regions. There are several methods that can be used to obtain a superpixel initialization, including those of Mori et al. [18], Felzenszwalb and Huttenlocher [3], and Ren et al. [11]. We have compared the three methods in the experiment and found that [18] works well for our purposes.

Given an oversegmentation of the image, at each iteration we find the pair of regions R_i and R_j that will maximally decrease (7) if merged:

(R_i^*, R_j^*) = \arg\max_{R_i, R_j ∈ R} ΔL_{w,ε}(R_i, R_j),

where

ΔL_{w,ε}(R_i, R_j) \doteq L^S_{w,ε}(R) − L^S_{w,ε}((R \ {R_i, R_j}) ∪ {R_i ∪ R_j})
                     = L_{w,ε}(R_i) + L_{w,ε}(R_j) − L_{w,ε}(R_i ∪ R_j) + \frac{1}{2}\big(B(R_i) + B(R_j) − B(R_i ∪ R_j)\big).     (8)

ΔL_{w,ε}(R_i, R_j) essentially captures the difference in the lossy coding lengths of the texture regions R_i and R_j and their boundaries before and after the merging. If ΔL > 0, we merge R_i^* and R_j^* into one region, and repeat this process, continuing until the coding length L^S_{w,ε}(R) cannot be further reduced.

To model the spatial locality of textures, we further construct a region adjacency graph (RAG) G = (V, E). Each vertex v_i ∈ V corresponds to region R_i ∈ R, and the edge e_ij ∈ E is present if and only if regions R_i and R_j are adjacent in the image. To perform image segmentation, we simply apply a constrained version of the above agglomerative procedure, only merging regions that are adjacent in the image.
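The greedy merging loop could be sketched as follows; `coding_length` and `boundary_bits` stand for eqs. (4) and (6), and the region and adjacency-graph bookkeeping is schematic rather than the authors' implementation.

def agglomerative_merge(regions, rag, coding_length, boundary_bits):
    # Greedily merge adjacent regions while the total coding length decreases.
    # `regions` maps ids to region objects supporting union(); `rag` is a set of
    # adjacent id pairs.
    def delta_L(a, b):
        merged = regions[a].union(regions[b])
        return (coding_length(regions[a]) + coding_length(regions[b])
                - coding_length(merged)
                + 0.5 * (boundary_bits(regions[a]) + boundary_bits(regions[b])
                         - boundary_bits(merged)))
    while True:
        best = max(rag, key=lambda pair: delta_L(*pair), default=None)
        if best is None or delta_L(*best) <= 0:
            break
        a, b = best
        regions[a] = regions[a].union(regions[b])   # merge b into a
        del regions[b]
        rag = {(a if x == b else x, a if y == b else y) for (x, y) in rag
               if {x, y} != {a, b}}                  # update adjacency
    return regions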
3.2 A Hierarchical Implementation
The above region-merging scheme is based on the assumption of a fixed texture window size, and clearly cannot effectively deal with regions or superpixels that are very small and/or thin. In such cases, the majority or all of the texture windows will intersect with the boundary of the regions. We say that a region R is degenerate w.r.t. window size w if I_w(R) = ∅. For such a region, the w-neighborhoods of all pixels will contain pixels from other regions, and so \hat{μ} and \hat{Σ} cannot be reliably estimated. These regions are degenerate precisely because of the window size; for any w-degenerate region R, there is a window size 1 ≤ w' < w such that I_{w'}(R) ≠ ∅. We say that R is marginally nondegenerate w.r.t. window size w if I_w(R) ≠ ∅ and I_{w+2}(R) = ∅.

To deal with these degenerate regions, we propose to use a hierarchy of window sizes. Starting from the largest window size, we recursively apply the above scheme with ever smaller window sizes until all degenerate regions have been merged with their adjacent ones. In this paper, we start from 7 × 7 and reduce to 5 × 5, 3 × 3, and 1 × 1. For segmentation at smaller window sizes, our scheme only allows adjacent regions to be merged if at least one of the regions is marginally nondegenerate. Notice that at a fixed window size, the region-merging process is similar to the CTM approach proposed in [12]. Nevertheless, the new coding length function
and hierarchical implementation give a much more accurate approximation to the true image entropy and hence lead to much better segmentation results (see Section 4). We refer to our overall algorithm as Texture and Boundary Encoding-based Segmentation (TBES).3 On a Quad-Core Intel Xeon 2.5 GHz machine, the superpixel initialization using the method of [18] takes roughly five minutes, and our MATLAB implementation of TBES takes about ten minutes per image.
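As a rough outline, the hierarchical pass of Section 3.2 amounts to repeating the merge step over decreasing window sizes; this sketch assumes the merge routine above, a `builders(w)` helper returning the coding-length functions for window size w, and omits the marginal-nondegeneracy constraint for brevity.

def hierarchical_segmentation(regions, rag, builders, window_sizes=(7, 5, 3, 1)):
    # Run the agglomerative merging once per window size, largest first.
    for w in window_sizes:
        coding_length, boundary_bits = builders(w)
        regions = agglomerative_merge(regions, rag, coding_length, boundary_bits)
    return regions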
4 Experiments
In this section, we conduct an extensive evaluation to validate the performance of our method. We first describe our experimental setup, and then show both qualitative and quantitative results on images from the publicly available Berkeley Segmentation Dataset (BSD).4 To obtain a quantitative evaluation of the performance of our method, we use three metrics for comparing pairs of image segmentations: the probabilistic Rand index (PRI) [20], the variation of information (VOI) [21], and the precision and recall of boundary pixels [6].5 In cases where we have multiple ground-truth segmentations, to compute a given metric for a test segmentation, we simply average the results of the metric between the test segmentation and each ground-truth segmentation. With multiple ground-truth segmentations for an image, we can also estimate the human performance w.r.t. these metrics by treating each ground-truth segmentation as a test segmentation and computing the metrics w.r.t. the other ground-truth segmentations.
Fig. 4. (color) Results of TBES on two images for varying choices of ε: (a) originals, (b) ε = 50, (c) ε = 150, (d) ε = 400.
To apply TBES to a natural image, we must choose the quantization level ε. As Figure 4 shows, a given image can have multiple plausible segmentations. We seek to find the ε∗ that tends to best match human segmentation. To determine ε∗, we run TBES on each of the 100 test images in the BSD for sixteen choices of ε ranging from ε = 25 to ε = 400. We then choose the ε that obtains the best average performance w.r.t. the various metrics. In our experiments, we found that the choice of ε∗ = 150 results in the best balance between the PRI and VOI metrics, so for all our subsequent experiments we use this choice of ε.
3 Detailed pseudocode for our TBES algorithm is available in our technical report [19].
4 Please refer to our website for segmentation results on all images of the BSD, as well as additional experiments on the MSRC Object Recognition Database.
5 We use the harmonic mean of precision and recall, known as the global F-measure, as a useful summary score for boundary precision and recall.
Table 2. Comparison of PRI and VOI for various algorithms on the BSD

Index / Method           Human  TBES (ε = 150)  CTM   MS    NC    UCM   F&H
PRI (Higher is better)    0.87       0.80       0.76  0.78  0.75  0.77  0.77
VOI (Lower is better)     1.16       1.76       2.02  1.83  2.18  2.11  2.15
Fig. 5. (color) Precision vs. Recall of boundaries on the BSD. The green lines are level sets of the F-measure, ranging from 0.1 (lower left) to 0.9 (upper right). Our method (TBES) is closest to human performance (brown dot), achieving an F-measure of 0.645.
The Berkeley Segmentation Dataset consists of 300 natural images, each of which has been hand segmented by multiple human subjects. Figure 6 illustrates some representative segmentation results. We compare the performance of our method to five publicly available image segmentation methods, which we refer to as “CTM” [12], “MS” [4], “NC” [2], “UCM” [6], and “F&H” [3], respectively. Table 2 and Figure 5 summarize the performance of our method based on the various metrics for the BSD: the indices PRI and VOI in Table 2 are used to evaluate goodness of the regions; and the precision-recall curve and F-measure in Figure 5 evaluate the segmentation boundaries. Notice that in Table 2, for both indices, our method achieves the best performance compared to all popular segmentation methods. It is also surprising that it does so with a fixed ε whereas CTM needs to rely on a heuristic adaptive scheme. For our segmentation results, if we could choose the best ε adaptively for each image to optimize the PRI index, the average PRI over the entire database would become 0.849; similarly for the VOI index, using the best ε for each image brings this index down to 1.466, both strikingly close to that of human segmentation. This suggests there is still plenty of room to improve our method by designing better schemes for choosing ε adaptively.
Fig. 6. (color) Qualitative results of our algorithm on various kinds of images from BSD with a fixed ε = 150: (a) animals, (b) buildings, (c) homes, (d) nature, (e) people. For each result, the top is the original image, and the bottom is a segmentation image where each region is colored by its mean color.
5 Conclusion
We have proposed a novel method for natural image segmentation. The algorithm uses a principled information-theoretic approach to combine cues of image texture and boundaries. In particular, the texture and boundary information of each texture region is encoded using a Gaussian distribution and adaptive chain code, respectively. The partitioning of the image is sought to achieve the maximum lossy compression using a hierarchy of window sizes. Our experiments have validated that this purely objective and simple criterion achieves state-of-the-art segmentation results on a publicly available image database, both qualitatively and quantitatively.
References 1. Leclerc, Y.: Constructing Simple Stable Descriptions for Image Partitioning. IJCV 3, 73–102 (1989) 2. Shi, J., Malik, J.: Normalized cuts and image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR (1997) 3. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International Journal of Computer Vision (IJCV) 59(2), 167–181 (2004) 4. Comanicu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24, 603–619 (2002) 5. Fua, P., Hanson, A.J.: An Optimization Framework for Feature Extraction. Machine Vision and Applications 4, 59–87 (1991) 6. Arbelaez, P.: Boundary extraction in natural images using ultrametric contour maps. In: Workshop on Perceptual Organization in Computer Vision (2006)
7. Ren, X., Fowlkes, C., Malik, J.: Learning probabilistic models for contour completion in natural images. IJCV 77, 47–63 (2008) 8. Tu, Z., Zhu, S.: Image segmentation by data-driven Markov Chain Monte Carlo. PAMI 24(5), 657–673 (2002) 9. Kim, J., Fisher, J., Yezzi, A., Cetin, M., Willsky, A.: A nonparametric statistical method for image segmentation using information theory and curve evolution. PAMI 14(10), 1486–1502 (2005) 10. Yu, S.: Segmentation induced by scale invariance. In: CVPR (2005) 11. Ren, X., Fowlkes, C., Malik, J.: Scale-invariant contour completion using condition random fields. In: ICCV (2005) 12. Yang, A., Wright, J., Ma, Y., Sastry, S.: Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding 110(2), 212–225 (2008) 13. Ma, Y., Derksen, H., Hong, W., Wright, J.: Segmentation of multivariate mixed data via lossy coding and compression. PAMI 29(9), 1546–1562 (2007) 14. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In: ICCV (2001) 15. Levina, E., Bickel, P.J.: Texture synthesis and non-parametric resampling of random fields. Annals of Statistics 34(4), 1751–1773 (2006) 16. Efros, A., Leung, T.: Texture synthesis by non-parametric sampling. In: ICCV (1999) 17. Liu, Y.K., Zalik, B.: Efficient chain code with Huffman coding. Pattern Recognition 38(4), 553–557 (2005) 18. Mori, G., Ren, X., Efros, A., Malik, J.: Recovering human body configurations: combining segmentation and recognition. In: CVPR (2004) 19. Rao, S., Mobahi, H., Yang, A., Sastry, S., Ma, Y.: Natural image segmentation with adaptive texture and boundary encoding. Technical Report UILU-ENG-092211 DC-244, UIUC (2009) 20. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971) 21. Meila, M.: Comparing clusterings: An axiomatic view. In: Proceedings of the International Conference on Machine Learning (2005)
Gradient Vector Flow over Manifold for Active Contours
Shaopei Lu and Yuanquan Wang
Tianjin Key Lab of Intelligent Computing and Novel Software Technology, Tianjin University of Technology, Tianjin 300191, P.R. China
Abstract. The gradient vector flow (GVF) snake shows high performance at concavity convergence and initialization insensitivity, but the two components of the GVF field are treated in isolation during diffusion; this leads to the failure of the GVF snake at weak edge preserving and at deep and narrow concavity convergence. In this study, a novel external force for active contours named gradient vector flow over manifold (GVFOM) is proposed that couples the two components during diffusion by generalizing the Laplacian operator from flat space to a manifold; the specific operator is the Beltrami operator. The proposed GVFOM snake has been assessed on synthetic and real images; experimental results show that the GVFOM snake behaves similarly to the GVF snake in terms of capture range enlargement and initialization insensitivity, while providing much better results than the GVF snake for weak edge preserving, object separation, and narrow and deep concavity convergence. Keywords: image segmentation, active contour, gradient vector flow, manifold, Beltrami operator, gradient vector flow over manifold.
1 Introduction
Active contours, or snakes, proposed by Kass [1] in 1988, are curves defined within an image domain that move under the influence of internal and external forces. They have been one of the most influential ideas in computer vision. Broadly speaking, active contours can be classified into two categories: parametric active contours [1] and geometric active contours [2,3]. In this study, we focus on parametric active contours, although the proposed approach can also be integrated into geometric active contours as in [3]. Active contours are powerful tools for image segmentation and motion tracking, and a large number of variations have been proposed and applied. Since the external force drives the active contour toward objects, it plays a leading role during evolution, and many novel methods for the external force have been proposed. Among all these methods, gradient vector flow (GVF), proposed by Xu and Prince [4,5], has attracted the attention of many researchers. The GVF external force extends the gradient vector further away from the edges so as to enlarge the capture range, and simultaneously suppresses the influence of noise. Owing to these impressive properties, the GVF model has been intensively studied and widely employed
in computer vision applications; for example, Wang et al. proposed the harmonic gradient vector flow (HGVF) [6], which could converge to narrow concavities of any depth; Hassouna and Farag presented an interesting application of GVF for skeleton extraction [7]; and Ray and Acton utilized the GVF snake for cell tracking with many novelties [8,9]. It is obvious that the GVF field is one field with two components; most of the aforementioned models handle the two components in isolation. As we know, the initial values of the two components of GVF come from one edge map, so there should be some intrinsic relation between the two components. Intuitively, if the two components are coupled during diffusion, one could get much better results. Similar observations also exist for multichannel image processing. Sochen et al. [10,11] proposed a novel framework for multichannel image smoothing, under which the image is treated as a Riemannian manifold embedded in a higher dimensional manifold, and the image is defined as an embedding map between two manifolds, e.g., a two-dimensional manifold embedded in a five-dimensional Euclidean space for an RGB color image. Great success with this framework has been reported for smoothing images selectively while preserving multichannel edges [11,12]. This paper aims at reformulating the gradient vector flow under the framework proposed by Sochen et al. [10]. Under this framework, the GVF is considered as a two-channel image, and further treated as a two-dimensional manifold embedded in a four-dimensional Euclidean space. We refer to this proposed method as Gradient Vector Flow Over Manifold, GVFOM in short. The GVFOM snake outperforms the GVF snake in terms of weak edge preserving, separation of closely neighboring objects, and convergence to narrow and deep concavities, while maintaining the desirable properties of the GVF snake, such as large capture range, U-shape concavity convergence, initialization insensitivity and noise robustness. Note that part of the preliminary results of this work have been reported in an abstract [13]. The remainder of this paper is organized as follows: In the next section, the snake model and GVF are briefly reviewed. We elaborate on the proposed GVFOM method in Section 3. The experimental results and demonstrations are presented in Section 4 and the conclusion is drawn in Section 5.
2 Brief Review of GVF Snake
A snake is defined as a curve c(s) = (x(s), y(s)), s ∈ [0, 1], that moves through the spatial domain of an image to minimize the following energy functional [1]:

E_snake = ∫₀¹ [ ½ (α|c'(s)|² + β|c''(s)|²) + E_ext(c) ] ds ,    (1)

where c'(s) and c''(s) are the first and second derivatives of c(s) with respect to s, and α and β are weighting parameters that control the curve's tension and rigidity, respectively. The external energy E_ext is derived from the image data and takes its smallest values at the features of interest, such as boundaries. The typical external energy for a gray-value image is E_ext = −|∇G_σ ⊗ I|², where G_σ is the Gaussian kernel of standard deviation σ. By calculus of variations, the Euler equation that minimizes E_snake is

αc''(s) − βc''''(s) − ∇E_ext = 0 .    (2)

This can be considered as a force balance equation:

F_int + F_ext = 0 ,    (3)

where F_int = αc''(s) − βc''''(s) and F_ext = −∇E_ext. The internal force F_int makes the curve smooth, while the external force F_ext draws the curve to the desired features of the image. To solve the problem of the limited capture range of the traditional external force, Xu and Prince [4,5] proposed a new one called gradient vector flow (GVF). The GVF is a vector field v(x, y) = (u(x, y), v(x, y)), which is derived by minimizing the following functional:

E_GVF = ∫∫ μ|∇v|² + |∇f|² |v − ∇f|² dx dy ,    (4)

where f is the edge map of an image and μ is a regularization parameter governing the tradeoff between the first and second terms in the integrand of (4). Using the calculus of variations, the corresponding Euler equations that minimize (4) are

μ∇²u − (u − f_x)(f_x² + f_y²) = 0
μ∇²v − (v − f_y)(f_x² + f_y²) = 0 ,    (5)

where ∇² is the Laplacian operator.
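In practice the GVF field of (5) is obtained by gradient-descent iterations of the associated diffusion equations, as in Xu and Prince's formulation. The following NumPy sketch shows a minimal explicit scheme under the assumption of a normalized gray-level edge map f; the step size, iteration count, and wrap-around boundary handling are illustrative simplifications, not the authors' settings.

```python
import numpy as np

def gradient_vector_flow(f, mu=0.15, iters=200, dt=1.0):
    """Iterate mu*Lap(u) - (u - fx)*(fx^2 + fy^2) = 0 (and the analogous v equation)."""
    fy, fx = np.gradient(f)          # np.gradient returns d/drow, d/dcol
    mag2 = fx ** 2 + fy ** 2
    u, v = fx.copy(), fy.copy()      # initialize the field with the edge-map gradient
    for _ in range(iters):
        # Five-point Laplacian with wrap-around boundaries (a simplification)
        lap_u = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
        lap_v = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4.0 * v)
        u = u + dt * (mu * lap_u - (u - fx) * mag2)
        v = v + dt * (mu * lap_v - (v - fy) * mag2)
    return u, v
```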
3 Gradient Vector Flow over Manifold
Although the GVF external force shows high performance at capture range enlargement and initialization insensitivity, and even at U-shape concavity convergence, the GVF snake still fails to converge to narrow and deep concavities and can leak out around weak edges, especially when they are neighbored by strong ones. To deal with these problems, we propose GVFOM based on the framework of Sochen et al. [10], under which the GVFOM vector is considered as a two-dimensional manifold embedded in a four-dimensional Euclidean space. In this way, the two components of GVFOM are coupled during diffusion, and the GVFOM snake can dive into narrow and deep concavities and preserve weak edges.

3.1 Manifold Framework
Our description of the manifold framework follows the work of Sochen et al. [10]. Suppose there is an n-dimensional manifold Σ with coordinates σ¹, σ², ..., σⁿ embedded in an m-dimensional manifold M with coordinates X¹, X², ..., Xᵐ, where m > n. The embedding map X : Σ → M is given explicitly by m functions of n variables:

X : (σ¹, σ², ..., σⁿ) → (X¹(σ¹, σ², ..., σⁿ), X²(σ¹, σ², ..., σⁿ), ..., Xᵐ(σ¹, σ², ..., σⁿ)) .    (6)

The metric g_ξν on Σ is the so-called pullback of the metric h_ij on M and is given explicitly as follows:

g_ξν = h_ij ∂_ξ X^i ∂_ν X^j ,    (7)

where the Einstein summation convention is employed; identical indices that appear one up and one down are summed over. The map then has the following weight (the Polyakov action functional [14]):

S[X^i, g_ξν, h_ij] = ∫ d^m σ √g g^ξν ∂_ξ X^i ∂_ν X^j h_ij ,    (8)

where m is the dimension of Σ, g is the determinant of the metric matrix (g_ξν), and g^ξν is the inverse of the metric, so that g^ξν g_νγ = δ^ξ_γ, where δ^ξ_γ is the Kronecker delta. Using standard methods in variational calculus, the Euler-Lagrange equations with respect to the embedding are

− (1 / (2√g)) δS/δX^l = (1/√g) ∂_ξ(√g g^ξν ∂_ν X^i) h_il + (1/√g) Γ^i_jk ∂_ξ X^j ∂_ν X^k √g g^ξν h_il ,    (9)

where the Levi-Civita connection coefficients Γ^i_jk are zero when the embedding is in a Euclidean space with a Cartesian coordinate system. The operator acting on X^i in the first term of (9) is the natural generalization of the Laplacian from flat spaces to manifolds; it is called the second order differential parameter of Beltrami [15], or in short the Beltrami operator, and we denote it by Δ_g. Owing to the metric g_ξν of the manifold, the Beltrami operator couples all the channels of the vector image.
3.2 Gradient Vector Flow over Manifold
Following the manifold framework, the GVFOM vector (u(x, y), v(x, y)) is treated as a two-dimensional manifold embedded in a four-dimensional Euclidean space. Denote the manifold by Σ and the Euclidean space by M; the embedding map X : Σ → M is given as follows:

X : (σ¹, σ²) → (X¹(σ¹, σ²) = σ¹, X²(σ¹, σ²) = σ², X³(σ¹, σ²) = u(σ¹, σ²), X⁴(σ¹, σ²) = v(σ¹, σ²)) .    (10)

If we denote σ¹ = x, σ² = y, it can be written as (x, y, u, v). We take the metric of the embedding space to be

H = diag(1, 1, λ², λ²) ,    (11)

where λ is a relative scale of u and v with respect to the spatial coordinates x and y. Generally, λ = 1 and the matrix is a fourth-order identity matrix. The elements of the metric matrix of the manifold are then obtained as

g_ξν = h_ij (∂X^i/∂σ^ξ)(∂X^j/∂σ^ν) = h_ij ∂_ξ X^i ∂_ν X^j ;  ξ, ν = 1, 2;  i, j = 1, ..., 4 .    (12)

We rewrite (12) in matrix form:

G = | 1 + λ²u_x² + λ²v_x²      λ²u_x u_y + λ²v_x v_y |
    | λ²u_x u_y + λ²v_x v_y    1 + λ²u_y² + λ²v_y²   | .    (13)

Therefore, the weight of the map (10) is given by

S = ∫ d²σ √(1 + λ²|∇u|² + λ²|∇v|² + λ⁴(u_x v_y − u_y v_x)²) .    (14)
Note that the term u_x v_y − u_y v_x in (14) is a coupling term and plays an important role in the proposed GVFOM method, for it couples the two components of the vector field. The coefficient of the coupling term is the fourth power of λ, while the power of the other coefficients is only 2; therefore, if the value of λ is larger than 1, the effect of the coupling term is emphasized more than the others. Using standard methods in variational calculus to minimize the functional (14) with respect to u and v, we obtain the Beltrami operator acting on u and v, respectively, as follows:

Δ_g u = − (1 / (2λ²√g)) δS/δu = (1/√g) ∂_ξ(√g g^ξν ∂_ν u)
Δ_g v = − (1 / (2λ²√g)) δS/δv = (1/√g) ∂_ξ(√g g^ξν ∂_ν v) .    (15)

By replacing the Laplacian in (5) with Δ_g, the diffusion equations for GVFOM read

μΔ_g u − (u − f_x)(f_x² + f_y²) = 0
μΔ_g v − (v − f_y)(f_x² + f_y²) = 0 .    (16)
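To make the coupled diffusion of (15)-(16) concrete, the sketch below discretizes the Beltrami operator induced by the metric (13) with simple finite differences and iterates the GVFOM equations explicitly. This is a schematic reading of the equations above, not the authors' implementation; the discretization, boundary handling, default parameters, and stability settings are assumptions.

```python
import numpy as np

def beltrami_operator(u, v, w, lam):
    """Apply the Beltrami operator induced by the metric G of (13) to channel w."""
    uy, ux = np.gradient(u)
    vy, vx = np.gradient(v)
    g11 = 1.0 + lam**2 * (ux**2 + vx**2)
    g22 = 1.0 + lam**2 * (uy**2 + vy**2)
    g12 = lam**2 * (ux * uy + vx * vy)
    g = g11 * g22 - g12**2                     # determinant of the metric G
    sqrt_g = np.sqrt(g)
    inv11, inv12, inv22 = g22 / g, -g12 / g, g11 / g
    wy, wx = np.gradient(w)
    # Flux components sqrt(g) * g^{xi nu} * d_nu(w), then divergence / sqrt(g)
    flux_x = sqrt_g * (inv11 * wx + inv12 * wy)
    flux_y = sqrt_g * (inv12 * wx + inv22 * wy)
    div = np.gradient(flux_x, axis=1) + np.gradient(flux_y, axis=0)
    return div / sqrt_g

def gvfom(f, mu=0.15, lam=10.0, iters=200, dt=1.0):
    """Diffuse the edge-map gradient with the coupled Beltrami operator, eq. (16)."""
    fy, fx = np.gradient(f)
    mag2 = fx**2 + fy**2
    u, v = fx.copy(), fy.copy()
    for _ in range(iters):
        u = u + dt * (mu * beltrami_operator(u, v, u, lam) - (u - fx) * mag2)
        v = v + dt * (mu * beltrami_operator(u, v, v, lam) - (v - fy) * mag2)
    return u, v
```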
4 Experimental Results and Analysis
In this section, we demonstrate some desirable properties of the GVFOM snake and compare the performance of the GVF snake and the GVFOM snake, with particular emphasis on narrow and deep concavity convergence and weak edge preserving. The parameters for all snakes in our experiments are α = 0.1, β = 0.1, and time step τ = 1. The regularization parameter μ for GVF and GVFOM is identically set to 0.15.
4.1 Capture Range, Concavity Convergence, and Initialization Insensitivity
We utilize the U-shape and room images, which are employed in [4], to verify the properties of the GVFOM snake in capture range, concavity convergence and initialization insensitivity. Fig. 1 shows the results on the U-shape image with a set of far-off initializations. It can be seen from this experiment that the GVFOM snakes are able to converge to concave boundaries. Fig. 2 demonstrates the results on the room image with a set of initializations placed across the boundaries; the GVFOM snakes converge correctly to the objects and even stick to the subject contours.
Fig. 1. Convergence of GVFOM snakes with (a) λ = 1, (b) λ = 10, (c) λ = 20
Fig. 2. Convergence of GVFOM snakes with (a) λ = 1, (b) λ = 10, (c) λ = 20
4.2 Noise Robustness
To evaluate the noise robustness of the GVFOM snake, we add impulse noise to the U-shape image, as shown in Fig. 3(a). A Gaussian filter is employed to alleviate the impact of the noise. In Fig. 3(b), the GVF snake fails to converge to the concavity, and the GVFOM snake behaves similarly with λ = 10, see Fig. 3(c). But when the value of λ is set to 20, the GVFOM snake successfully converges to the concavity. We conclude that the GVFOM snake outperforms the GVF snake in noise resistance by weighting more information on the coupling of the two components of the vector field.
Fig. 3. (a) Original image with noise. (b) Convergence of GVF snake. Convergence of GVFOM snakes with (c) λ = 10, (d) λ = 20.
4.3 Weak Edge Preserving
In Fig. 4, we construct a particular image to show the performance of the GVFOM snake at weak edge preserving. In this image, there is a gap neighbored by a strong straight line. Fig. 4(a) shows that the GVF snake leaks out and converges incorrectly to the strong line, while in Fig. 4(b) the GVFOM snake succeeds in preserving the weak edge.
Fig. 4. (a) Convergence of GVF snake. (b) Convergence of GVFOM snake with λ = 20. In each panel, the left is the evolution of the snake and the right is the force field.
4.4 Objects Separation
In this experiment, our task is to extract the gray disk in the image shown in Fig. 5; there are just three pixels between the disk and the rectangle. In order to generate a capture range large enough that a small circle can be used as the initialization, the amount of diffusion should be large. There is thus a dilemma for the GVF and GVFOM snakes between separating the two objects and allowing a far-off initialization. Our solution is to increase the value of λ so as to weight more information on the coupling term in GVFOM. Fig. 5 shows the results of the GVF snake and the GVFOM snake; again, the GVF snake fails while the GVFOM snake succeeds.
Fig. 5. (a) Convergence of GVF snake. (b) Convergence of GVFOM snake with λ = 25. In each panel, the left is the evolution of the snake and the right is the force field.
4.5 Narrow and Deep Concavity Convergence
It is well known that the GVF snake can converge to a U-shape concavity, but when the concavity is very narrow and deep, the GVF snake fails. In this subsection, we demonstrate the success of the GVFOM snake on this issue. Fig. 6 shows a concavity of 3-pixel width and 30-pixel depth. When λ is set to 8 or a larger value, the GVFOM snake dives into the concavity without any barrier.
Fig. 6. (a) Convergence of GVF snake. (b) Convergence of GVFOM snake with λ = 8. In each panel, the left is the evolution of the snake and the right is the force field.
4.6 Real Image
As mentioned above, the parameter λ plays an important role in GVFOM; that is, a large λ is helpful for improving the performance of GVFOM. The parameter μ and the iteration number for (5) are also closely related to preserving weak edges, as noted before. A small μ and few iterations can preserve a weak edge, but with this configuration the noise cannot be smoothed and the capture range is not large enough. In this subsection, we employ a large λ for the real image segmentation shown in Fig. 7(a). In this experiment, μ is set to 0.1. The fourth-order PDEs [16] are employed for noise removal. The edge map is given in Fig. 7(b). Figs. 7(c) and (d) show the GVF fields at iterations 30 and 180, respectively. The results indicate that we can hardly smooth the noise in the blood pool and preserve the endocardium simultaneously with the GVF snake. For the GVFOM snake, λ = 260 is employed; the GVFOM field within the blood pool is not inferior to the GVF field at iteration 180, while the endocardium is preserved and the snake contour can locate the endocardium successfully, see Figs. 7(e) and (f).
Fig. 7. (a) Original image. (b) Edge map. GVF force field with iterations of (c) 30, (d) 180. (e) Convergence of GVFOM snake with λ = 260 and (f) the corresponding force field.
5 Conclusion
In this paper, we have introduced a novel external force for active contours, namely, gradient vector flow over manifold (GVFOM). The two components of GVFOM are treated as a two-dimensional manifold embedded in a four-dimensional Euclidean space, so that the two components are coupled during diffusion. The GVFOM snake possesses the desirable properties of the GVF snake, such as large capture range, initialization insensitivity and U-shape convergence, but behaves much better than the GVF snake in terms of weak edge preserving, object separation and narrow and deep concavity convergence. The GVFOM snake is particularly powerful for complicated images and can serve as a superior alternative to the GVF snake. Acknowledgments. This work was supported by the National Natural Science Foundation of China under grants 60602050, 60805004.
References 1. Kass, M., Witkin, A., Terzopoulos, D.: Snake: active contour models. International Journal of Computer Vision 1, 321–331 (1988) 2. Caselles, V., Catte, F., Coll, T., Dibos, F.: A Geometric Model for Active Contours in Image Processing. Numerische Mathematik 66, 1–31 (1993)
3. Paragios, N., Mellia-Gottardo, O., Ramesh, V.: Gradient vector flow fast geometric active contours. IEEE TPAMI 26, 402–407 (2004) 4. Xu, C., Prince, J.: Snakes, Shapes and gradient vector flow. IEEE TIP 7, 359–369 (1998) 5. Xu, C., Prince, J.: Generalized gradient vector flow external forces for active contours. Signal Processing 71, 131–139 (1998) 6. Wang, Y., Jia, Y., Liu, L.: Harmonic gradient vector flow external force for snake model. Electronics Letters 44, 105–106 (2008) 7. Farag, A., Hassouna, M.: Variational Curve Skeletons Using Gradient Vector Flow. In: IEEE TPAMI (2009) 8. Ray, N., Acton, S.T., Ley, K.: Tracking leukocytes in vivo with shape and size constrained active contours. IEEE TMI 21, 1222–1235 (2002) 9. Ray, N., Acton, S.T.: Motion gradient vector flow: an external force for tracking rolling leukocytes with shape and size constrained active contours. IEEE TMI 23, 1466–1478 (2004) 10. Sochen, N., Kimmel, R., Malladi, R.: A general framework for low level vision. IEEE TIP 7, 310–318 (1998) 11. Kimmel, R., Malladi, R., Sochen, N.: Images as embedded maps and minimal surfaces: movies, color, texture, and volumetric medical images. International Journal of Computer Vision 39, 111–129 (2000) 12. Sagiv, C., Sochen, N., Zeevi, Y.Y.: Integrated Active Contours for Texture segmentation. IEEE TIP 15, 1633–1646 (2006) 13. Lu, S., Wang, Y.: A Reformative Gradient Vector Flow Based on Beltrami Flow. Congress on Image and Signal Processing (2009) 14. Polyakov, A.M.: Quantum geometry of bosonic strings. Physics Letters B 103, 207–210 (1981) 15. Kreyszing, E.: Differential Geometry. Dover, New York (1991) 16. You, Y.L., Kaveh, M.: Fourth-order partial differential equations for noise removal. IEEE TIP 9, 1723–1730 (2000)
3D Motion Segmentation Using Intensity Trajectory
Hua Yang¹, Greg Welch², Jan-Michael Frahm², and Marc Pollefeys²
¹ Kitware, Inc.
² Computer Science Department, University of North Carolina at Chapel Hill
Abstract. Motion segmentation is a fundamental aspect of tracking in a scene with multiple moving objects. In this paper we present a novel approach to clustering individual image pixels associated with different 3D rigid motions. The basic idea is that the change of the intensity of a pixel can be locally approximated as a linear function of the motion of the corresponding imaged surface. To achieve appearance-based 3D motion segmentation we capture a sequence of local image samples at nearby poses, and assign for each pixel a vector that represents the intensity changes for that pixel over the sequence. We call this vector of intensity changes a pixel “intensity trajectory”. Similar to 2D feature trajectories, the intensity trajectories of pixels corresponding to the same motion span a local linear subspace. Thus the problem of motion segmentation can be cast as that of clustering local subspaces. We have tested this novel approach using some real image sequences. We present results that demonstrate the expected segmentation, even in some challenging cases.
1 Introduction
Motion segmentation has been an active research topic in recent years. Motivated by 2D motion estimation, in particular optical flow work, most of the early approaches to motion segmentation address the problem of segmenting pixels using dense 2D flow fields. For instance, Black and Anandan use robust statistics to handle discontinuities in the flow fields [1]. In layered approaches [2] [3], images are segmented into a set of layers. These methods work on image motion and cannot be extended to accommodate 3D motion. Common approaches to 3D motion segmentation are feature-based. They usually aim at clustering feature points according to their underlying motion. Early work includes applying robust statistical methods like RANSAC [4]. Pioneered by Costeira and Kanade's work, multi-body factorization based methods have been proposed [5] [6] for segmenting independent affine motions. These algorithms use as input a matrix of 2D feature trajectories (sequences of image coordinates of feature points across multiple frames), then use algebraic factorization techniques to cluster the feature trajectories into groups with different motions. One issue with the factorization method is that it assumes independent motion. Recently, to address more complicated scenes that exhibit partially
dependent motion, [7] [8] propose to solve motion segmentation by clustering the motion subspaces spanned by the feature trajectories. Salient features are not the only visual cue for analyzing motion. Researchers have widely used dense appearance measurements for tracking 3D motion. Traditional 3D appearance-based methods usually assume a 3D texture mapped model of the target object that is acquired off-line [9] [10] [11] or on-line [12]. The region of interest in the image is precisely initialized by an external (usually manual) method and is assumed to be accurately predicted by projecting the 3D model into the image space using the estimated motion. In addition to 3D models, researchers have also explored acquiring a parametric representation of the scene appearance directly from training image samples. For instance, Murase and Nayar proposed an eigenspace-based recognition method and demonstrated tracking 1D motion of a rigid object [13]. Deguchi applied a similar eigenspace representation to simultaneously track rigid motions of the target camera and object [14]. Most image-based approaches require a large number of training images. Recently, a differential approach has been proposed for tracking 3D camera motion in complicated scenes without any prior information [15]. However, it made the assumption of a static scene. We believe that by providing semantic information about the underlying scene motion, a dense appearance-based 3D motion segmentation method could be valuable to image-based methods. Moreover, the model-based methods may also benefit from motion segmentation, as it provides an alternative to the manual initialization process to locate the target. Compared to the prosperous research in feature-based techniques, dense (per-pixel) 3D motion segmentation is to a large extent unexplored. To our knowledge, only one effort has been made to address dense 3D motion segmentation [16]. In that approach, image regions are segmented using optical flow, or more exactly, covariance-weighted optical flow approximated using spatial and temporal intensity derivative measurements. To address the noisy flow estimate, the authors proposed to compute a covariance-weighted flow-field using intensity measurement, under the assumption of brightness constancy [17]. A covariance-weighted flow-field matrix is formed by stacking row vectors of transformed 2D flows of all image regions across multiple frames. Motion-based segmentation is achieved by factorizing the covariance-weighted flow-field matrix into a motion matrix and a shape matrix. Then regions with the same motion are grouped by computing and sorting a reduced row echelon form of the shape matrix. In this paper, we will present an approach to clustering individual image pixels associated with different 3D rigid motions. Similar to [16], our method is based on the observation that the image measurements captured from different perspectives across multiple frames span a linear subspace. However, instead of 2D flow fields we use the less noisy 1D pixel intensities as the input measurements. Specifically, we introduce the notion of the pixel intensity trajectory, a vector that represents the intensity changes of a specific pixel over multiple frames. Like the 2D feature trajectories, the intensity trajectories of pixels associated with the same motion span a low-dimensional linear subspace. We therefore formulate the problem of motion segmentation as that of clustering local subspaces. Unlike
the flow-based technique, this linear model of the intensity measurements does not require strict brightness constancy. As we will discuss later, it can be extended to accommodate more general cases, such as illumination changes on a Lambertian surface under directional lighting. For segmenting motion subspaces, we apply spectral clustering to the intensity trajectories. This classification technique addresses some issues of direct matrix factorization, such as the noise sensitivity [18] and the difficulty in handling partially dependent motion [19].
2 Clustering Motion Subspaces
2.1 Intensity Trajectory Matrix
We begin our discussion in scenes with constant uniform illumination. In this case, the image intensity can be represented as a function of the pose P of the corresponding imaged surface patch in the camera viewing space. Let P = [x, y, z, α, β, γ] represent the relative pose between the object and the camera. Let I(u, P) be the image intensity, or a filtered version of the image intensity, of a pixel u = [u_x, u_y] captured at a pose P. Using the brightness constancy equation, we can compute a local linearization of the intensity function using a Taylor expansion. Let dP be the 3D motion. If dP is small, namely the image motion caused by dP is sub-pixel, the change of intensity dI can also be locally linearized as

dI = I(u, P + dP) − I(u, P) = (∂I/∂P) dP .    (1)

Consider acquiring a reference image I_0 at pose P_0 and a sequence of f images I_i at nearby poses P_i, and then computing f difference images dI_i = I_i − I_0 (i = 1, ..., f). Next assign each pixel an f-vector that represents its intensity changes over the f different images corresponding to the motions dP_i. We call the f-vectors of intensity changes dI = [dI_1, dI_2, ..., dI_f] pixel intensity trajectories (as opposed to the 2D feature trajectories). Next construct an intensity trajectory matrix W that combines the intensity trajectories of all n image pixels. The rows of W represent difference images, and its columns represent pixel intensity trajectories:

W = | dI_{1,1} ... dI_{1,n} |
    |   ...    ...    ...   |
    | dI_{f,1} ... dI_{f,n} |
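Assembling W from a reference image and f nearby samples is straightforward; the NumPy sketch below is a minimal illustration only (blurring, sub-sampling, and cluster-specific details are omitted), not the authors' code.

```python
import numpy as np

def intensity_trajectory_matrix(frames, reference):
    """Stack per-pixel intensity changes into the f x n matrix W.

    frames:    list of f images I_i taken at nearby poses (same shape as reference)
    reference: reference image I_0
    Each column of W is one pixel's intensity trajectory dI = [dI_1, ..., dI_f].
    """
    ref = reference.astype(np.float64).ravel()
    rows = [frame.astype(np.float64).ravel() - ref for frame in frames]
    return np.vstack(rows)           # shape (f, n), n = number of pixels

# For a single rigid motion the columns of W should span a subspace of rank <= 6,
# which can be checked numerically, e.g. with np.linalg.matrix_rank(W, tol=...).
```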
2.2 Motion Subspaces
Consider a scene with a single 3D rigid motion. Using equation (1), W can be decomposed into two matrices, a motion matrix M of size f × 6 and an intensity Jacobian matrix F of size n × 6, as follows:

W = M F^T ,    (2)

F = | (∂I/∂P)_{1,1} ... (∂I/∂P)_{1,6} |       M = | dP_{1,1} ... dP_{1,6} |
    |      ...       ...      ...     |  ,        |   ...    ...    ...   | .
    | (∂I/∂P)_{n,1} ... (∂I/∂P)_{n,6} |           | dP_{f,1} ... dP_{f,6} |

If the scene texture and the motion are non-degenerate, M and F are of rank 6. Thus the intensity trajectory matrix W is at most rank 6 (less for degenerate cases). In other words, the intensity trajectories of pixels associated with a single 3D rigid motion span a linear subspace whose rank is less than or equal to 6. Now consider the intensity trajectory matrix when the scene contains k different motions. In this case, the image pixels belong to k groups; each group corresponds to the scene surfaces undergoing the same motion. To demonstrate the structure of the W matrix, we assume a certain permutation matrix Λ such that

W = |W_1, W_2, ..., W_k| Λ = |M_1, M_2, ..., M_k| blockdiag(F_1^T, F_2^T, ..., F_k^T) Λ ,    (3)

W_i = M_i F_i^T  (i = 1, 2, ..., k) ,    (4)

where M_i and F_i are the motion matrix and intensity Jacobian matrix for the i-th group, and W_i is the concatenation of the intensity trajectories of the pixels in that group. Again rank(W_i) ≤ 6. From Equations (3) and (4), we can see that the intensity trajectories captured in a scene with k rigid motions can be clustered into k groups, which span k linear subspaces of rank no more than 6. This indicates that motion segmentation can be achieved through subspace clustering.
2.3 Motion Subspaces under Directional Illumination
The previous analysis assumes brightness constancy. In this section, we will show that this constraint can be relaxed to accommodate scenes with Lambertian objects and constant directional light sources. Consider a scene that consists of m light sources with directions L_i and magnitudes l_i (i = 1...m), and a 3D point p on a convex object with surface normal N and albedo λ. If we denote the incidence angle, the angle between the ray from light source i and the surface normal at p, by θ_i, the intensity of p can be written as

I = Σ_{i=1}^{m} l_i λ max(L_i · N, 0) = Σ_{i=1}^{m} l_i λ max(cos θ_i, 0) .    (5)

Denote the half cosine function as k_i = max(cos θ_i, 0). Its derivative can be written as¹

∂k_i/∂θ_i = −sin θ_i  for −π/2 < θ_i < π/2,  and 0 otherwise.    (6)

¹ The partial derivative ∂k_i/∂θ_i is unbounded at θ_i = 0. This discontinuity only affects pixels lying exactly on the illumination silhouette. In practice, its effect is usually blurred out by the image low-pass filtering process (see Section 4).
Now let us consider the change of the intensity caused by the motions of the object and the camera. We denote the object motion as dP_o. Unlike the dP used in the previous sections, dP_o is defined in the world space. We begin our discussion by assuming a fixed camera. When the object motion contains nonzero rotational components, the surface normal and the incidence angles change accordingly. Denote the change of the incidence angle of light source i as dθ_i. For a small dP_o, and thus a small dθ_i, we can apply a Taylor expansion and represent the change of the pixel intensity dI as a linear function of dθ_i:

dI = λ Σ_{i=1}^{m} l_i (∂k_i/∂θ_i) dθ_i .    (7)

From Equation (7), we can see that the change of intensity dI of point p is a linear function of dθ_i. For fixed distant light sources, the incidence angle θ_i (i = 1...m) is determined by the surface normal N; therefore, θ_i is a function of N. Under small motion, we can approximate the change of incidence angle dθ_i as a linear function of the change of the surface normal dN:

dθ_i = arccos(L_i · (N + dN)) − arccos(L_i · N) ≈ − (1/√(1 − (L_i · N)²)) L_i · dN = − (1/sin θ_i) L_i · dN .    (8)

If θ_i is not zero, it is clear that dθ_i is a linear function of dN. Notice that θ_i is zero only when N and L_i align with each other; in this case, dN is perpendicular to L_i. Using the small angle approximation sin θ = θ, we have dθ_i = sin dθ_i = dN. From Equations (7) and (8), we can see that dI is a linear function of dθ_i, which is a linear function of dN. Since dN is clearly a linear function of dP_o, the change of intensity dI is a linear function of the object motion dP_o. Under the small motion assumption, dN has only two degrees of freedom (on the plane perpendicular to N). Thus the change of intensity dI caused by the change of illumination lies in a 2D subspace. For a fixed camera, the relative motion between the object and the camera dP is the same as dP_o; thus the 2D illumination subspace is embedded in the 6D motion subspace. In more general cases, where both the object and the camera move independently, dP is independent of dP_o, and the 2D illumination subspace and the 6D motion subspace are orthogonal. Therefore, for a scene with convex Lambertian objects and constant directional light sources, the intensity trajectories of pixels corresponding to the same underlying motion generally span an 8D subspace.
3 Motion Segmentation by Clustering Local Subspaces
We have discussed that given a number of local image samples captured at nearby poses (sub-pixel motion), one can construct an intensity trajectory matrix W . The 3D motion segmentation can be formulated as clustering columns of W with respect to their different underlying motion subspaces. The column clustering can be achieved by factorizing the measurement matrix [5] [6] [16]. However, matrix factorization requires the underlying motions to be
independent [19], an assumption often violated in real environments. Recently, researchers have attempted to address partially dependent motion. Most notable are Vidal and Hartley's algebraic approach [7] and Yan and Pollefeys's spectral approach [8]. A review can be found in [20]. We employ the so-called Local Subspace Affinity (LSA) method for clustering motion subspaces [8]. The LSA algorithm is based on local linear projection and spectral clustering. Instead of working directly on the trajectory matrix W, LSA fits a local subspace for each point and constructs a similarity matrix A using the pairwise distances between the local subspaces. Motion segmentation is achieved by spectral clustering of the similarity matrix. The algorithm can be described in four steps:
Step 1. Dimension reduction and data normalization: Remove redundant dimensions (usually contributed by noise) by projecting the trajectories from R^f onto a lower dimensional space R^l using SVD. Then normalize these l-vectors onto a unit hyper-sphere.
Step 2. Local subspace estimation: For each projected point p_i, find its nearest neighbors on the hyper-sphere (not in the image space) and compute a local linear subspace S_i of dimension m.
Step 3. Similarity matrix construction: Compute the distances (principal angles) between local subspaces, and construct a similarity matrix A using Equation (9), where θ_ijh is the h-th component of the principal angle vector between two local subspaces S_i and S_j:

A_ij = exp(−Σ_{h=1}^{m} sin² θ_ijh) .    (9)
Step 4. Spectral clustering: Apply spectral clustering [21] to the similarity matrix A and segment the data into k clusters, where k is the number of different rigid motions in the scene.
In [8], the dimensions of the projected space l and of the local subspace m are automatically determined using a rank detection algorithm to accommodate general unknown motion such as articulated or non-rigid motion. Since this paper only addresses 3D rigid motion, we choose l and m to be 6k and 6 for scenes with uniform lighting, or 8k and 8 for directional lighting. There are two potential causes of segmentation error in the above algorithm. First, the neighbors selected in step 2 can be pixels from different subspaces. Second, the selected neighbors may not fully span the underlying motion subspace. In both cases, the local subspace tends to have similar distances to several motion subspaces, and misclassification may occur. To address these issues we have developed a refinement procedure (Step 5). In this procedure, we identify ambiguous pixels by comparing their distances to different motion subspaces, then reclassify them using the spatial continuity of the moving objects.
Step 5a.1: For each cluster, compute a global motion subspace spanned by all the pixels belonging to it, using the result from step 4.
Step 5a.2: For each pixel, compute the pixel-to-cluster distance as the distance between its local subspace and its classified global subspace. Then for each cluster, compute the median of the in-cluster pixel-to-cluster distance. Step 5a.3: For each pixel compute the distances between its local subspace and all k global subspaces, normalized by the median in-cluster distance. Compute the ratio of the smallest and the second smallest normalized distances. Classify a pixel as an ambiguous-pixel if its ratio is bigger than a threshold (in all the experiments we set it to be 0.7). Step 5b: For each ambiguous pixel, search for its neighbors in the image space and classify it to the majority class.
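The LSA-style clustering of Steps 1-4 can be sketched as follows. This is an illustrative NumPy/scikit-learn rendering under several assumptions (a fixed neighborhood size, scikit-learn's SpectralClustering for Step 4, and no Step 5 refinement); it is not the implementation evaluated in this paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering  # assumed to be available

def lsa_segment(W, k, proj_dim=None, sub_dim=6, n_neighbors=12):
    """Local Subspace Affinity-style clustering of pixel intensity trajectories.

    W: f x n intensity trajectory matrix (columns = pixels).
    k: number of rigid motions. Returns one label per pixel (column).
    """
    f, n = W.shape
    l = proj_dim or min(6 * k, f)                 # Step 1: project trajectories to R^l
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    P = U[:, :l].T @ W
    P = P / (np.linalg.norm(P, axis=0, keepdims=True) + 1e-12)   # unit hyper-sphere

    bases = []                                    # Step 2: local subspace per point
    for i in range(n):
        d = np.linalg.norm(P - P[:, [i]], axis=0)
        nbrs = np.argsort(d)[:n_neighbors + 1]    # includes the point itself
        Ui, _, _ = np.linalg.svd(P[:, nbrs], full_matrices=False)
        bases.append(Ui[:, :sub_dim])

    A = np.zeros((n, n))                          # Step 3: affinity from principal angles
    for i in range(n):
        for j in range(i, n):
            cosines = np.clip(np.linalg.svd(bases[i].T @ bases[j],
                                            compute_uv=False), 0.0, 1.0)
            A[i, j] = A[j, i] = np.exp(-np.sum(1.0 - cosines ** 2))

    # Step 4: spectral clustering of the precomputed affinity matrix
    return SpectralClustering(n_clusters=k, affinity='precomputed').fit_predict(A)
```

The pairwise loops are O(n²) and are only practical for the heavily sub-sampled images used in these experiments.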
4 Acquiring Local Appearance Samples
We have formulated the problem of motion segmentation as clustering linear subspaces spanned by pixel intensity trajectories. The pixel intensity trajectories are computed from a sequence of local image samples. In theory, to span a motion subspace of rank k, we need k + 1 image samples. This number is usually bigger in practice due to noise. Since our subspace formulation is based on linearizing the local appearance manifold (see Equation (1)), the motion of the imaged surface across the sequence needs to be small (within the linear region). It is feasible to acquire sufficient local samples for normally moving objects using commodity imaging devices. First, we can use the common technique of blurring the original image to smooth the appearance manifold; the enlarged linear region can then accommodate larger motion. Second, commodity cameras have become fast enough to densely sample motion in most practical scenes. For instance, the Point-Gray Flea2 camera can capture at 80 frames per second at VGA resolution. Moreover, for sampling 3D rigid motion under constant uniform illumination, the number of frames can be reduced by using a small-baseline camera cluster. This technique is based on the dual relationship between the camera motion and the object motion: under the brightness constancy assumption, images of an object captured at a specific pose from different perspectives can be considered as image samples of that object captured at different poses from the same perspective. A prototype of such a small-baseline camera cluster is described in [15]. Commercial products are also available; an example is Point-Gray's ProFUSION, a 5x5 camera array with 12mm spacing.
5 Experiment
To begin we used a camera cluster to capture some intensity trajectories. To do so we implemented a differential camera cluster similar to that used in [15]. Our camera cluster contains four small baseline Point-Gray Flea2 black-and-white cameras. At each frame time, we use the cluster to acquire seven local appearance samples. In addition to the four real samples from the physical cameras, we
also generate three simulated images as in [15]. This is achieved by reprojecting one real image to three different synthetic image planes that are generated by rotating the image plane around its camera center. Simultaneously capturing multiple spatial samples helps to reduce the number of temporal frames. In addition, such a cluster setting can ensure the capture of the full 6D motion subspace for any object in the scene (it only guarantees the motion subspace of each object to be 6D; the motion subspaces of different objects can still be partially dependent), even if the underlying rigid motion of that object is degenerate within the sequence. Notice that while we use a camera cluster in some of our experiments, for the above reasons, our motion segmentation algorithm is general and is not restricted to a cluster setup. Sections 5.2 and 5.3 show two examples of segmentation using a single physical camera. Our cameras capture images at VGA resolution. However, to accommodate larger motion, we blurred the images to smooth the appearance manifold. In all the experiments, we used a Gaussian filter with σ = 12 to blur the original image and sub-sampled the blurred image at a 20-to-1 rate. We ran our motion segmentation algorithm on the sub-sampled image. As a result of the sub-sampling, one pixel in the resulting segmented image corresponds to a 20×20 block in the original image. Note that because the cameras are packed closely in the cluster, some cameras see the lenses of other cameras in the border area. In addition, the blurring process introduces some additional border effects. For these reasons, in our experiments we only processed the inner regions of the images. We tested our algorithm on three real data sets. All of them contain two rigid motions: the camera and one moving object.
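The pre-processing just described (Gaussian blur with σ = 12 followed by 20-to-1 sub-sampling) amounts to a couple of lines; the sketch below uses SciPy's gaussian_filter as an assumed stand-in for whatever low-pass filter the authors used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # assumed to be available

def preprocess(image, sigma=12.0, step=20):
    """Blur to smooth the appearance manifold, then sub-sample at a step-to-1 rate.

    One pixel of the result corresponds to a step x step block of the original
    image, matching the pre-processing described above.
    """
    blurred = gaussian_filter(image.astype(np.float64), sigma=sigma)
    return blurred[::step, ::step]
```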
5.1 Controlled Motion
Our first experiment demonstrates motion segmentation in a scene with two controlled rigid motions. To control the motion, we mounted the camera cluster on a 1D translational platform, and a checkerboard on a rail. Between each frame, we shifted the camera and the checkerboard (4mm for the camera, 5mm for the checkerboard) along the directions of their rails. We captured six frames for a total of 42 real and synthetic images, and extracted intensity trajectories from these images. The classification results are presented in Fig. 1. Fig. 1(a) shows the segmentation without refinement. The pixel classification is super-imposed on the original image. Boundary pixels are not processed (they are marked as black). Dark gray and light gray are used to indicate foreground and background pixels, respectively. Ambiguous pixels computed in the refinement process are marked white in Fig. 1(b). The refined motion segmentation results are shown in Fig. 1(c). The misclassification error (number of mis-classified pixels divided by the number of all processed pixels) was 2.5%. Fig. 1(d) shows the similarity matrices permuted using the initial (top) and refined segmentations (bottom). We used the motion segmentation results with the differential tracking method proposed in [15]. At each frame, we used the seven local appearance samples acquired by the differential camera cluster to compute a first-order approximation
Fig. 1. Motion segmentation and tracking results for a controlled sequence. (a) Segmentation results before refinement. (b) Segmentation results with ambiguous pixels marked. (c) Segmentation results after refinement. (d) Similarity matrices before (top) and after (bottom) refinement. (e) Motion estimation of X translation. (f) Motion estimation of Z translation. (The plots in (e) and (f) show translation in mm versus frame, with curves for Estimate Whole, Real Back, Estimate Back, Real Front, and Estimate Front.)
of the local appearance manifold. When new cluster samples were captured at the next frame time, we then estimated the incremental motion using a linear solver. The estimation results of the controlled motion (restricted to the X-Z plane of the camera coordinate frame) are shown in Fig. 1(e,f). There are five lines in both figures. The “Estimated Whole” line (black, dashed) indicates the motion computed using all of the non-boundary pixels under the assumption of a rigid scene. Using the segmentation results, we estimated different motion components in the scene. The upper and lower pairs of lines respectively represent the real (true) and estimated motion of the checkerboard and the background with respect to the camera. The estimate using all of the pixels (unsegmented) appears to be a weighted average of the two underlying motions, as one would expect, and is clearly wrong for either motion. The result using the segmented pixels appears to be very accurate for the background motion, and reasonably accurate for the foreground motion. Notice that we do not assume scene geometry. While the foreground moving objects in the experiments are planar, the moving backgrounds contain objects with different shapes at different depths.
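One generic least-squares reading of this differential tracking step is sketched below: the local appearance Jacobian is fit from the cluster samples with known relative poses, and the incremental motion of each segmented group is then solved linearly from its pixels' intensity changes. The details (weighting, robustness, and the exact solver used in [15]) are assumptions and may differ from the original method.

```python
import numpy as np

def estimate_jacobian(dP_samples, dI_samples):
    """First-order appearance model dI ≈ J dP, fit from the cluster samples.

    dP_samples: s x 6 known relative poses of the real/synthetic samples
    dI_samples: s x n corresponding flattened difference images
    Returns J of shape n x 6.
    """
    Jt, _, _, _ = np.linalg.lstsq(dP_samples, dI_samples, rcond=None)
    return Jt.T

def incremental_motion(J, dI_new, mask=None):
    """Solve J dP = dI_new in the least-squares sense, optionally restricted to
    the pixels of one segmented motion (boolean mask over the n pixels)."""
    if mask is not None:
        J, dI_new = J[mask], dI_new[mask]
    dP, _, _, _ = np.linalg.lstsq(J, dI_new, rcond=None)
    return dP
```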
5.2 Free-Form Motion
In the second experiment, we used our algorithm to segment two free-form rigid motions—both the camera cluster and the checkerboard were moved by hand. For each frame, we extracted pixel intensity trajectories across a window of 15
adjacent frames. The motion segmentation results over 45 frames are shown in the first row of Fig. 2. Pixels corresponding to the moving checkerboard are marked white. The remaining pixels are classified as background. For a clearer representation, the boundary pixels are excluded. To explore the use of our algorithm in a single camera setting, we ran it again on the above sequence. But this time we only used images captured by one of the four physical cameras and three synthetic rotational cameras. Again, the intensity trajectories are extracted across a window of 15 frames. The results are shown in the second row of Fig. 2. Although only one physical camera is used, the segmentation results are reasonably good.
Fig. 2. Segmenting free-form rigid motions across a sequence of 45 frames. The checkerboard and the camera were moved by hand. Images (a)-(c) show the results on segmenting image sequences captured by a camera cluster. Images (d)-(f) show the results on segmenting image sequences captured using a single camera.
5.3 Motion Segmentation under Directional Lighting
Our last experiment demonstrates 3D motion segmentation in a scene with directional lighting using a single camera setup. The scene is illuminated with multiple ceiling lights and a directional light source from the left. A person sits on a chair and rotates. All light sources are static and constant. Again, we used images captured by one physical camera and three synthetic rotational cameras. Fig. 3(a) presents the illumination effect of the side light. The segmentation result is shown in (b)-(f). Pixels corresponding to the person and the chair are marked gray. Notice that most of the segmentation errors come from pixels on the back of the chair. This is due to the plain texture in that area. In this experiment, the intensity trajectories were extracted from a window of 15 frames.
Fig. 3. Motion segmentation in a scene with directional lighting across 40 frames. A person was sitting on a chair rotating; the camera was moved by hand. (a) An image from the original sequence showing the person illuminated by a directional light source from the left side. (b)-(f): Segmentation results on 5 frames from the sequence.
6 Conclusion and Future Work
We have presented a novel approach to 3D motion segmentation. Based on a local linear mapping between the changes of the pixel intensities and the underlying motions, we introduced the notion of pixel intensity trajectories, and formulated motion segmentation as clustering local subspaces spanned by those intensity trajectories. We have demonstrated our algorithm using some real data sets. Although we only discuss 3D rigid motion in this work, we believe the analysis can be extended to more general motion such as articulated, non-rigid motion. Just like parameterizing dI into a 6D space for rigid motion (see Equation (1)), for a general motion of rank m, we can map dI into an mD space represented by its motion parameters. In this case the motion vector dP becomes an mD vector. We can still decompose the intensity trajectory matrix W into the motion matrix M and the intensity Jacobian matrix F . All three matrices are of rank m. Thus the intensity trajectories of pixels corresponding to a general motion of rank m span an mD subspace.
References 1. Black, M.J., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding 63(1) (1996) 2. Xiao, J., Shah, M.: Accurate motion layer segmentation and matting. In: IEEE Conference on Computer Vision and Pattern Recognition (2005)
3. Wang, J., Adelson, E.: Representing moving images with layers. IEEE Transactions on Image Processing 3(5), 625–638 (1994) 4. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Readings in computer vision: issues, problems, principles, and paradigms, 726–740 (1987) 5. Costeira, J., Kanade, T.: A multi-body factorization method for motion analysis. In: International Conference on Computer Vision (1995) 6. Gear, C.: Multibody grouping from motion images. International Journal of Computer Vision 29(2), 133–150 (1998) 7. Vidal, R., Hartley, R.: Motion segmentation with missing data by power factorization and by generalized PCA. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 310–316 (2004) 8. Yan, J., Pollefeys, M.: A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 94–106. Springer, Heidelberg (2006) 9. Cascia, M.L., Sclaroff, S., Athitsos, V.: Fast, reliable head tracking under varying illumination: An apporach based on registration of texture-mapped 3d models. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(4) (2000) 10. Malciu, M., Preteux, F.: A robust model-based approach for 3d head tracking in video sequences. In: Fourth IEEE International Conference on Automatic Face and Gesture Recognition (2000) 11. Moritani, T., Hiura, S., Sato, K.: Real-time object tracking without feature extraction. In: International Conference on Pattern Recognition, pp. 747–750 (2006) 12. Zimmermann, K., Svoboda, T., Matas, J.: Multiview 3d tracking with an incrementally constructed 3d model. In: Third International Symposium on 3D Data Processing, Visualization, and Transmission (2006) 13. Murase, H., Nayar, S.: Visual learning and recognition of 3d objects from appearance. International Journal of Computer Vision 14(1) (1995) 14. Deguchi, K.: A direct interpretation of dynamic images with camera and object motions for vision guided robot control. International Journal of Computer Vision 37(1) (2000) 15. Yang, H., Pollefeys, M., Welch, G., Frahm, J.M., Ilie, A.: Differential camera tracking through linearizing the local appearance manifold. In: IEEE Conference on Computer Vision and Pattern Recognition (2007) 16. Zelnik-Manor, L., Machline, M., Irani, M.: Multi-body factorization with uncertainty: Revisiting motion consistency. International Journal of Computer Vision 68(1), 27–41 (2006) 17. Irani, M.: Multi-frame correspondence estimation using subspace constraints. International Journal of Computer Vision 48(3), 173–194 (2002) 18. Gruber, A., Weiss, Y.: Incorporating constraints and prior knowledge into factorization algorithms - an application to 3d recovery. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds.) SLSFS 2005. LNCS, vol. 3940, pp. 151–162. Springer, Heidelberg (2006) 19. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: International Conference on Computer Vision, pp. 586–591 (2001) 20. Tron, R., Vidal, R.: A benchmark for the comparison of 3d motion segmentation algorithms. In: IEEE Conference on Computer Vision and Pattern Recognition (2007) 21. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems (2001)
Vehicle Headlights Detection Using Markov Random Fields
Wei Zhang, Q.M. Jonathan Wu, and Guanghui Wang
Computer Vision and Sensing Systems Laboratory (CVSSL), Department of Electrical and Computer Engineering, University of Windsor, Windsor, Ontario, Canada N9B 3P4
{weizhang,jwu,ghwang}@uwindsor.ca
Abstract. Vision-based traffic surveillance is an important topic in computer vision. In the night environment, the moving vehicles are commonly detected by their headlights. However, robust headlights detection is obstructed by the strong reflections on the road surface. In this paper, we propose a novel approach for vehicle headlights detection. Firstly, we introduce a Reflection Intensity Map based on the analysis of light attenuation model in neighboring region. Secondly, a Reflection Suppressed Map is obtained by using Laplacian of Gaussian filter. Thirdly, the headlights are detected by incorporating the gray-scale intensity, Reflection Intensity Map, and Reflection Suppressed Map into a Markov random fields framework, which is optimized using Iterated Conditional Modes algorithm. Experimental results on typical scenes show that the proposed method can detect the headlights correctly in the presence of strong reflections. Quantitative evaluations demonstrate that the proposed method outperforms the existing methods.
1 Introduction
Vision-based traffic surveillance systems extract useful and accurate traffic information for traffic flow control, such as vehicle counts, vehicle flow, and lane changes. The basic techniques for traffic surveillance include vehicle detection and tracking [6-9], surveillance camera calibration, etc. However, most state-of-the-art methods concentrate on traffic monitoring in the daytime, and very few works address the issue of nighttime traffic monitoring. In the daytime, vehicles are commonly detected by exploiting gray-scale, color, and motion information. In the nighttime traffic environment, however, the above information becomes invalid, and vehicles can only be observed by their headlights and rear lights. Furthermore, there are strong reflections on the road surface (as shown in Fig. 1), which makes the problem complicated and challenging. In our work, we concentrate on vehicle headlights detection in gray-scale images. Several methods have been developed for vehicle headlights detection. In [1], Cucchiara and Piccardi detected the headlights by applying morphological analysis as well as taking advantage of the headlights' shape and size information. The
Fig. 1. The strong reflections on the road surface
detected headlights are then verified by matching the headlight symmetry as well as the luminance values along the normal to the main traffic direction. Chern and Hou [2] detected rear lights on the nighttime highway, and reflector spots are removed using brightness and area filtering. The rear lights are then paired by using the properties of rear lights and the lanes. Chen et al. [3] employed a color variation ratio to detect the ground illumination resulting from the vehicle headlights. The headlight information is extracted using a headlight classification algorithm. Cabani et al. [4] presented a self-adaptive stereo vision extractor of 3D edges for obstacle detection, and three kinds of vehicle lights are detected using the L*a*b* color space: rear lights and rear brake lights, flashing and warning lights, as well as reverse lights and headlights. Chen et al. [5] applied automatic multi-thresholds to nighttime traffic images to detect the bright objects, which are then processed by a rule-based procedure. The vehicles are identified by analyzing their headlight patterns and their distances to the camera-assisted car. In this paper, we present a novel approach for vehicle headlight detection. Firstly, based on the analysis of a light attenuation model in a neighboring region, we introduce a Reflection Intensity Map in which reflections possess much higher intensity than the headlights. Secondly, a Reflection Suppressed Map is obtained by using a Laplacian of Gaussian filter, and the reflection regions have much lower intensity in the proposed Reflection Suppressed Map. Thirdly, the vehicle headlights are detected by incorporating the gray-scale intensity, Reflection Intensity Map, and Reflection Suppressed Map into a Markov random field (MRF) framework, which is optimized using the Iterated Conditional Modes (ICM) algorithm. Experimental results on typical scenes show that the proposed method can detect the headlights correctly in the presence of strong reflections. Quantitative evaluations demonstrate that the proposed method outperforms the state-of-the-art methods.
2 The Features
Intuitively, vehicle headlights possess the highest intensity in the image. However, there may be strong reflections on the road surface, which can have intensities as high as the headlights and greatly deteriorate the performance of traditional methods. In this section, we introduce the Reflection Intensity Map as well as the Reflection Suppressed Map to discriminate the reflections from the vehicle headlights.
2.1 Reflection Intensity Map
For light sources in the night environment (such as vehicle headlights), there is commonly atmospheric scattering around the light sources. According to Bouguer's exponential law of attenuation [11], this atmospheric scattering can be modeled as follows:

E(d, \lambda) = I_0(\lambda) \cdot \gamma(\lambda) \cdot \exp(-d),   (1)

where I_0(\lambda) is the radiant intensity of the light source, \gamma(\lambda) is the total scattering coefficient for wavelength \lambda, and d is the distance from the light source to the scene point. Considering that the difference between the scattering on different points only depends on d, Equation (1) can be simplified as follows:

E(d, \lambda) = E_0(\lambda) \cdot \exp(-d), \qquad E_0(\lambda) = I_0(\lambda) \cdot \gamma(\lambda).   (2)
According to Eq. (2), a light scattering point E(d + \Delta, \lambda) can be written as

E(d + \Delta, \lambda) = E(d, \lambda) \cdot \exp(-\Delta),   (3)

from which we obtain the following relationship: the light scattering point E(d + \Delta, \lambda) of the light source E_0 at a distance of d + \Delta can be considered as having light source E(d, \lambda) at a distance of \Delta. Based on this relationship, we exploit each pixel's neighborhood information to compute the Reflection Intensity Map. For pixel (x, y) of image I, we define its interior neighboring region \Theta^i_{x,y} and exterior neighboring region \Theta^e_{x,y} as follows:

\Theta^i_{x,y} = \{I(x + u, y + v) \mid 0 \le u \le r, \; 0 \le v \le r\},
\Theta^e_{x,y} = \{I(x + u, y + v) \mid 0 \le u \le 2r, \; 0 \le v \le 2r\},   (4)

Fig. 2. Interior neighboring region and exterior neighboring region

Fig. 3. Two examples of Reflection Intensity Map
in which r is the width of \Theta^i_{x,y}, while \Theta^e_{x,y} has a width of 2r. Fig. 2 illustrates the defined interior and exterior neighboring regions. We assume \Theta^i_{x,y} and \Theta^e_{x,y} lie in the scattering of the same light source. Let MI^i_{x,y} be the pixel with the minimum intensity in \Theta^i_{x,y}, and MA^i_{x,y} be the pixel with the maximum intensity in \Theta^i_{x,y}. MI^i_{x,y} can then be deemed the scatter of MA^i_{x,y} according to Eq. (3). The scattering coefficient of \Theta^i_{x,y} can be estimated as follows:

\gamma(x, y) = \frac{MI^i_{x,y}}{MA^i_{x,y} \cdot \exp(-\ell^i_{x,y})},   (5)

where \ell^i_{x,y} is the distance between MA^i_{x,y} and MI^i_{x,y}. Assuming the scattering coefficient in \Theta^e_{x,y} is also \gamma(x, y), we then employ the neighborhood \Theta^e_{x,y} to compute the Reflection Intensity Map (RI) as follows:

RI(x, y) = |MI^e_{x,y} - MA^e_{x,y} \cdot \gamma(x, y) \cdot \exp(-\ell^e_{x,y})|,   (6)

where MI^e_{x,y} and MA^e_{x,y} are the pixels with the minimum and maximum intensity in \Theta^e_{x,y}, respectively, and \ell^e_{x,y} is the distance between MA^e_{x,y} and MI^e_{x,y}. According to Eq. (3), MI^e_{x,y} can be considered as the scatter of MA^e_{x,y} with scattering coefficient \gamma(x, y). Apparently, RI should take a low value in the headlights' ambient region; in flat regions (including the headlights), RI also takes a low value because the Reflection Intensity Map is essentially an edge detector; in the reflection regions, RI takes a high value. Two examples of the Reflection Intensity Map are shown in Fig. 3, with the original images given in Fig. 1. It can be seen that the strong reflections have much higher intensity than the headlights in the proposed Reflection Intensity Map.
2.2 Reflection Suppressed Map
The Laplacian of Gaussian (LoG) filter is commonly used for edge detection. In this research, we use the LoG filter to obtain the Reflection Suppressed Map. The LoG filter is defined as follows:

G(u, v) = \frac{u^2 + v^2 - 2\sigma^2}{\sigma^4} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right),   (7)

in which \sigma is the standard deviation. We normalize G to have a unity maximum value and let the result be \tilde{G} = G / \max(G). The LoG filter is illustrated in Fig. 4(a). According to the atmospheric scattering model in [11], the intensity of the light source decreases in an exponential manner, as shown in Fig. 4(b). When the LoG filter is applied to the image, a high value is obtained in the exponentially decreasing region around the headlights, while a negative value is obtained in the light source region because of the negative value at the center of the LoG filter. Let the negative of the resultant image of the filter \tilde{G} applied on image I be S:

S = -I \circledast \tilde{G},   (8)
Fig. 4. (a) The Laplacian of Gaussian filter; (b) the exponential attenuation property of the light source
Fig. 5. Reflection Suppressed Map (first row) and the corresponding F (second row; image sum, in units of 10000, versus the standard deviation of the LoG filter)
in which \circledast denotes the convolution operation. In this implementation, the parameter \sigma has a large effect on S, and we calculate the sum of S as F:

F = \sum_{(x,y)} S(x, y).   (9)
By setting \sigma to different values, we obtain F as a function of \sigma, and \sigma is set to the value that yields the minimum F, because \tilde{G} reaches its highest correlation with I when the sum F reaches its minimum. Because of the property of the LoG filter, S commonly has a relatively high value on the headlights' boundary, a slightly lower value on the headlight itself, and a negative value in the reflection regions. Here we apply a flood-fill operation on S to further increase the intensity of the headlight regions, and let the resultant image be the Reflection Suppressed Map (RS). In Fig. 5, we present two examples of RS (first row) as well as the corresponding F (second row), with the original images given in Fig. 1. It can be seen that strong reflections have much lower intensity than headlight regions. We select the headlight pixels and strong reflection pixels in the 'High intensity' sequence (see Fig. 8), and the joint distribution of RI, RS, and I is depicted in Fig. 6 using 2441 reflection pixels and 1785 headlight pixels. The strong reflection pixels are selected as those whose intensity is larger than 0.8, and I, RI, and RS are all normalized to range between 0 and 1. From the
Fig. 6. Distribution of headlights and strong reflections on the image intensity, RS, and RI
figure, we can see that the headlights and reflections have distinct distribution in the proposed feature space.
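A minimal sketch of the Reflection Suppressed Map of Eqs. (7)-(9) is given below: the normalized LoG kernel is convolved with the image for several candidate values of \sigma, and the \sigma whose response has the smallest sum F is kept. The kernel size and the set of candidate \sigma values are assumptions of this sketch, and the final flood-fill step is only indicated by a comment.

```python
import numpy as np
from scipy.ndimage import convolve

def log_kernel(sigma, size=15):
    """Laplacian-of-Gaussian kernel of Eq. (7), normalized to unit maximum."""
    ax = np.arange(size) - size // 2
    u, v = np.meshgrid(ax, ax)
    G = (u**2 + v**2 - 2 * sigma**2) / sigma**4 * np.exp(-(u**2 + v**2) / (2 * sigma**2))
    return G / G.max()

def reflection_suppressed_map(I, sigmas=(0.25, 0.5, 0.75, 1.0, 1.5, 2.0)):
    """Sketch of Eqs. (8)-(9): keep the sigma minimizing F = sum(S), S = -I (*) G."""
    best_S, best_F = None, np.inf
    for sigma in sigmas:
        S = -convolve(I, log_kernel(sigma))   # Eq. (8)
        F = S.sum()                           # Eq. (9)
        if F < best_F:
            best_F, best_S = F, S
    # A flood-fill on best_S (omitted here) would further raise the headlight regions.
    return best_S
```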
3 MRF Based Headlights Detection
In this section, we incorporate RI, RS, and I into an MRF framework to detect the vehicle headlights, and the MRF is optimized using the ICM algorithm. Because the headlight regions commonly have high intensity, we apply a simple threshold \tau on I, and let the resultant set of pixels be

\kappa = \{(x, y) : I(x, y) > \tau\}.   (10)
The headlight detection is performed on the pixels in \kappa. Therefore, the problem comes down to finding the optimal label image \Omega based on the features (I, RS, RI). Let f = (I, RS, RI), and let \Omega have two kinds of labels: headlights \alpha and reflections \beta. We then attempt to find the \Omega that maximizes the posterior probability (MAP) P(\Omega|f), which, according to Bayes' theorem, is proportional to

P(\Omega|f) \propto P(f|\Omega) P(\Omega).   (11)

The optimal label becomes

\Omega = \arg\max P(f|\Omega) P(\Omega).   (12)

In an MRF model, one pixel's label depends on the labels of its neighboring pixels, which is equivalent to a Gibbs process. According to the Clifford-Hammersley theorem [12], P(\Omega) with respect to a neighborhood is given in Gibbs form as

P(\Omega) = \frac{1}{Z} \exp[-U(\Omega_\vartheta)],   (13)

in which Z is the sum of the numerator over the labels and U(\cdot) is the energy function. U(\cdot) is commonly formulated as

U(\Omega) = \sum_{\kappa} \sum_{\vartheta \in \kappa} \rho_\kappa \cdot \omega(\Omega_\kappa, \Omega_\vartheta),   (14)
Fig. 7. (a) First- to fourth-order neighborhood of site κ; (b) Markov parameters
Fig. 8. Experimental Results on the ’High intensity’ sequence
where \omega(\Omega_\kappa, \Omega_\vartheta) = -1 if \Omega_\kappa = \Omega_\vartheta, and \omega(\Omega_\kappa, \Omega_\vartheta) = 0 if \Omega_\kappa \neq \Omega_\vartheta; \rho_\kappa is the Markov parameter; and \kappa ranges over the spatial fourth-order cliques (see Fig. 7(a)). In this implementation, we set \rho_\kappa in an exponentially attenuating manner according to Eq. (2), as shown in Fig. 7(b). This means that, for site \kappa, the effect of its neighboring labels decreases exponentially with the neighboring pixels' distance to site \kappa. The observed image is commonly acquired by adding a noise process to the true image. Given the perfect image, we can write the density for \Omega as

P(f|\Omega) = \prod_{\kappa} P(f_\kappa|\Omega_\kappa).   (15)
We model the conditional densities P(f_\kappa|\Omega_\kappa) as Gaussian distributions:

P(f_\kappa|\Omega_\kappa = \alpha) = \frac{1}{\sigma_\alpha \sqrt{2\pi}} \exp\!\left(-\frac{(f_\kappa - \mu_\alpha)^2}{2\sigma_\alpha^2}\right), \qquad
P(f_\kappa|\Omega_\kappa = \beta) = \frac{1}{\sigma_\beta \sqrt{2\pi}} \exp\!\left(-\frac{(f_\kappa - \mu_\beta)^2}{2\sigma_\beta^2}\right),   (16)

where \mu_\alpha, \mu_\beta, \sigma_\alpha, and \sigma_\beta are the mean values and covariance matrices of the headlights and reflections, respectively. We employ the ICM algorithm [10] to obtain the solution of Eq. (12): ICM uses a greedy strategy to reach an iterative local minimum, and thus provides the solution of the MAP problem efficiently; furthermore, it converges within a few iterations.
\Omega is initialized with a random labeling. In every iteration of ICM, the parameters \mu_\alpha, \mu_\beta, \sigma_\alpha, and \sigma_\beta are updated as follows:

\mu_\alpha = M(f_\kappa|\Omega_\kappa = \alpha), \quad \mu_\beta = M(f_\kappa|\Omega_\kappa = \beta), \quad \sigma_\alpha = D(f_\kappa|\Omega_\kappa = \alpha), \quad \sigma_\beta = D(f_\kappa|\Omega_\kappa = \beta),   (17)
in which M (·) and D(·) represent the operations for mean value and covariance matrix calculation, respectively.
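The ICM update of Eqs. (13)-(17) can be sketched as below. The feature image f stacks (I, RS, RI). For readability the class-conditional densities are evaluated with a diagonal covariance, only the first-order neighborhood is used, and the exponentially attenuating Markov parameters of Fig. 7(b) are collapsed into a single weight; all three are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def icm_segment(f, mask, n_iter=5, rho=np.exp(-1)):
    """Sketch of MAP labeling by ICM. f: HxWx3 features (I, RS, RI);
    mask: boolean image of candidate pixels (Eq. 10); labels: 1=headlight, 0=reflection."""
    H, W, _ = f.shape
    labels = np.random.randint(0, 2, (H, W))
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # first-order cliques only
    for _ in range(n_iter):
        stats = {}
        for c in (0, 1):                           # parameter update (Eq. 17)
            sel = f[mask & (labels == c)]
            if len(sel) == 0:
                sel = f[mask]
            stats[c] = (sel.mean(axis=0), sel.var(axis=0) + 1e-6)
        for y in range(H):
            for x in range(W):
                if not mask[y, x]:
                    continue
                best_c, best_e = labels[y, x], np.inf
                for c in (0, 1):
                    mu, var = stats[c]
                    # negative log-likelihood (Eq. 16, diagonal covariance assumed)
                    e = 0.5 * np.sum((f[y, x] - mu) ** 2 / var + np.log(2 * np.pi * var))
                    # prior energy (Eq. 14): reward agreeing neighbors
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and labels[ny, nx] == c:
                            e -= rho
                    if e < best_e:
                        best_e, best_c = e, c
                labels[y, x] = best_c
    return labels
```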
4 Experimental Results
We have applied the proposed approach to three typical traffic sequences, the 'High intensity', 'High speed', and 'Rainy' sequences, to evaluate its effectiveness. In the 'High speed' sequence (see Fig. 9), the vehicles are moving at high speed; the 'High intensity' sequence (see Fig. 8) possesses a high traffic intensity; and the 'Rainy' sequence (see Fig. 10) is a rainy scene with strong reflections on the road surface. The results of the headlight detection are illustrated in Figs. 8-10, in which the vehicle headlights are depicted in cyan and the reflections are shown in yellow. We can see that the proposed approach detects the headlights robustly in all the sequences, and the strong reflections are effectively distinguished from the headlights. It should be noted that some interfering objects such as street lamps are also detected. The performance of the proposed method is also evaluated quantitatively to obtain a systematic assessment. Receiver Operating Characteristic (ROC) plots describe the performance of a classifier that assigns input data into dichotomous classes, and they are adopted for the quantitative evaluation of the proposed method. The ROC plot is obtained by testing all possible threshold values and, for each value, plotting the true positive ratio (TPR) on the y-axis against the false positive ratio (FPR) on the x-axis. The optimal classifier is
Fig. 9. Experimental Results on the ’High speed’ sequence
Fig. 10. Experimental Results on the ’Rainy’ sequence
Fig. 11. Quantitative evaluations (ROC curves: true positive rate vs. false positive ratio) on (a) the 'High intensity' sequence, (b) the 'High speed' sequence, and (c) the 'Rainy' sequence, comparing Cabani [4], Thou-Ho Chen [3], Yen-Lin Chen [5], and the proposed method
characterized by an ROC curve passing through the top left corner (0, 1). Here, TPR and FPR are computed as follows:

TPR = \frac{\text{true positives}}{\text{number of headlight pixels in the ground truth}}, \qquad FPR = \frac{\text{false positives}}{\text{number of reflection pixels in the ground truth}},   (18)
where true positives are the number of headlight pixels that are correctly detected; false positives are the number of reflection pixels that are detected as
headlights; and the ground truth is the correct detection result, obtained by manual segmentation in our experiments. One parameter that affects our algorithm is the threshold \tau in Eq. (10), and we tune \tau to trace out the ROC of the proposed method. We also quantitatively evaluated the methods in [3], [4] and [5] for comparison. For [3], we tune the parameter c in its Eq. (5); for [4], we tune the threshold on the L channel of the L*a*b* color space; for [5], we tune the multilevel thresholds employed in segmenting bright objects. The resulting ROCs are given in Fig. 11, from which it can be seen that the proposed method clearly outperforms the other methods on all the sequences.
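Under the pixel-level definitions of Eq. (18), one ROC point can be computed as in the short sketch below, where the detection result and the two ground-truth masks are assumed to be boolean images of the same size.

```python
import numpy as np

def roc_point(detected, gt_headlight, gt_reflection):
    """TPR/FPR of Eq. (18) for one threshold setting (boolean images)."""
    tpr = np.logical_and(detected, gt_headlight).sum() / max(gt_headlight.sum(), 1)
    fpr = np.logical_and(detected, gt_reflection).sum() / max(gt_reflection.sum(), 1)
    return tpr, fpr
```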
5 Conclusions
In this paper, we have proposed a robust headlight detection method. A Reflection Intensity Map was introduced based on the analysis of a light attenuation model, and a Reflection Suppressed Map was obtained by correlating the image with a Laplacian of Gaussian filter. The headlights were detected by incorporating the gray-scale intensity, Reflection Intensity Map, and Reflection Suppressed Map into a Markov random field framework. Experimental results on typical scenes show that the proposed model can detect the headlights correctly in the presence of strong reflections.
Acknowledgement This work was partially funded by the Natural Sciences and Engineering Research Council of Canada.
References
1. Cucchiara, R., Piccardi, M.: Vehicle Detection under Day and Night Illumination. ISCS-IIA 99, 789–794 (1999)
2. Chern, M.-Y., Hou, P.-C.: The lane recognition and vehicle detection at night for a camera-assisted car on highway. In: IEEE International Conference on Robotics and Automation (ICRA), September 2003, vol. 2, pp. 2110–2115 (2003)
3. Chen, T.-H., Chen, J.-L., Chen, C.-H., Chang, C.-M.: Vehicle Detection and Counting by Using Headlight Information in the Dark Environment. In: International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP), November 2007, vol. 2, pp. 519–522 (2007)
4. Cabani, I., Toulminet, G., Bensrhair, A.: Color-based detection of vehicle lights. In: Intelligent Vehicles Symposium, June 2005, pp. 278–283 (2005)
5. Chen, Y.-L., Chen, Y.-H., Chen, C.-J., Wu, B.-F.: Nighttime Vehicle Detection for Driver Assistance and Autonomous Vehicles. In: International Conference on Pattern Recognition (ICPR), vol. 1, pp. 687–690 (2006)
6. Lou, J.G., Tan, T.N., Hu, W.M., Yang, H., Maybank, S.J.: 3-D model-based vehicle tracking. IEEE Transactions on Image Processing 14, 1561–1569 (2005)
7. Zhang, W., Jonathan Wu, Q.M., Yang, X., Fang, X.: Multilevel Framework to Detect and Handle Vehicle Occlusion. IEEE Transactions on Intelligent Transportation Systems 9, 161–174 (2008)
8. Kato, J., Watanabe, T., Joga, S., Liu, Y., Hase, H.: HMM/MRF-based stochastic framework for robust vehicle tracking. IEEE Transactions on Intelligent Transportation Systems 5, 142–154 (2004)
9. Morris, B.T., Trivedi, M.M.: Learning, Modeling, and Classification of Vehicle Track Patterns from Live Video. IEEE Transactions on Intelligent Transportation Systems 9, 425–437 (2008)
10. Besag, J.E.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B 48, 259–302 (1986)
11. Bouguer, P.: Traité d'optique sur la gradation de la lumière (1729)
12. Geman, S., Geman, D.: Stochastic relaxation: Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741 (1984)
A Novel Visual Organization Based on Topological Perception Yongzhen Huang1 , Kaiqi Huang1 , Tieniu Tan1 , and Dacheng Tao2 1
National Laboratory of Pattern Recognition, Institute of Automation Chinese Academy of Sciences, Beijing, China {yzhuang,kqhuang,tnt}@nlpr.ia.ac.cn 2 Cognitive Computing Group, School of Computer Engineering Nanyang Technological University, Singapore
[email protected]
Abstract. What are the primitives of visual perception? The early feature-analysis theory insists that it is a local-to-global process, and it has acted as the foundation of most computer vision applications for the past 30 years. The early holistic registration theory, however, considers it a global-to-local process, of which Chen's theory of topological perceptual organization (TPO) has been strongly supported by psychological and physiological evidence. In this paper, inspired by Chen's theory, we propose a novel visual organization, termed computational topological perceptual organization (CTPO), which pioneers the early holistic registration in computational vision. Empirical studies on synthetic datasets show that CTPO is invariant to global transformations such as translation, scaling, and rotation, and insensitive to topological deformation. We also extend it to other applications by integrating it with local features. Experiments show that our algorithm achieves competitive performance compared with some popular algorithms.
1 Introduction
Thirty years ago, Marr's primal sketch theory [1], [2] claimed that the primitives of visual information are simple components of forms and their local geometric properties, e.g., line segments with slopes. Influenced by this famous theory, the description of visual information has made great progress in many computer vision applications; e.g., various image descriptors have been developed recently. One of the representatives is the Scale-Invariant Feature Transform (SIFT) [3]. Comparative studies and performance evaluations of image descriptors can be found in [4], wherein SIFT-based descriptors achieve the best performance. The SIFT descriptor constructs a histogram by accumulating the weighted gradient information around an interest point, with orientation reassignment around dominant directions. This strategy enhances the insensitivity to noise and the robustness to local rotation. However, it ignores the global information which is essential to encode deformation invariance. In addition, it encounters a great challenge in describing a meaningful structure. Variances in position and
orientation are often substantial and suggest that a scale grouping rule is insufficient to achieve an appropriate association of image fragments [5]. Although improvements emerge in specific applications, how to effectively organize local features is still very difficult and largely unexplored in computer vision. Perhaps it is necessary to reconsider the intrinsic nature of the problem: what are the primitives of visual perception? In this sense, one may need, like Gestalt, the whole as a guide to sum the parts. Chen's theory of topological perceptual organization (TPO) [6], [7], which assumes that wholes are coded prior to analysis of their separable properties or parts, is a view inherited from the Gestalt concept of perceptual organization. TPO is superior to the early feature-analysis theory in organizing local features and describing topological structures. Details are described in the next section. In this paper, we propose a novel visual organization based on Chen's theory, termed computational topological perceptual organization (CTPO). To prove the effectiveness of the proposed CTPO, we conduct empirical justifications on synthetic datasets. Inheriting the superiority of Chen's theory, CTPO is invariant to translation, scaling, and rotation, and insensitive to topological deformation. Besides, we integrate CTPO with popular local features and extend it to object categorization. Experiments show that CTPO achieves competitive performance compared with top-level algorithms.
2 Topological Perceptual Organization
Chen's topological perceptual organization theory [6], [7] is a view inherited from the early holistic registration theory and Gestalt psychology. Furthermore, it developed Gestalt psychology further and used thorough experiments to support the claims that 1) visual perception proceeds from global to local; 2) wholes are coded prior to local parts; 3) global properties can be represented by topologically invariant features. The most important concept in Chen's theory is the "perceptual object", which is defined as the invariant under topological transformations. A topological transformation [7] is, in mathematical terminology, a one-to-one and continuous transformation. Intuitively, it can be imagined as an arbitrary "rubber-sheet" distortion, in which neither breaks nor fusions can happen, e.g., a disc smoothly changing to a solid ellipse. Klein's Erlangen Program [8] shows that the topological transformation is the most stable among all geometric transformations. Moreover, it has been proved by neuroscience research [9] that, in the human vision system, topological transformation is the strongest stimulation among all the transformations. Chen's theory has been strongly supported by psychological and biological experiments. For example, they tested the response of bees to different topological shapes. In the first stage, bees are trained to find a specific object (a ring) with a reward of sugar water. In the second stage, the reward is removed to test the bees' reaction to different new objects. In test 1, the results show that the bees are in favor of the diamond image, which is topologically identical to the ring image. In test 2, the results show that the bees have no marked preference for either shape, because the ring
and the square share the same topological structure although they differ in local features. These experiments show that it is hard to differentiate topologically identical shapes. The experimental results are consistent with Chen's assumption that the topological structure is the fundamental component of the visual vocabulary. A comprehensive description and discussion of his theory may be found in a whole issue of "Visual Cognition" [7], [10], [11]. Chen's theory has opened new lines of research that are worthy of attention not only from visual psychologists but also from researchers in computer vision.
3 Computational Topological Perceptual Organization
Although Chen's theory of topological perceptual organization makes great progress in visual psychology, it does not provide a mathematical form to describe the topological properties of a structure. In this paper, we propose a computational topological perceptual organization (CTPO) to bridge the gap between Chen's psychological theory and computer vision theory. It is worth emphasizing that the concept of "global" in Chen's theory does not refer to the visual information of a whole image but to the topological structure of an object. For a whole complex image, it is currently hard to describe its topological properties by Chen's theory.
3.1 Topology Space
The core idea of topological perceptual organization is that the "perceptual object" preserves invariants under topological transformations. According to Chen's theory, the connectivity and the number of holes are essential to describe the properties of a topological structure because they are invariant measurements under topological transformations. We adopt the distance between pairs of pixels to describe the topological structure. However, the Euclidean measure is apparently not a good candidate. For example, in Fig. 1, two red crosses appear deceptively close, measured by their spatial Euclidean distance, but their spatial connectivity distance is large and reflects their intrinsic spatial relationship. Intuitively,
Fig. 1. An example illustrating the difference between Euclidean distance and geodesic distance
it is necessary to define the topological property in a space wherein the distance between two points can reflect the connectivity of a structure and a group of such distances can reflect the number of holes. Therefore, the geodesic distance, or shortest path, is a good choice. The geodesic distance, however, encounters a problem: how to define global properties (e.g., connectivity) on a discrete set of image pixels? Fortunately, the tolerance space [7] can be applied to deal with this problem. Definition 1. Let X be a finite set of discrete dots. The tolerance ξ refers to the range within which detailed variations are ignored for attaching importance to global properties. The set of dots X together with the tolerance is a tolerance space denoted as (X, ξ). The tolerance and the global properties of a discrete set, therefore, are mutually dependent concepts. Fig. 2 shows an example. For the human vision system, the tolerance ξ means the shortest noticeable distance. Under this definition, two points are connected only if they are within a specific tolerance. The notion of tolerance space not only resolves the problem of describing connectivity in a discrete set of image pixels but also builds the relationship between the scale and the structure.
Fig. 2. Illustration of the tolerance space. If the tolerance is 1mm, (a) and (c) share an identical topological structure (both of them are “two rings”). This description matches our holistic perception. As the tolerance increases to a specific value, e.g., 4mm, (a) and (b) share an identical topological structure. So do (c) and (d).
Based on the above analysis, we propose a topology space d^*, defined as

d^* = g(d'),   (1)

where g is the operation of calculating the geodesic distance. The distance between two pixels in the space d' is computed as

d'(i, j) = \begin{cases} d_1(i, j) + \lambda \cdot d_2(i, j), & \text{if } d_1(i, j) < \xi \\ \infty, & \text{otherwise,} \end{cases}   (2)

where d_1 and d_2 denote the spatial Euclidean distance and the intensity difference, respectively, \lambda is a tradeoff parameter, and \xi is the tolerance in Definition 1. The topology space considers the connectivity in two aspects: the spatial distance and the intensity difference. The latter weakens the impact of pixels that are greatly different from their neighbors in terms of intensity.
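A minimal sketch of Eqs. (1)-(2) follows: d' is built on a pixel graph whose edges exist only within the tolerance \xi, and d^* is obtained from it by shortest paths (here with scipy's Dijkstra routine). The function and parameter names, the brute-force O(n^2) edge construction, and the dense output matrices are assumptions made for clarity, not the authors' implementation.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def topology_space(I, pixels, xi=1.5, lam=0.5):
    """Sketch of Eqs. (1)-(2). pixels: list of (x, y) points of the structure.
    Returns (d_star, d_prime) as dense n x n arrays (np.inf where no direct edge)."""
    n = len(pixels)
    W = lil_matrix((n, n))
    for a in range(n):
        xa, ya = pixels[a]
        for b in range(a + 1, n):
            xb, yb = pixels[b]
            d1 = np.hypot(xa - xb, ya - yb)                    # spatial distance
            if d1 < xi:                                        # tolerance test (Eq. 2)
                d2 = abs(float(I[ya, xa]) - float(I[yb, xb]))  # intensity difference
                W[a, b] = W[b, a] = d1 + lam * d2
    d_star = dijkstra(W.tocsr(), directed=False)               # geodesic distances (Eq. 1)
    d_prime = np.full((n, n), np.inf)
    rows, cols = W.nonzero()
    d_prime[rows, cols] = W.toarray()[rows, cols]
    return d_star, d_prime
```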
3.2 Quotient Distance Histogram
In this section, we construct a quotient distance histogram (QDH) in the aforementioned topology space to describe topological properties. We adopt the quotient between d^*(i, j) and d'(i, j) as the vote to construct a histogram. This intuition is reasonable because d^*(i, j) contains rich structural information and the quotient between d^*(i, j) and d'(i, j) is scale-invariant. The value of each bin in the histogram is given by

h(k) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} I\big(\theta(i, j) \in B(k)\big),   (3)

\theta(i, j) = d^*(i, j) / d'(i, j),   (4)

where n is the number of pixels in the structure, I is the indicator function, B(k) is the range of the k-th bin, and d^*(i, j) and d'(i, j) are defined in Eq. (1) and Eq. (2). Note that d'(i, j) may be infinite, and then d^*(i, j) is infinite too; in this special case, we set \theta(i, j) to infinity. QDH can describe various topological structures; its effectiveness is demonstrated in the experiment of Section 4.1. It is important to understand that QDH is different from the Geodesic Intensity Histogram (GIH) [12] and the inner-distance [13] because: 1) QDH reflects the global statistical property through the quotient between d^*(i, j) and d'(i, j), whereas the inner-distance is defined as the shortest path between mark points of a shape silhouette; under that definition, even a "W" structure and an "S" structure cannot be differentiated by GIH, and QDH is scale-invariant while GIH is not; 2) besides the spatial distance, QDH also considers the intensity difference of the image, so it can be applied to gray images, while the inner-distance can only be used for binary images; 3) QDH is associated with the tolerance space, which emulates the visual characteristics of the human vision system; and 4) the motivations of QDH and the inner-distance are totally different: QDH is inspired by a significant visual psychology theory.
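Given d_star and d_prime from the previous sketch, the quotient distance histogram of Eqs. (3)-(4) can be accumulated as below. The bin edges are assumed to have been learned beforehand by equally dividing the training distribution of \theta, as described in the text; how \theta is read off for pairs with infinite d' is not fully specified in the paper, so the sketch follows the special-case rule literally and routes those pairs to a dedicated infinity bin.

```python
import numpy as np

def quotient_distance_histogram(d_star, d_prime, bin_edges):
    """Sketch of Eqs. (3)-(4): histogram of theta = d*/d' over all pixel pairs i < j."""
    n = d_star.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    dp, ds = d_prime[iu, ju], d_star[iu, ju]
    theta = np.full_like(ds, np.inf)
    ok = np.isfinite(dp)
    theta[ok] = ds[ok] / dp[ok]                   # Eq. (4)
    h, _ = np.histogram(theta[ok], bins=bin_edges)  # Eq. (3)
    h = np.append(h, np.count_nonzero(~ok))         # extra bin for theta = infinity
    return h / max(h.sum(), 1)                       # normalized histogram
```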
4 Experiments
In the experimental part, we first conduct empirical justifications on synthetic datasets to differentiate basic topological structures. Second, we combine CTPO with another model (EBIM [14]) and apply it to object categorization. There are three parameters in CTPO: 1) \xi is the tolerance; to obtain robust performance, it is necessary to use a group of tolerances, e.g., 1, \sqrt{2}, \sqrt{3}, 2, \sqrt{5} in our experiments. 2) \lambda is a tradeoff parameter between the spatial distance and the intensity difference; in our experiments, we empirically set \lambda to 0.5, which is robust in most cases. 3) B(k) is the range of each bin in the CTPO histogram; the performance is insensitive to it. In fact, we calculate the probability distribution of \theta(i, j) on the training set (or a part of the original samples) and then divide the distribution equally; every two division points then determine the range of a bin.
4.1 Structure Classification
We design two experiments to demonstrate the discriminative ability of CTPO on the different topological structures tested in Chen's psychological
Fig. 3. Examples of artificial (left) and real (right) images. The histogram in each row is the mean quotient distance histogram corresponding to the images of the row.
Fig. 4. (Please view in color) Comparison between CTPO and the SIFT-like descriptor in preserving the geometric nature of the topological structures manifold: (a) CTPO performance on artificial images, (b) SIFT performance on artificial images, (c) CTPO performance on real images, (d) SIFT performance on real images. In (a) and (c), different topological structures are effectively differentiated by our approach for artificial and real images, respectively. In (b), some different topological structures of artificial images are mixed by the SIFT-like descriptor. In (d), the SIFT-like descriptor has little ability to differentiate various topological structures.
research. In the first experiment, artificial images are used. For each category of topological structures, we first draw an initial image and then produce some variants by using the topological transformations defined in Section 2, e.g., translation, scaling, rotation, and deformation. In the second experiment, we choose real patches sampled from images. Fig. 3 shows some examples of artificial shapes and real image patches (with noise). From top to bottom, they are the round, the single ring, the double rings, the cross, the double holes, and the parallel. We also show their corresponding mean histograms in Fig. 3. Each mean histogram is the mean over all quotient distance histograms of a class of artificial shapes or real image patches. The mean histogram is almost identical for the artificial shapes and the real image patches in the same row, which shows that our algorithm can be applied to real images and is robust to noise. We compare the proposed quotient distance histogram with a SIFT-like descriptor. To implement the SIFT-like descriptor, we use 8 orientations for the gradient histogram calculation in all divided (4×4) areas to construct a standard feature vector (128 dimensions). Then the Isomap algorithm [15] is applied for dimensionality reduction to intuitively compare their ability of preserving the intrinsic properties of the topological structure. Fig. 4 shows the experimental results. Our approach (CTPO) greatly outperforms the SIFT-like descriptor on both artificial and real images. The experimental results are consistent with Chen's theory, and prove the effectiveness of the proposed topology space as well as the quotient distance histogram in differentiating various topological structures.
4.2 Object Categorization
According to Chen’s theory, the topological perceptual organization (TPO) should be applied to images or image patches with specific topologies. For advanced computer vision applications, e.g., object categorization, it is difficult to extract topological structures of real images with complex backgrounds. Therefore, we combine
Fig. 5. The object categorization framework for the experiments. The black part (the upper part) is EBIM [14] and the red part (the lower part) is CTPO. The C1 unit consists of the processed images after filtering by Gabor filters.
CTPO with EBIM [14]. The latter has been demonstrated to be an effective computational model for object categorization. The framework of EBIM is shown in the upper part of Fig. 5; more details about EBIM can be found in [14]. The reasons why we combine these two models are listed below.
– After being processed by the Gabor filters in EBIM, most meaningless points in the C1 units (in Fig. 5) are eliminated, and thus we can extract topological structures efficiently and effectively.
– EBIM matches an image with a large number of patches randomly sampled from the C1 units of the training images. The random sampling technique loses its effectiveness and generalization when the number is small, so EBIM has to randomly draw a large number of patches, with an unacceptable time cost in the training stage. Besides, meaningless patches tend to over-fit the training samples. Therefore, it is necessary to keep only patches with meaningful structures. Specifically, we use CTPO to extract their quotient distance histograms in order to cluster the original patches. Afterward, patches with meaningful topological structures are preserved and processed in the later computation.
Fig. 5 shows the framework of the experiments. In the following, we conduct object categorization experiments on two databases. The purpose of these experiments is to test the ability of CTPO to enhance EBIM for object categorization.
MIT-CBCL Street Scene database. The MIT-CBCL street scene database [16] contains three kinds of objects: car, pedestrian and bicycle. The training and testing examples are already divided in the database. Table 1 shows the performance comparison among our approach and C1 [17], [18], HoG [19], C1+Gestalt [18] and EBIM [14]. Our approach achieves the best performance in car and bicycle detection and is comparable to the best in pedestrian detection. The computation cost of our approach in the tests is comparable to EBIM and much less than the others. Besides, compared with EBIM, the training time is reduced from about 20 hours to about 2 hours because our approach removes a large number of meaningless patches from the matching.

Table 1. Object detection results obtained by several state-of-the-art methods on the MIT-CBCL Street Scene database. The performance is measured in terms of EER (Equal-Error-Rate). The last column is the average time cost to process a 128×128 image at test time.

Method       Car     Pedestrian  Bicycle  Time
C1           94.38   81.59       91.43    −
HoG          91.38   90.19       87.82    ≈ 0.5s
C1+Gestalt   96.40   95.20       93.80    > 80s
EBIM         98.54   85.33       96.49    ≈ 0.02s
Ours         98.80   90.28       96.86    ≈ 0.02s
GRAZ Database. The GRAZ database [20], built by Opelt et al., is a more complex database. To avoid the limitation that certain methods tend to emphasize background context, the backgrounds of the images in GRAZ-02 are similar across all categories of objects. Following the protocol in [20], 100 positive samples and 100 negative samples are randomly chosen as training samples, and 50 other positive samples and 50 other negative samples are chosen as testing samples. The experimental results on the GRAZ-02 database are shown in Fig. 6 as ROC curves. Although our algorithm does not achieve the best performance compared with [21], [22], it is still comparable to them, and we consider that CTPO is promising as psychophysically inspired initial research.
Fig. 6. (Please view in color) Comparison of ROC curves (true positive rate vs. false positive rate) on the GRAZ-02 database for the Bikes, Persons, and Cars categories. EBIM+CTPO is our approach; the other results are reported in [20].
5 Conclusion
In this paper, we have analyzed representative research on the primitives of visual perception. Inspired by Chen's theory of topological perceptual organization, a great breakthrough in visual psychology, we have proposed a computational topological perceptual organization (CTPO). The most important contribution of this paper is that we bridge the gap between Chen's psychological theory and computer vision theory. Specifically, we have analyzed Chen's theory from the viewpoint of computational vision and built a computational model for computer vision applications. Empirical studies have shown that CTPO is consistent with Chen's theory and outperforms some popular algorithms. It is necessary to emphasize that CTPO is not designed for specific applications but is a new viewpoint for understanding the organization of primal visual information in computer science; it can easily be extended to many other applications. The success of CTPO demonstrates the significance and potential of the early holistic registration in computer vision.
Acknowledgment This work is supported by the National Basic Research Program of China (Grant No. 2004CB318100), the National Natural Science Foundation of China (Grant No. 60736018, 60723005), NLPR 2008NLPRZY-2, the National Hi-Tech Research and Development Program of China (2009AA01Z318), and the National Science Foundation (60605014, 60875021).
References
1. Marr, D.: Representing visual information: A computational approach. Lectures on Mathematics in the Life Science 10, 61–80 (1978)
2. Marr, D.: A computational investigation into the human representation and processing of visual information. Freeman, San Francisco (1982)
3. Lowe, D.G.: Distinctive image features from scale-invariant key-points. International Journal of Computer Vision 2(60), 91–110 (2004)
4. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(10), 1615–1630 (2005)
5. Field, D.J., Hayes, A., Hess, R.F.: Good continuation and the association field: Evidence for local feature integration by the visual system. Vision Research 33, 173–193 (1993)
6. Chen, L.: Topological structure in visual perception. Science 218, 699–700 (1982)
7. Chen, L.: The topological approach to perceptual organization. Visual Cognition 12(4), 553–638 (2005)
8. Klein, F.: A comparative review of recent researches in geometry. Mathematische Annalen 43, 63–100 (1872)
9. Zhuo, Y., Zhou, T.G., Rao, H.Y., Wang, J.J., Meng, M., Chen, M., Zhou, C., Chen, L.: Contributions of the visual ventral pathway to long-range apparent motion. Science 299, 417–420 (2003)
10. Todd, J., et al.: Commentaries: stability and change. Visual Cognition 12(4), 639–690 (2005)
11. Chen, L.: Reply: Author's response: Where to begin? Visual Cognition 12(4), 691–701 (2005)
12. Ling, H.B., Jacobs, D.W.: Deformation invariant image matching. In: ICCV (2005)
13. Ling, H.B., Jacobs, D.W.: Shape classification using the inner-distance. IEEE Trans. on Pattern Analysis and Machine Intelligence 29(2), 286–299 (2007)
14. Huang, Y.Z., Huang, K.Q., Tao, D.C., Wang, L.S., Tan, T.N., Li, X.L.: Enhanced biological inspired model. In: CVPR (2008)
15. Tenenbaum, J.B., Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(22), 2319–2323 (2000)
16. http://cbcl.mit.edu/software-datasets/streetscenes
17. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE Trans. on Pattern Analysis and Machine Intelligence 29(3) (2007)
18. Bileschi, S., Wolf, L.: Image representations beyond histograms of gradients: The role of gestalt descriptors. In: CVPR (2007)
19. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
20. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with boosting. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(3), 416–431 (2006)
21. Leordeanu, M., Hebert, M., Sukthankar, R.: Beyond local appearance: Category recognition from pairwise interactions of simple features. In: CVPR (2007)
22. Ling, H., Soatto, S.: Proximity distribution kernels for geometric context in category recognition. In: ICCV (2007)
Multilevel Algebraic Invariants Extraction by Incremental Fitting Scheme Bo Zheng1 , Jun Takamatsu2 , and Katsushi Ikeuchi1 1
Institute of Industrial Science, The University of Tokyo, Komaba 4-6-1, Meguro-ku, Tokyo, 153-8505 Japan {zheng,ki}@cvl.iis.u-tokyo.ac.jp 2 Nara Institute of Science and Technology (NAIST), 8916-5 Takayama, Ikoma, NARA 630-0192 Japan
[email protected]
Abstract. Algebraic invariants extracted from the coefficients of implicit polynomials (IPs) have been attractive because of their convenience for solving recognition problems in computer vision. However, traditional IP fitting methods fix the polynomial degree and thus make it difficult to obtain invariants appropriate to the complexity of an object. In this paper, we propose a multilevel method for invariant extraction based on an incremental fitting scheme. Because this fitting scheme incrementally determines the IP coefficients of different degrees, we can extract invariants from the different degree forms of the IP coefficients during the incremental procedure. Our method is effective not only because it adaptively encodes appropriate invariants for different shapes, but also because it encodes information evaluating the contribution of each degree's invariant set to the shape representation, so as to obtain better discriminability. Experimental results demonstrate the effectiveness of our method compared with prior methods.
1 Introduction
The features of shapes encoded by Implicit Polynomials (IPs) play an essential role in computer vision for solving problems of recognition [1,2,3,4,5,6,7], registration [8,2,9,10,11] and matching [3,2,12,10], because IPs offer fast fitting, algebraic/geometric invariants, few parameters, and robustness against noise and occlusion. There were great improvements concerning IPs with their increased use during the late 1980s and early 1990s [3,1]; recently, new robust and consistent fitting methods such as 3L fitting [13], gradient-one fitting [14], Rigid regression [14,15], and incremental fitting [16] have made them feasible for real-time object recognition tasks. The main advantage of implicit polynomials for recognition is the existence of algebraic/geometric invariants, which are functions of the polynomial coefficients that do not change under a coordinate transformation. The algebraic invariants found by Taubin and Cooper [8], Teral and Cooper [2], Keren [1] and Unsalan [17] are global invariants and are expressed as simple explicit functions of the coefficients. Another set of invariants that has been mentioned by
Wolovich et al. is derived from covariant conic decompositions of implicit polynomials [7]. However, these invariant extraction methods are based on the traditional fitting scheme, which fixes the degree of the IP in the fitting procedure regardless of the complexity of the shape. From the viewpoint of accurate shape description, there is no doubt that a complex object requires more invariants, encoded by a higher degree IP, whereas a simple object needs fewer invariants, encoded by a lower degree IP. Therefore, using invariants encoded by IPs of the same degree unfortunately limits recognition accuracy, especially when dealing with a large database holding various shapes. Our previous work proposed an incremental scheme for IP fitting that can adaptively determine the degree of the IP to suit various shapes [16]. This makes it possible to extract invariants from IPs of different degrees according to the different complexity levels of shapes, which clearly enhances recognition accuracy. In this paper, we first extend our incremental scheme to make it suitable for invariant extraction. We then propose a multilevel invariant extraction method based on the combination of Taubin's invariants, degree information, and fitting accuracy information. The reported experimental results demonstrate the effectiveness of our method, which extracts richer invariants than prior methods. The rest of this paper is organized as follows. Section 2 reviews recent advances in IP modeling and introduces the incremental framework [16]. Section 3 presents our multilevel extraction method based on a form-by-form incremental scheme and a practical method of combining the degree and fitting accuracy information. Section 4 reports our experimental results, followed by the conclusion in Section 5.
2 Incremental Scheme of Implicit Polynomial Fitting
2.1 Implicit Polynomial
An IP is an implicit function defined in multivariate polynomial form. For example, the 3D IP of degree n is denoted by

f_n(\mathbf{x}) = \sum_{0 \le i, j, k;\; i + j + k \le n} a_{ijk} x^i y^j z^k = \sum_l a_l m_l(\mathbf{x}),   (1)

where \mathbf{x} = (x \; y \; z) is a 3D data point and m_l(\mathbf{x}) = x^i y^j z^k is the monomial function with accompanying coefficient a_l. Note that the relationship between the indices l and \{i, j, k\} is determined by the inverse lexicographical order (see Tab. 1 in [16]). The homogeneous polynomial of degree r, h_r = \sum_{i+j+k=r} a_{ijk} x^i y^j z^k, is called the r-th degree form of the IP, and the highest (n-th) degree form is called the leading form. If h_r is described as h_r = \mathbf{p}_r^T \mathbf{m}_r, where \mathbf{p}_r is the coefficient vector of the r-th form, \mathbf{p}_r = (a_{r,0,0}, a_{r-1,1,0}, \ldots, a_{0,0,r})^T, and \mathbf{m}_r is the
corresponding monomial vector, \mathbf{m}_r = (x^r, x^{r-1} y, \ldots, z^r)^T, then the polynomial f_n can be described as

f_n = \sum_{r=0}^{n} h_r = \sum_{r=0}^{n} \mathbf{p}_r^T \mathbf{m}_r.   (2)

2.2 Least-Squares Fitting
In general, building an IP model can be regarded as a regression problem, which approximates a multivariate IP function from scattered data. This is typically formulated as the following minimization problem:

\min_f \sum_{i=1}^{N} \mathrm{dist}(b_i, f(\mathbf{x}_i)),   (3)
where dist(·, ·) is a distance function; the common choice is the L2 norm, dist(b_i, f(\mathbf{x}_i)) = (b_i - f(\mathbf{x}_i))^2. Here b_i is the offset term for implicit polynomial fitting, which can be determined by different optimization constraints, e.g., by the 3L method [13]. Substituting the representation of Eq. (1) into Eq. (3), we minimize the convex differentiable objective function E(\mathbf{a}):

\min_{\mathbf{a} \in \mathbb{R}^m} E(\mathbf{a}) = \min_{\mathbf{a} \in \mathbb{R}^m} \|M\mathbf{a} - \mathbf{b}\|^2 = \min_{\mathbf{a} \in \mathbb{R}^m} \big(\mathbf{a}^T M^T M \mathbf{a} - 2\mathbf{a}^T M^T \mathbf{b} + \mathbf{b}^T \mathbf{b}\big),   (4)
where M is the matrix whose l-th column \mathbf{m}_l is (m_l(\mathbf{x}_1), m_l(\mathbf{x}_2), \ldots, m_l(\mathbf{x}_N))^T (see Eq. (1)), \mathbf{a} is the unknown coefficient vector of the IP, and \mathbf{b} is the offset term vector. The derivatives of E(\mathbf{a}) with respect to the variable \mathbf{a} vanish for optimality:

\frac{\partial E}{\partial \mathbf{a}} = 2 M^T M \mathbf{a} - 2 M^T \mathbf{b} = 0,   (5)
which leads to the following linear system of equations to be solved:

M^T M \mathbf{a} = M^T \mathbf{b}.   (6)
Because solving the linear system (6) requires a fixed IP degree to be assigned in order to construct the inverse of the coefficient matrix M^T M, none of the prior methods allow changing the degree during the fitting procedure.
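A minimal sketch of this degree-fixed baseline of Eqs. (3)-(6) follows: the monomial matrix M is built for all monomials with i + j + k <= n and the least-squares problem is solved directly (equivalent to the normal equations of Eq. (6), via a numerically safer routine). Two assumptions of the sketch: the offset vector b is taken as given (e.g., produced by a 3L-style construction [13]), and the monomial ordering is arbitrary rather than the inverse lexicographical order used in the paper.

```python
import numpy as np
from itertools import product

def monomials(n):
    """Exponent triples (i, j, k) with i + j + k <= n (one per IP coefficient)."""
    return [(i, j, k) for i, j, k in product(range(n + 1), repeat=3) if i + j + k <= n]

def monomial_matrix(X, n):
    """Rows: data points x = (x, y, z); columns: monomials x^i y^j z^k (Eq. 1)."""
    return np.column_stack([X[:, 0]**i * X[:, 1]**j * X[:, 2]**k
                            for i, j, k in monomials(n)])

def fit_ip_fixed_degree(X, b, n):
    """Degree-fixed least-squares fit of the IP coefficient vector a."""
    M = monomial_matrix(X, n)
    a, *_ = np.linalg.lstsq(M, b, rcond=None)
    return a
```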
2.3 Column-by-Column Incremental Fitting Process
For designing an incremental scheme [16], Eq. (6) is solved by the QR decomposition method. That is, the matrix M \in \mathbb{R}^{N \times m} is decomposed as M = QR (Q \in \mathbb{R}^{N \times m} and R \in \mathbb{R}^{m \times m}), where Q satisfies Q^T Q = I (I is an m \times m identity matrix) and R is an upper triangular matrix. Then, substituting M = QR
Fig. 1. The incremental scheme: dimension of the upper triangular linear system grows at each step and only the elements shown in dark gray need to be calculated.
into Eq. (6), we obtain a linear system with an upper-triangular coefficient matrix R:

R^T Q^T Q R \mathbf{a} = R^T Q^T \mathbf{b} \;\;\Rightarrow\;\; R \mathbf{a} = Q^T \mathbf{b} \;\;\Rightarrow\;\; R \mathbf{a} = \tilde{\mathbf{b}}.   (7)
In order to adapt the fitting so that the coefficients are determined automatically according to shape complexity, our previous work proposed a column-by-column incremental scheme that applies QR decomposition to each dimensionally growing matrix M; the upper-triangular linear system (7) is then solved incrementally until the desired fitting accuracy is reached. This process can be generated by the Gram-Schmidt process, by continuously orthogonalizing the incoming column vector \mathbf{m}_l. The process is illustrated in Fig. 1, where the dimension of the upper triangular linear system increases, so the coefficient vector \mathbf{a} of growing dimension can be obtained at any step by solving the triangular linear system at little computational cost. The incremental scheme works efficiently because the calculation for the dimension increment between two successive steps is cheap: as Fig. 1 illustrates, only the right-most column of R_l and the last element of \tilde{\mathbf{b}}_l need to be calculated, which are shown with dark gray blocks.
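The column-by-column update can be sketched with a classical Gram-Schmidt step: each incoming monomial column is orthogonalized against the columns already processed, which appends one column to Q and R and one entry to b_tilde = Q^T b; the coefficients at the current size then follow from the triangular system of Eq. (7). This is only an illustration of the update pattern under the assumption that every new column is linearly independent, not the authors' implementation.

```python
import numpy as np

class IncrementalQR:
    """Sketch of the column-by-column update: grow Q, R and b_tilde = Q^T b."""
    def __init__(self, b):
        self.b = b
        self.Q = np.empty((len(b), 0))
        self.R = np.empty((0, 0))
        self.b_tilde = np.empty(0)

    def add_column(self, m):
        """Orthogonalize a new monomial column m against the current Q."""
        r = self.Q.T @ m                 # projections on previous columns
        q = m - self.Q @ r
        r_kk = np.linalg.norm(q)         # assumes m is linearly independent
        q = q / r_kk
        k = self.R.shape[0]
        R_new = np.zeros((k + 1, k + 1))
        R_new[:k, :k] = self.R
        R_new[:k, k] = r
        R_new[k, k] = r_kk
        self.R = R_new
        self.Q = np.column_stack([self.Q, q])
        self.b_tilde = np.append(self.b_tilde, q @ self.b)

    def coefficients(self):
        """Current IP coefficients a from the triangular system R a = b_tilde (Eq. 7)."""
        return np.linalg.solve(self.R, self.b_tilde)
```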
3 Multilevel Invariants Extraction
3.1 Form-by-Form Incremental Scheme
194
B. Zheng, J. Takamatsu, and K. Ikeuchi Rr −1
L
~ br −1
ar −1
Rr
= Sr−1
pr −1
~ br
ar
Rr +1
=
cr −1
pr
Sr
=
r-th step
L
cr
Sr +1 (r-1)-th step
~ br +1
ar +1
pr +1
cr +1
(r+1)-th step
Fig. 2. The from-by-form incremental scheme: dimension of the upper triangular linear system grows by one form at each step and the elements shown in dark gray need to be calculated.
Fig. 3. Top: original objects in different noise levels or with missing data; Bottom: IP fits of degree 4, 4, 8 and 8 respectively
the Gram-Schmidt process [16], the only difference is that we stop the iteration at the step when an integral form of IP can be obtained. The reason why we need do this is that the form-by-form incremental scheme guarantees that the fitting can be in the Euclidean invariant way [14] and thus we can extract the Euclidean invariants from the obtained forms of IP as described in following section. Fig. 3 shows some fitting examples of our method. We can see that incremental fitting scheme can not only determine the degree for different shapes, but also keep the characteristic of IP, the robustness against noise and missing data. 3.2
Extracting Invariants by Incremental Scheme
A simple but practical method proposed by Taubin and Cooper [8] shows us that invariants of an IP can be extracted from a specific coefficient form, which is introduced briefly in Appendix A. Taubin and Cooper also showed that invariants obtained from the leading form are practically powerful in their recognition examples in [8]. Combining their method, our fitting method can be modified as a multilevel method for invariants extraction.
Multilevel Algebraic Invariants Extraction by Incremental Fitting Scheme
195
Fig. 2 also shows the fact that, at each step, we can obtain a triangular linear sub-system of equations: Sr pr = cr ,
(8)
where Sr is the right-lower triangular sub-matrix of Rr , and cr is the lower sub r ; then by solving this linear sub-system we can obtain the coefficient vector of b vector pr corresponding to the r-th form of IP. Note, since pr is always the highest degree form of the IP at current step, we call each pr the leading form. Thus, we can design our descriptors to be a set of leading forms’ invariants in incremental degrees. We extract the invariants Ir from pr , the r-th leading form, by using the function of Eq. (17) in Appendix. If we arrange them into a vector I as: I = {I1 , I2 , . . . , In }, then we take I as our shape descriptor. Note, for different objects descriptors of I do not have to keep in the same dimension, e.g., two descriptors can be compared by only using the corresponding elements. 3.3
Weighting Function
For further enhancing discriminability, let us present a weighting method that determines a weight wr for the set of invariants Ir to evaluate how it is important to represent the shape. Then the invariant I can be modified as: I = {w1 I1 , w2 I2 , . . . , wn In }.
(9)
An effective way to evaluate how well the shape is represented is to use the least-squares error E defined in Eq. (4). Therefore, in the process shown in Fig. 2, if we can find out how much the error decreases at the current step, we know how important the current coefficient set $p_r$ is to the shape representation, and thus how important the current invariant set $I_r$ is within the whole invariant vector $I$. Fortunately, with our incremental scheme, this information is obtained easily from the following calculation:

$$ E(a_{r-1}) - E(a_r) = \|\tilde{b}_r\|^2 - \|\tilde{b}_{r-1}\|^2 = \|c_r\|^2. \qquad (10) $$
We obtained Eq. (10) because $E(a_r)$ can be calculated as

$$
\begin{aligned}
E(a_r) &= a_r^T M_r^T M_r a_r - 2 a_r^T M_r^T b + b^T b \\
       &= a_r^T R_r^T Q_r^T Q_r R_r a_r - 2 a_r^T R_r^T Q_r^T b + b^T b \\
       &= (R_r a_r)^T (R_r a_r) - 2 (R_r a_r)^T (Q_r^T b) + b^T b \\
       &= \tilde{b}_r^T \tilde{b}_r - 2 \tilde{b}_r^T \tilde{b}_r + b^T b \\
       &= -\|\tilde{b}_r\|^2 + \|b\|^2, \qquad (11)
\end{aligned}
$$
where $M_r$ is the matrix holding the monomial columns, and $R_r$, $a_r$ and $\tilde{b}_r$ are respectively the upper triangular matrix, the parameter vector, and the right-hand vector at the r-th step (see Fig. 2).
Fig. 4. Top row: original 2D objects; Second row: extracted invariants {I2 , I3 , . . . , I8 } for each object; Third row: weights {w2 , w3 , . . . , w8 }; Bottom row: weighted invariants {w2 I2 , w3 I3 , . . . , w8 I8 }.
From Eq. (10), we can see that the larger $\|c_r\|^2$ is, the more the least-squares error decreases, and thus the more the current step contributes to the overall fitting precision. Therefore, the corresponding invariant vector $I_r$ is weighted by $w_r$ in Eq. (9) as

$$ w_r = \|c_r\|^2 / \|\tilde{b}\|^2, \qquad (12) $$
where we normalize the weights by the right-hand vector $\tilde{b}$ obtained at the final step. Fig. 4 illustrates the above process by encoding the invariants of five samples.

3.4 Dissimilarity Evaluation
For practical object recognition, three factors are considered for accurate shape discrimination: the weighted invariants, the determined fitting degree, and the final fitting accuracy. Thus, a more robust dissimilarity measure for discriminating two objects $O_1$ and $O_2$ can be written as

$$ Dis(O_1, O_2) = \alpha \|I_1 - I_2\| + \beta |D_1 - D_2| + \gamma |E_1 - E_2|, \qquad (13) $$
where $I_1$ and $I_2$ are the invariant vectors defined in Eq. (9), $D_1$ and $D_2$ are the obtained fitting degrees, and $E_1$ and $E_2$ are the final fitting errors defined by Zheng et al. [16], for the two objects respectively. The three terms are balanced by the weighting parameters α, β and γ.
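A minimal sketch of how this weighted multilevel comparison could be implemented is given below. It assumes the per-form invariants and weights have already been computed as in Eqs. (9) and (12); the function name, the truncation of the longer descriptor to the corresponding elements, and the default parameter values (taken from one setting in Table 1) are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def dissimilarity(inv1, inv2, D1, D2, E1, E2, alpha=1.0, beta=0.1, gamma=0.0):
    """Eq. (13): combine weighted invariants, fitting degrees and fitting errors.

    inv1, inv2 : lists of 1-D arrays, the weighted invariants w_r * I_r per form.
    Descriptors may have different lengths; only corresponding elements are used.
    """
    v1 = np.concatenate(inv1)
    v2 = np.concatenate(inv2)
    n = min(v1.size, v2.size)                 # compare corresponding elements only
    return (alpha * np.linalg.norm(v1[:n] - v2[:n])
            + beta * abs(D1 - D2)
            + gamma * abs(E1 - E2))
```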
4 Experimental Results

4.1 Experiment on Object Recognition
In this experiment on object recognition, we adopted the 2D shape database LEMS216, consisting of 216 samples divided into 16 categories [18]. Since this database is too small for examining effectiveness, we extended LEMS216 to a larger database of 2160 samples by adding 9 variations of each original sample, generated by adding noise, removing data, and applying random Euclidean transformations. Some selected original and generated samples are shown in Fig. 5. Note that all shape data sets are regularized by centering each data set at its center of mass and scaling it so that the average distance from the points to the origin becomes one (each point is divided by this average length). We generated another 100 samples by a similar operation as the test set, and then searched for their matches in the database. We compared three methods: two degree-fixed fitting methods, the 3L method [13] and the Ridge Regression (RR) method [14], and our method; the same parameters were employed throughout for the degree-fixed methods (ε = 0.05 for the 3L method and κ = 10^-4 for the RR method). For the degree-fixed fitting methods, the invariants were extracted from the leading form of a degree-fixed IP using Taubin and Cooper's method described in the Appendix. Table 1 compares the recognition rates obtained by changing the degree for the fixed-degree methods and by setting different parameters of Eq. (13) for our method.
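The regularization step described above is straightforward; a small sketch under the stated convention (center of mass at the origin, unit average point-to-origin distance) follows. The function name is ours.

```python
import numpy as np

def regularize_shape(points):
    """Center a point set at its center of mass and scale it so that the
    average distance from the points to the origin becomes one."""
    points = np.asarray(points, dtype=float)   # shape (N, 2) or (N, 3)
    centered = points - points.mean(axis=0)
    avg_len = np.linalg.norm(centered, axis=1).mean()
    return centered / avg_len
```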
Fig. 5. Top 3 rows: original samples in LEMS216. Bottom row: randomly generated samples by adding noise and missing data and operated by Euclidean transformation
Table 1. The recognition rate comparison for the degree-fixed fitting methods and our method

degree                           4          6          8          10
3L method recognition rate       82.1%      87.7%      86.4%      81.3%
RR method recognition rate       81.1%      91.0%      92.1%      85.3%
parameters (α, β, γ)             1, 0, 0    1, 0.1, 0  1, 0, 0.2  1, 0.1, 0.2
our method recognition rate      92.3%      95.1%      94.3%      97.8%

4.2 Experiment on Action Recognition
Blank et al. [19] proposed a method for human action recognition that models human actions from video data. For this application, we use IPs to model the actions defined by the continuous silhouettes extracted from video frames (see Fig. 6). IPs can model the global features of objects, which makes fast action recognition feasible. We detected three actions, "walking", "running" and "jumping", in ten videos provided by Blank et al. [19]. Our fast invariant extraction brought better performance: the overall processing time (incremental modeling plus invariant extraction) for each detection is about 2.5 seconds, compared with the 30 seconds reported in [19].
Fig. 6. Left: video frames for silhouette extraction. Others: extracted 3D models of "walking", "running" and "jumping" from three videos.
5 Conclusions
We presented a method that extracts algebraic invariants in a multilevel manner by a form-by-form incremental IP fitting scheme. The quality of the invariants is demonstrated by the adaptive extraction according to the complexity of the shapes. Furthermore, the performance is enhanced by combining a weighting method with the information on fitting degree and fitting accuracy. The method can cope with 2D/3D recognition and has potential for real-time applications.
Acknowledgment This work was partially supported by Canon Inc. We gratefully acknowledge the data provided by Lena Gorelick et al. [19] for action recognition.
Appendix: Taubin and Cooper's Invariants

Let us briefly describe how to extract the invariant vector $I_r$ from the r-th form vector $p_r$ once it has been solved by Eq. (8). Taubin and Cooper [8] proposed a simple symbolic computation to achieve this, which can be described as follows. Let the coefficient $a_{ijk}$ in $p_r$ (see also Eq. (1)) be represented as $\frac{\Phi_{ijk}}{i!j!k!}$, and let the coefficient vector of the r-th form of a polynomial be represented as

$$ \Phi_{[r]} = \left( \frac{\Phi_{r00}}{\sqrt{r!0!0!}}, \frac{\Phi_{r-1,1,0}}{\sqrt{(r-1)!1!0!}}, \ldots, \frac{\Phi_{00r}}{\sqrt{0!0!r!}} \right)^T. $$

Then there exists a matrix $\Phi_{[s,t]}$ whose singular values are invariant under Euclidean transformation, constructed in a symbolic computational manner:

$$ \Phi_{[s,t]} = \Phi_{[s]} \odot \Phi_{[t]}^T, \qquad (14) $$
where $\odot$ denotes the classic matrix multiplication, with the difference that the individual element-wise multiplications are performed according to the rule $\frac{\Phi_{ijk}}{\sqrt{i!j!k!}} \cdot \frac{\Phi_{abc}}{\sqrt{a!b!c!}} = \frac{\Phi_{i+a,j+b,k+c}}{\sqrt{i!j!k!\,a!b!c!}}$. For example, if

$$ \Phi_{[1]} = \left( \frac{\Phi_{100}}{\sqrt{1!0!0!}}, \frac{\Phi_{010}}{\sqrt{0!1!0!}}, \frac{\Phi_{001}}{\sqrt{0!0!1!}} \right)^T, \qquad (15) $$

then

$$ \Phi_{[1,1]} = \Phi_{[1]} \odot \Phi_{[1]}^T
= \begin{pmatrix} \Phi_{200} & \Phi_{110} & \Phi_{101} \\ \Phi_{110} & \Phi_{020} & \Phi_{011} \\ \Phi_{101} & \Phi_{011} & \Phi_{002} \end{pmatrix}
= \begin{pmatrix} 2a_{200} & a_{110} & a_{101} \\ a_{110} & 2a_{020} & a_{011} \\ a_{101} & a_{011} & 2a_{002} \end{pmatrix}. \qquad (16) $$
Then the eigenvalues (singular values, for a non-square matrix) of $\Phi_{[1,1]}$ are invariant under Euclidean transformation. We can also see that the matrix $\Phi_{[1,1]}$ is constructed using only the coefficients $p_2 = \{a_{200}, a_{110}, a_{101}, a_{020}, a_{011}, a_{002}\}$ of the 2nd form. Therefore, if the r-th form of coefficients $p_r$ is given, we can construct a matrix $\Phi_{[s,t]}$ with $s + t = r$ and compute its singular values by SVD, which serve as Euclidean invariants. Furthermore, the normalized singular values are also invariant to scaling. For more details and proofs, we refer the reader to [8]. Note that in this paper, for convenience, we only employ the invariants of the r-th degree form that are the eigenvalues of $\Phi_{[r/2,r/2]}$ if r is even, or the singular values of $\Phi_{[(r-1)/2,(r+1)/2]}$ otherwise. We denote this symbolic computation by a function Inv(·):

$$ I_r = Inv(p_r), \qquad (17) $$

where $I_r$ is the returned invariant vector holding the singular values of $\Phi$.
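As an illustration of the second-form case in Eq. (16), the following sketch builds $\Phi_{[1,1]}$ from the six second-form coefficients and returns its eigenvalues as Euclidean invariants. It is a minimal example for r = 2 only, with an illustrative function name, and does not cover the general $\Phi_{[s,t]}$ construction.

```python
import numpy as np

def invariants_second_form(a200, a110, a101, a020, a011, a002):
    """Eigenvalues of Phi_[1,1] built from the 2nd-form coefficients (Eq. (16)).

    The matrix is symmetric, so its eigenvalues coincide with its singular values
    up to sign; they are invariant under Euclidean transformations of the shape.
    """
    phi = np.array([[2 * a200,     a110,     a101],
                    [    a110, 2 * a020,     a011],
                    [    a101,     a011, 2 * a002]])
    return np.sort(np.linalg.eigvalsh(phi))
```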
References 1. Keren, D.: Using symbolic computation to find algebraic invariants. IEEE Trans. on Patt. Anal. Mach. Intell. 16(11), 1143–1149 (1994) 2. Tarel, J., Cooper, D.: The Complex Representation of Algebraic Curves and Its Simple Exploitation for Pose Estimation and Invariant Recognition. IEEE Trans. on Patt. Anal. Mach. Intell. 22(7), 663–674 (2000)
3. Taubin, G.: Estimation of Planar Curves, Surfaces and Nonplanar Space Curves Defined by Implicit Equations with Applications to Edge and Range Image Segmentation. IEEE Trans. on Patt. Anal. Mach. Intell. 13(11), 1115–1138 (1991) 4. Subrahmonia, J., Cooper, D., Keren, D.: Practical reliable bayesian recognition of 2D and 3D objects using implicit polynomials and algebraic invariants. IEEE Trans. on Patt. Anal. Mach. Intell. 18(5), 505–519 (1996) 5. Oden, C., Ercil, A., Buke, B.: Combining implicit polynomials and geometric features for hand recognition. Pattern Recognition Letters 24(13), 2145–2152 (2003) 6. Marola, G.: A Technique for Finding the Symmetry Axes of Implicit Polynomial Curves under Perspective Projection. IEEE Trans. on Patt. Anal. Mach. Intell. 27(3) (2005) 7. Wolovich, W.A., Unel, M.: The Determination of Implicit Polynomial Canonical Curves. IEEE Trans. on Patt. Anal. Mach. Intell. 20(10), 1080–1090 (1998) 8. Taubin, G., Cooper, D.: Symbolic and Numerical Computation for Artificial Intelligence. In: Computational Mathematics and Applications, ch. 6. Academic Press, London (1992) 9. Forsyth, D., Mundy, J., Zisserman, A., Coelho, C., Heller, A., Rothwell, C.: Invariant Descriptors for 3D Object Recognition and Pose. IEEE Trans. on Patt. Anal. Mach. Intell. 13(10), 971–992 (1991) 10. Zheng, B., Takamatsu, J., Ikeuchi, K.: 3D Model Segmentation and Representation with Implicit Polynomials. To appear in IEICE Trans. on Information and Systems E91-D(4) (2008) 11. Zheng, B., Ishikawa, R., Oishi, T., Takamatsu, J., Ikeuchi, K.: 6-DOF Pose Estimation from Single Ultrasound Image using 3D IP Models. In: Proc. IEEE Conf. Computer Vision and Patt. Rec (CVPR) Workshop on Ojb. Trac. Class. Beyond Visi. Spec., OTCBVS 2008 (2008) 12. Tarel, J.P., Civi, H., Cooper, D.B.: Pose Estimation of Free-Form 3D Objects without Point Matching using Algebraic Surface Models. In: Proceedings of IEEE Workshop Model Based 3D Image Analysis, pp. 13–21 (1998) 13. Blane, M., Lei, Z.B., Cooper, D.B.: The 3L Algorithm for Fitting Implicit Polynomial Curves and Surfaces to Data. IEEE Trans. on Patt. Anal. Mach. Intell. 22(3), 298–313 (2000) 14. Tasdizen, T., Tarel, J.P., Cooper, D.B.: Improving the Stability of Algebraic Curves for Applications. IEEE Trans. on Imag. Proc. 9(3), 405–416 (2000) 15. Sahin, T., Unel, M.: Fitting Globally Stabilized Algebraic Surfaces to Range Data. In: Proc. IEEE Conf. Int. Conf. on Comp. Visi., vol. 2, pp. 1083–1088 (2005) 16. Zheng, B., Takamatsu, J., Ikeuchi, K.: Adaptively Determining Degrees of Implicit Polynomial Curves and Surfaces. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 289–300. Springer, Heidelberg (2007) 17. Unsalan, C.: A Model Based Approach for Pose Estimation and Rotation Invariant Object Matching. Pattern Recogn. Lett. 28(1), 49–57 (2007) 18. http://www.lems.brown.edu/vision/software/index.html 19. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: Proc. IEEE Conf. Int. Conf. on Comp. Visi., pp. 1395–1402 (2005)
Towards Robust Object Detection: Integrated Background Modeling Based on Spatio-temporal Features

Tatsuya Tanaka(1), Atsushi Shimada(1), Rin-ichiro Taniguchi(1), Takayoshi Yamashita(2), and Daisaku Arita(1,3)

(1) Kyushu University, Fukuoka, Japan
(2) OMRON Corp., Kyoto, Japan
(3) Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
Abstract. We propose a sophisticated method for background modeling based on spatio-temporal features. It consists of three complementary approaches: pixel-level, region-level and frame-level background modeling. The pixel-level background model approximates the background with a probability density function (PDF), which is estimated non-parametrically using Parzen density estimation. The region-level model is based on the evaluation of the local texture around each pixel while reducing the effects of variations in lighting. The frame-level model detects sudden, global changes of the image brightness and estimates the present background image from the input image by referring to a background model image. Objects are then extracted by background subtraction. Fusing these approaches realizes robust object detection under varying illumination, which is shown in several experiments.
1 Introduction
Background subtraction has often been applied to the detection of objects in images. It is quite useful because we can obtain object regions without prior information about the objects, simply by subtracting a background image from an observed image. However, simple background subtraction often detects not only objects but also many noise regions, because it is quite sensitive to small illumination changes caused by moving clouds, swaying tree leaves, water ripples glittering in the light, etc. To handle these background changes, many techniques have been proposed [1,2,3,4,5,6,7,8,9]. In general, they are classified into three categories: pixel-level, region-level and frame-level background modeling.

Pixel-level modeling. In most pixel-level background modelings, the distribution of each pixel value is described in a probabilistic framework [8,1]. A typical method is Elgammal's, where the probability density function is estimated by Parzen density estimation in a non-parametric form. Probabilistic methods construct the background model by referring to observed images in the past,
and, therefore, they are effective for repetitive brightness changes due to fluctuation of illumination, swaying tree leaves, etc. However, they cannot correctly handle sudden illumination changes, which are not observed in the previous frames.

Region-level modeling. Methods based on Radial Reach Correlation (RRC) and Local Binary Patterns (LBP) are typical region-level, or texture-level, background modelings [4,10]. In these methods, local texture information around a pixel is described in terms of the magnitude relation between the pixel and its peripheral pixels. Usually, this magnitude relation does not change even if the illumination condition changes, and thus region-level background modeling often gives more robust results than pixel-level modeling. However, local changes of brightness, such as those due to swaying tree leaves, cannot be handled correctly.

Frame-level modeling. Fukui et al. have proposed a method to estimate the current background image from the modeled background image [11]. In their method, background candidate regions are extracted from the current image and the background generation function is estimated by referring to the brightness of the candidate regions. The generation function is estimated under the assumption that the illumination changes uniformly over the image, and therefore non-uniform illumination changes cannot be handled. In addition, the modeled background image is acquired in advance, so unpredicted changes of the background cannot be handled either.

As mentioned above, each approach has merits and demerits depending on the assumed characteristics of the background and the illumination. Therefore, to achieve more robust object detection, or to acquire a more effective background model, we should adaptively combine background modelings having different characteristics. In this paper, we propose an integrated background modeling combining the pixel-level, region-level and frame-level background modelings.
2 Basic Background Modeling
Before describing the integrated modeling, we discuss improvements to the three basic modelings, which provide better performance of the integrated modeling.

2.1 Pixel-Level Background Modeling
For pixel-level background modeling, probabilistic modeling of the pixel value is the most popular approach, in which fast and accurate estimation of the probability density function (PDF) of each pixel value is quite important. Parzen density estimation is quite effective for estimating the PDF. However, to acquire an accurate estimate, a sufficient number of samples is required, and a popular method [1] requires a large amount of computation, proportional to the number of samples. Therefore, it can hardly be applied to real-time processing. To solve this
problem, we have designed a fast algorithm for PDF estimation. In our algorithm, we use a rectangular function as the kernel, instead of the Gaussian function that is often used in Parzen density estimation, and we have designed an incremental updating calculation for the PDF that requires constant computation time, independent of the number of samples. This enables us to realize relatively robust real-time background modeling. Please refer to [12] for the detailed algorithm.
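The following is a minimal sketch of the idea of incrementally maintaining a Parzen-style PDF with a rectangular (box) kernel: each new sample adds a constant amount of mass to a fixed range of intensity bins and the oldest sample removes it, so the per-frame cost does not depend on the number of stored samples. The exact bookkeeping of the authors' algorithm is in [12]; the class below is our own simplified illustration, not their implementation.

```python
import numpy as np
from collections import deque

class BoxKernelPixelModel:
    """Per-pixel background PDF with a rectangular kernel, updated incrementally."""

    def __init__(self, n_samples=500, kernel_width=9, n_bins=256):
        self.half = kernel_width // 2
        self.n_samples = n_samples
        self.samples = deque()
        self.pdf = np.zeros(n_bins)              # un-normalized density over intensities

    def _add_mass(self, value, sign):
        lo = max(0, value - self.half)
        hi = min(self.pdf.size - 1, value + self.half)
        self.pdf[lo:hi + 1] += sign              # constant-time box-kernel update

    def update(self, value):
        self.samples.append(value)
        self._add_mass(value, +1.0)
        if len(self.samples) > self.n_samples:   # drop the oldest sample
            self._add_mass(self.samples.popleft(), -1.0)

    def probability(self, value):
        total = self.pdf.sum()
        return self.pdf[value] / total if total > 0 else 0.0
```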
2.2 Region-Level Background Modeling
To realize robust region-level background modeling, we have improved Radial Reach Correlation (RRC) [4] so that the background model is updated according to the background changes in the input image frames.

Radial Reach Correlation (RRC). Each pixel is judged as either foreground or background based on Radial Reach Correlation (RRC) [4], which is defined to evaluate local texture similarity without suffering from illumination changes. RRC is calculated at each pixel (x, y). First, pixels whose brightness difference from the center pixel (x, y) exceeds a threshold are searched for along radial reach extensions in 8 directions around the pixel (x, y); the 8 pixels found are called peripheral pixels hereafter. Then, the signs of the brightness differences (positive or negative) of the 8 pairs, each consisting of one peripheral pixel and the center pixel (x, y), are represented as a binary code. The basic idea is that this binary code represents intrinsic information about the local texture around the pixel, and that it does not change under illumination changes. To make this idea concrete, the correlation between the binary codes extracted from the observed image and from the reference background image is calculated to evaluate their similarity. Suppose the position of a pixel is represented by a vector p = (x, y), and the directional vectors of the radial reach extensions are $d_0 = (1, 0)^T$, $d_1 = (1, 1)^T$, $d_2 = (0, 1)^T$, $d_3 = (-1, 1)^T$, $d_4 = (-1, 0)^T$, $d_5 = (-1, -1)^T$, $d_6 = (0, -1)^T$, and $d_7 = (1, -1)^T$. The reaches $\{r_k\}_{k=0}^7$ for these directions are defined with respect to the reference image f (the background image here) as

$$ r_k = \min\{\, r \mid |f(p + r d_k) - f(p)| \ge T_P \,\}, \qquad (1) $$

where f(p) denotes the pixel value at position p in the image f, and $T_P$ is the threshold for detecting a pixel with different brightness. Based on the brightness difference between the center pixel and the peripheral pixels defined by Eq. (1), the incremental (polarity) encoding of the brightness distribution around the pixel in the background image f is given by

$$ b_k(p) = \begin{cases} 1 & \text{if } f(p + r_k d_k) \ge f(p) \\ 0 & \text{otherwise,} \end{cases} \qquad (2) $$
where k = 0, 1, . . . , 7. In the same manner, the incremental encoding is calculated for the input image g, except that the reaches $\{r_k\}_{k=0}^7$ are those established in the background image f, not in the input image g:

$$ b'_k(p) = \begin{cases} 1 & \text{if } g(p + r_k d_k) \ge g(p) \\ 0 & \text{otherwise.} \end{cases} \qquad (3) $$
7
{bk (p) · bk (p) + bk (p) · bk (p)}
(4)
k=0
where x = 1−x represents the inversion of a bit x. B(p) represents the similarity, or correlation value, of the brightness distribution around the pixel p in the two images, and it is called Radial Reach Correlation (RRC). Since RRC between an input image pixel and its corresponding background image pixel represents their similarity, it can be used as a measure to detect foreground pixels. That is, pixels whose RRC are smaller than a certain threshold TB can be judged as foreground pixels. Background modeling based on Adaptive RRC Using RRC, the similarity between incremental encodings of a background image pixel and its corresponding pixel in the observed image is calculated, and pixels which are not “similar” to their corresponding pixels in the background image are detected as foreground pixels. In principle, if the background does not change, we can prepare adequate encodings of the background image in advance. However, usually, due to the illumination changes and various noises, it is almost impossible to prepare them. Even if we manage to prepare such fixed background encodings, accurate results can not be acquire, and, therefore, we should update the background encodings properly. Here, we have developed a mechanism to update the background encodings according to the following formula: t t bt+1 k (p) = (1 − α) · bk (p) + α · bk (p)
(5)
btk (p)(k
where = 0, 1, . . . , 7) represents the incremental encoding of a pixel p at time t. α is a learning rate, and when it is large enough the above encoding can be quickly adapted to the current input image, i.e., adapted to sudden background changes. The range of btk (p) is [0, 1], and when btk (p) is close to either 0 or 1, it means that the magnitude relation of brightness between the center pixel and its peripheral pixel does not change. Otherwise, i.e., if btk (p) is close to 0.5, the magnitude relation is not stable. According to this consideration, a peripheral pixel is sought again when Tr ≤ btk (p) ≤ 1 − T r holds. Tr is a threshold value to invoke re-searching of peripheral pixels Then, we re-define the similarity of the incremental encodings as follows: B t (p) =
7 k=0
|btk (p) − bt k (p)|
(6)
Towards Robust Object Detection: Integrated Background Modeling
205
Similarity or dissimilarity is judged by comparing the similarity value with a threshold TB . Detailed procedure of background modeling is summarized as follows: Step1. The incremental encodings of the current frame g are calculated, and foreground pixels are discriminated from background pixels according to the similarity (defined in (6)) of the encodings of the input image pixels and those of the background pixels. Step2. The incremental encodings of the background pixels are updated according to (5). Step3. When Tr ≤ btk (p) ≤ 1 − T r becomes to hold, its peripheral pixel is sought again in the current frame, and btk (p) is re-initialized using the newly found peripheral pixel. 2.3
Frame-Level Background Modeling
Frame-level background modeling proposed by Fukui et al[11], which is based on brightness normalization of a model background image, is designed to be robust against sudden illumination changes. They assume that the illumination changes occur uniformly in the entire image, and, therefore, in principle, their method can not detect objects robustly under non-uniform illumination changes. In addition, since the background image is prepared in advance, it can not handle unexpected background changes as well. In our method, on the contrary, the brightness of the model background image can be normalized even if the illumination changes non-uniformly, and on-line training, by which we can cope with the background changes, is also integrated. Then, objects can be simply detected by subtracting the normalized background image from an observed image. To realize the robust brightness normalization, we have designed a multi-layered perceptron, by which the mapping between the brightness of pixels of the model background image and that of an observed image is established. To reflect the locality of the brightness distribution, the input vector of the perceptron consists of the coordinates (x, y) and the brightness value (R, G, B) of a pixel, and this combination can handle the nonuniform illumination changes. Detailed procedure of the frame-level background modeling is as follows Step1. Learning: the mapping between an input vector, (x, y, R, G, B), of a pixel in a model background image and an output vector, (R , G , B ) of the pixel at the same position in the observed image is learned. (x, y) is the coordinates of the pixel and (R, G, B), (R , G , B ) are the color brightness values of the pixels. To achieve on-line training, we have to acquire training data frequently, which is achieved in the integration process of the three background modelings. Details will be presented in the next section. Step2. Normalization: the brightness of the model background image is normalized using the perceptron learned in Step1, which means that the background image corresponding to the observed image is estimated.
206
T. Tanaka et al.
Step3. Object detection: subtraction of the normalized background image, which is estimated in Step2, from the observed image gives us the object detection result. That is, pixels which have large brightness difference (larger than a given threshold, Tdet ) are detected as foreground pixels, or object pixels.
3
Integrated Background Modeling Based on Spatio-temporal Features
Finally, we present our major contribution, i.e., integrated background modeling based on spatio-temporal features, which is realized by integrating pixel-level, region-level and frame-level background modelings. Its detailed processing flow is as follows. Step1. Objects are detected based on the pixel-level background modeling. Step2. Objects are detected based on the region-level background modeling. Step3. Object detection results of Step1 and Step2 are combined. That is, pixels which are judged as foregrounds by both of the above modelings are judged as foregrounds and other pixels are judged as backgrounds. Then, the parameters of the background models are modified. First, the PDF of the pixel value of the input images, which is maintained in the pixel-level background model, is updated. In addition, when a pixel is judged as a background pixel here, the parameters of region-level background model are modified. Step4. When the brightness difference between the current frame and the previous frame is large at a certain number of pixels, we establish TTL (Time To Live) to the frame-level background model, and TTL represents the duration where the frame-level model is activated. By using TTL, we activate the frame-level model only when the illumination condition suddenly changes. Step5. If TTL > 0, objects are finally detected based on the frame-level background model and TTL is decreased. Otherwise, object detection result acquired in Step3 is adopted. The frame-level background modeling is achieved as follows: (5-1). A model background image is generated so that each pixel has the most frequent pixel value in its PDF. The PDF is maintained in the pixel-level background model. (5-2). Training samples to adjust the model background image are selected from pixels judged as background in Step3. This is because, we have found experimentally that the object detection result of Step3 has little false negatives, and pixels judged as background (called as background candidate pixels (BCPs)) have little misidentification. Therefore, at BCPs, the correspondences of pixel values of the model background image and those of the observed image can become adequate training samples for background normalization. In practice, at each frame, 100∼200 pixels out of BCPs are randomly sampled and used as training samples.
Towards Robust Object Detection: Integrated Background Modeling
207
(5-3). After the multi-layered perceptron is trained at every frame, referring to the training samples acquired in (5-2), the model background image is adjusted using the perceptron, and, the subtraction of the adjusted model background image from the observed image becomes the final object detection result.
4
Experiment
To evaluate the performance of our method, we have used data set of PETS (PETS2001)1 (see Fig.1(a) and Fig.1(b)), which are often used in video-based surveillance, and data set publicly available for evaluation of object detection2 (see Fig.1(c) and Fig.1(d)). Fig.1(a) and Fig.1(b) are outdoor scenes where people are passing through streets, tree leaves are flickering, and the weather conditions change rapidly. Fig.1(c) includes sharp pixel value changes caused by camera aperture changes. Fig.1(d) includes sudden illumination changes caused by turning on/off of the light.
(a) Outdoor scene 1
(b) Outdoor scene 2
(c) Indoor scene 1
(d) Indoor scene 2
Fig. 1. Images Used for Evaluation
4.1
Region-Level Background Modeling
Fig.2 comparatively shows the performance of our adaptive RRC method. Fig.2 (a), (b), (c) show an input image, the result acquired by RRC without researching of peripheral pixels, and the one by our improved RRC with researching, respectively. The figure shows that our improved RRC gives us better result. Green and red marks in the input image show typical examples of peripheral pixels which are sought again: green crosses are center pixels, red dots are their initial peripheral pixels, and green dots are peripheral pixels which 1 2
Benchmark data of International Workshop on Performance Evaluation of Tracking and Surveillance. Available from ftp://pets.rdg.ac.uk/PETS2001/ Several kinds of test images and their ground truth is available from //limu.ait.kyushu-u.ac.jp/dataset/
208
T. Tanaka et al.
(a) Input image
(b) RRC without researching
(c) improved RRC with re-searching
Fig. 2. Effect of our improved RRC
are found in the re-searching process. At these initial peripheral pixels, the incremental encodings are largely changed due to sudden changes of pixel value distribution, and, therefore, without the re-searching process, background pixels are detected as foreground pixels. 4.2
Frame-Level Background Modeling
We have evaluated the performance of background image estimation using test data, an example of which is shown in Fig.3: the left column indicates an observed image, the center column does the model background image, and the right column does hypothetic foreground-candidate regions which are manually established for this experiment. In the observed images, there are non-uniform illumination changes due to turning on/off of the light. We have examined three algorithms: estimation by Fukui et al[11], three layered perceptron which accepts pixel position as its input, and one without the pixel position. Training data given to the algorithms consists of input vectors (X, Y, R, G, B) randomly sampled from background-candidate regions in the observed image and their corresponding output vectors (R , G , B ) at the same pixel positions in the
Fig. 3. Experimental data for background estimation
Towards Robust Object Detection: Integrated Background Modeling
209
Table 1. Result of background image estimation average error std. deviation Fukui[11] NN (R, G, B) NN (X, Y, R, G, B)
9.5 8.5 5.4
13.9 10.4 9.3
model background image. The number of samples here is 100. The perceptron consists of 5 nodes in the input layer (3 nodes when the pixel position is not referred to), 3 nodes in the output layer and 3 nodes in the middle layer. Table 1 shows the accuracy of estimation, i.e., the error between the observed image and the image estimated from the model background image. It indicates that our method, three layered perceptron accepting pixel position information, is better than other methods. This is because, referring to the pixel position information, our method can robustly estimate the background image under non-uniform illumination changes. The remaining problem here is that if the dynamic range of the model background image is rather small (i.e., the model background image is acquired in a dark situation) the estimated one is not very accurate and the object detection performance is slightly degraded. 4.3
Integrated Background Modeling
In this experiment, we have used the following parameters: Pixel-level the width of the rectangular kernel is 9 and the number of samples is 500. See [9] for the details. Region-level Tp = 10, TB = 5.5, Tr = 0.3 in section 2.2
(a) Input image
(b) Most pixel values
probable
(d) Pixel-level background modeling
(e) Pixel and regionlevel modelings
(c) Estimated ground model
back-
(f) Integration of the three modelings
Fig. 4. Effects of integration of different background modelings
Table 2. Comparative results of object detection accuracy (recall / precision)

Method            Outdoor1         Outdoor2         Indoor1          Indoor2
RRC [4]           37.5% / 22.4%    24.8% / 20.7%    35.3% / 51.2%    26.9% / 24.9%
GMM [8]           61.3% / 58.2%    34.9% / 55.0%    38.1% / 59.7%    35.6% / 46.1%
Parzen [9]        56.3% / 51.6%    46.8% / 72.8%    43.0% / 42.0%    37.8% / 58.5%
Proposed method   77.9% / 69.1%    65.5% / 69.7%    73.9% / 92.1%    76.1% / 77.4%
Fig. 4 illustrates the effect of the integration, showing that the three different background modelings complement one another well. We have evaluated the accuracy of object detection in terms of precision and recall (a small sketch of this per-pixel evaluation is given below), comparing the proposed method, i.e., the integration of the three background modelings, with RRC, fast Parzen estimation [9] and the adaptive Gaussian mixture model [8]. Table 2 shows the precision and recall of the detection results acquired with each background modeling. It clearly shows that our proposed method outperforms the methods based on RRC, GMM and Parzen density estimation. Fig. 5 shows an example of their object detection results. Considering the experimental results, we can see the following characteristics:
– RRC can handle illumination changes, but it cannot handle background changes such as those due to moving clouds. This is because RRC employs a fixed reference image(3), and texture information which does not appear in the initial frame cannot be handled correctly; therefore background pixels are incorrectly detected as foreground pixels.
– The methods based on Parzen density estimation and GMM can handle such background changes, but cannot correctly handle sudden illumination changes.
– Our proposed method can detect objects robustly against both background changes and illumination changes.
Finally, we have evaluated our method on the Wallflower dataset, which includes images and ground truth data for various background subtraction issues. Table 3 shows the result, in which the accuracy of the methods other than ours is cited from the Wallflower paper [2].
(3) In this experiment, the incremental encodings for reference are generated from the initial frame.
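For completeness, the per-pixel precision and recall used in Table 2 can be computed from a detected foreground mask and a ground-truth mask as follows; this is a standard computation, shown only as a minimal sketch.

```python
import numpy as np

def precision_recall(detected, ground_truth):
    """Per-pixel precision and recall of a binary foreground mask."""
    detected = detected.astype(bool)
    ground_truth = ground_truth.astype(bool)
    tp = np.logical_and(detected, ground_truth).sum()
    fp = np.logical_and(detected, ~ground_truth).sum()
    fn = np.logical_and(~detected, ground_truth).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```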
Fig. 5. Experimental results: input images, ground truth, and detection results of the proposed method and of GMM

Table 3. Performance evaluation using the Wallflower dataset. Each cell is FN / FP; the methods are listed in the order Mean + Covariance, GMM, Block Correlation, Temporal Derivative, Bayesian Decision, Eigenbackground, Linear Prediction, Wallflower, Proposed Method.

Moved Object:         0/0   0/0   0/1200   0/1563   0/0   0/1065   0/0   0/0   0/0
Time of Day:          949/535   1008/20   1030/135   1151/11842   1018/562   879/16   961/25   961/25   1349/0
Light Switch:         1857/15123   1633/14169   883/2919   752/15331   2380/13439   962/362   1585/13576   947/375   1681/1396
Waving Trees:         3110/357   1323/341   3323/448   2483/259   629/334   1027/2057   931/933   877/1999   198/771
Camouflage:           4101/2040   398/3098   6103/567   1965/3266   1538/2130   350/1548   1119/2439   229/2706   177/342
Bootstrap:            2215/92   1874/217   2638/35   2428/217   2143/2764   304/6129   2025/365   2025/365   1235/199
Foreground Aperture:  3464/1290   2442/530   1172/1230   2049/2861   2511/1974   2441/537   2419/649   320/649   2085/658

FN: False Negative, FP: False Positive.
Although this result indicates that the performance of our method is almost the same as that of Wallflower, our method requires no off-line training, which makes it much more practical than Wallflower, which requires advance learning of background images.
5 Conclusion
In this paper, we have presented an integrated background modeling based on spatio-temporal features and robust object detection based on this modeling. By integrating several background modelings having different characteristics, we can establish a background model that is more robust against a variety of background and illumination changes.
Our future works are summarized as follows:
– Improvement of the estimation accuracy of the background image. The background image estimation in Step 5 of Section 3 should be more accurate. Since this accuracy directly affects the object detection result, accurate estimation is essential. Currently, when brighter images are estimated from the model background image, the accuracy tends to become low.
– Stabilization of the computation time. The computation time required by the frame-level background modeling is larger than that of the other background modelings, partly because the perceptron is trained at every frame. In addition, the time varies depending on the input images. Therefore, to realize a practical online system, reduction and stabilization of the computation time is quite important.
References 1. Elgammal, A., et al.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000) 2. Toyama, K., et al.: Wallflower: Principle and practice of background maintenance. In: Proc. of Int. Conf. on Computer Vision, pp. 255–261 (1999) 3. Li, L., et al.: Statistical Modeling of complex background for foreground object detection. IEEE Tran. on Image Processing 13(11), 1459–1472 (2004) 4. Satoh, Y., et al.: Robust object detection using a radial reach filter (RRF). Systems and Computers in Japan 35(10), 63–73 (2004) 5. Monari, E., et al.: Fusion of background estimation approaches for motion detection in non-static backgrounds. In: CD-ROM Proc. of IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (2007) 6. Ukita, N.: Target-color learning and its detection for non-stationary scenes by nearest neighbor classification in the spatio-color space. In: Proc. of IEEE Int. Conf. on Advanced Video and Signal based Surveillance, pp. 394–399 (2005) 7. Stauffer, C., et al.: Adaptive background mixture models for real-time tracking. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999) 8. Shimada, A., et al.: Dynamic control of adaptive mixture-of-gaussians background model. In: CD-ROM Proc. of IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (2006) 9. Tanaka, T., et al.: A fast algorithm for adaptive background model construction using Parzen density estimation. In: CD-ROM Proc. of IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (2007) 10. Zhang, S., et al.: Dynamic background modeling and subtraction using spatiotemporal local binary patterns. In: Proc. of IEEE Int. Conf. on Image Processing, pp. 1556–1559 (2008) 11. Fukui, S., et al.: Extraction of moving objects by estimating background brightness. Journal of the Institue of Image Electronics Engineers of Japan 33(3), 350–357 (2004) 12. Tanaka, T., et al.: Non-parametric background and shadow modeling for object detection. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 159–168. Springer, Heidelberg (2007)
Image Enhancement of Low-Light Scenes with Near-Infrared Flash Images

Sosuke Matsui(1), Takahiro Okabe(1), Mihoko Shimano(1,2), and Yoichi Sato(1)

(1) Institute of Industrial Science, The University of Tokyo
(2) PRESTO, Japan Science and Technology Agency
{matsui,takahiro,miho,ysato}@iis.u-tokyo.ac.jp
Abstract. We present a novel technique for enhancing an image captured in low light by using near-infrared flash images. The main idea is to combine a color image with near-infrared flash images captured at the same time without causing any interference with the color image. In this work, near-infrared flash images are effectively used for removing annoying effects that are commonly observed in images of dimly lit environments, namely, image noise and motion blur. Our denoising method uses a pair of color and near-infrared flash images captured simultaneously. Therefore it is applicable to dynamic scenes, whereas existing methods assume stationary scenes and require a pair of flash and no-flash color images captured sequentially. Our deblurring method utilizes a set of near-infrared flash images captured during the exposure time of a single color image and directly acquires a motion blur kernel based on optical flow. We implemented a multispectral imaging system and confirmed the effectiveness of our technique through experiments using real images.
1 Introduction
When taking a picture in low light, photographers usually face the dilemma of using flash or not. The quality of an image captured without flash is often degraded by noise and motion blur. On the other hand, noise and motion blur in an image captured with flash are significantly reduced. However, flash causes undesired artifacts such as flat shading and harsh shadows. As a result, the atmosphere of the original scene evoked by dim light is destroyed. Thus, there are positive and negative points with using flash. We propose two methods for enhancing an image captured in low light according to the following two scenarios. In the first scenario, we reduce the noise of a color image captured in low light with a short exposure time, since the image is not blurry, but contains a significant amount of noise due to large gain or high ISO. In the second scenario, we remove motion blur of a color image captured in low light with a long exposure time, since the image is not noisy, but is blurry due to camera shake or scene motion. The main idea of our methods is to combine a color image captured without flash and additional near-infrared (NIR) images captured with NIR flash for reducing noise and motion blur in the color image. Because the spectrum of NIR H. Zha, R.-i. Taniguchi, and S. Maybank (Eds.): ACCV 2009, Part I, LNCS 5994, pp. 213–223, 2010. c Springer-Verlag Berlin Heidelberg 2010
light is different from that of visible light, we can capture both a color image and NIR images at the same time without causing any interference by using a multispectral imaging system composed of a color camera and an NIR camera. In addition, NIR flash provides sufficient amount of light in the NIR spectrum, thus suppressing noise and motion blur in the NIR images. Our denoising method uses a pair of color and NIR flash images captured simultaneously, which is applicable to dynamic scenes, whereas existing methods [1,2,3] assume stationary scenes and require a pair of flash and no-flash color images captured sequentially. More specifically, we first decompose the color image into a large-scale image (low-frequency components) and a detail image (high-frequency components); the former mainly includes global textures and shading caused by lighting, and the latter mainly includes subtle textures, edges, and noise. Then, taking the difference in spectrum into consideration, we carefully denoise the detail image by using a novel algorithm termed joint non-local mean algorithm which is a multispectral extension of a non-local mean algorithm [4]. Finally, we combine the large-scale and the revised detail images and obtain a denoised color image. We experimentally show that our method works better than Bennett’s method [5] which also uses NIR images to reduce noise in a video shot in a dimly lit environment. Our deblurring method uses a set of NIR flash images captured during the exposure time of a single color image and directly acquires a motion blur kernel based on optical flow in a similar manner to Ben-Ezra’s method [6], which combines videos with different temporal and spatial resolutions. Then, the Richardson-Lucy deconvolution algorithm [7,8] is used for deblurring the color image. We demonstrate that combining images with different temporal resolutions is effective also for deblurring an image captured in low light by incorporating a multispectral imaging system. The rest of this paper is organized as follows. We briefly summarize related work in Section 2. We introduce our denoising and deblurring methods in Section 3. We present experimental results in Section 4 and concluding remarks in Section 5.
2 Related Work
We briefly summarize previous studies related to our technique from two distinct points of view: denoising and deblurring. Denoising Petschnigg et al. [1] and Eisemann and Durand [2] independently proposed methods for denoising an image taken in low light by using a pair of flash and no-flash images captured using a single color camera. They combine the strengths of flash and no-flash images; flash captures details of a scene, and no-flash captures ambient illumination. More specifically, they decompose the no-flash image into a large-scale image and a detail image by using a bilateral filter [9]. Then, they revise the noisy detail image by transferring the details of the scene from the flash image. They recombine the large-scale and the revised detail images and
finally obtain a denoised image. Agrawal et al. [3] made use of the fact that the orientation of image gradient is insensitive to illumination conditions and proposed a method for removing artifacts, such as highlights, caused by flash. However, these methods share common limitations. That is, they assume stationary scenes and require a pair of flash and no-flash color images captured sequentially. On the other hand, our method uses a color image as well as an NIR flash image, which can be captured at the same time without causing any interference with the color image. The use of an NIR flash image enables us to apply our method to dynamic scenes, which is the advantage of our method. NIR images are used for denoising a video shot in a dimly lit environment [5] and for enhancing contrast and textures of an image of a high-dynamic range scene [10]. Particularly, our method is similar to the former method proposed by Bennett [5] in that it uses a pair of color and NIR images for noise reduction. However, our method differs from it with respect to the manner in which we revise the detail image. Bennett’s method revises the detail image in the visible spectrum by transferring the details from the NIR image. Since it combines intensities observed in different spectra, it causes artifacts such as color shifts. On the other hand, we revise the detail image by non-locally averaging the color image with the weights computed based on the NIR flash image. We experimentally show that our method works better for denoising an image of a low-light scene. Recently, Krishnan and Fergus [11] used dark flash consisting of IR and UV light for denoising an image taken in low light. They achieve dazzle-free flash photography by hiding the flash in invisible spectrum. However, their method also requires a pair of flash and no-flash images captured sequentially, and therefore assumes stationary scenes. Deblurring Yuan et al. [12] proposed an image enhancement method using a pair of images captured in low light using a single color camera successively with long and short exposure times. Their basic idea is denoising the image with the short exposure time and estimating the motion blur kernel of the image with the long exposure time based on the denoised image. They proposed an iterative deconvolution scheme focusing on the residuals of denoising so that ringing artifacts inherent in image deconvolution are reduced. On the other hand, our method using a multispectral imaging system is considered to be a hardware approach to image enhancement. This system captures a set of NIR flash images during the exposure time of a single color image and directly acquires the motion blur kernel based on optical flow. Thus, the use of multispectral images makes deconvolution more tractable. Ben-Ezra and Nayar [6] proposed a hybrid imaging system which captures images of a scene with high spatial resolution at a low frame rate and with low spatial resolution at a high frame rate. They directly measure the motion blur kernel of the image with the low temporal resolution by using the images with high temporal resolution. Recently, Tai et al. [13] extended their method assuming a spatially-uniform blur kernel to deal with spatially-varying blur kernels.
One of the main contributions of our study is to demonstrate that combining images with different temporal resolutions is effective also for deblurring an image of a low-light scene by incorporating a multispectral imaging system.
3 Proposed Methods
We explain our methods for removing noise and motion blur in images of dimly lit environments with the help of NIR flash images. We describe our denoising method in Section 3.1 and our deblurring method in Section 3.2.
3.1 Noise Reduction by Using NIR Flash Image
We explain how noise in a color image captured in low light with a short exposure time is reduced by using an NIR flash image, as shown in Fig. 1. First, we decompose the color image into a large-scale image and a detail image by using a dual bilateral filter [5]. The former mainly includes global textures and shading caused by lighting, and the latter mainly includes subtle textures, edges, and noise. We keep the large-scale image as is, so that the shading caused by the lighting of the scene is preserved. Second, we denoise the detail image using our joint non-local mean algorithm, so that the details are recovered and noise is reduced. Finally, we recombine the large-scale and the revised detail images and obtain a denoised color image.
Fig. 1. Flow of our denoising method with help of NIR flash image
Decomposing the color image into large-scale and detail images. First, we decompose a color image into a large-scale image and a detail image by using the dual bilateral filter [5]. The dual bilateral filter incorporates weights calculated from the NIR channel into a conventional bilateral filter [9]. Since the NIR flash image is captured under a sufficient amount of light and is not noisy, the dual bilateral filter significantly alleviates the effects of the noise contained in the color image. More specifically, we convert the color space of an input image $I_c$ from RGB (c = R, G, B) to YUV (c = Y, U, V). Then, we obtain the large-scale image of the Y component as

$$ I_Y^{\mathrm{large\ scale}}(p) = \frac{1}{Z_B(p)} \sum_{q \in \Omega_B(p)} G_D(p - q)\, G_{NIR}(I_{NIR}(p) - I_{NIR}(q))\, G_Y(I_Y(p) - I_Y(q))\, I_Y(q). \qquad (1) $$
Here, $Z_B(p)$ is a normalization constant, and $\Omega_B(p)$ is a certain area around the pixel p. $I_{NIR}(p)$, $I_Y(p)$, and $I_Y^{\mathrm{large\ scale}}(p)$ are the intensities at the pixel p in the NIR, Y, and large-scale images. $G_D$, $G_{NIR}$, and $G_Y$ are weights calculated with Gaussian functions whose means are zero and whose variances are $\sigma_D^2$, $\sigma_{NIR}^2$, and $\sigma_Y^2$, respectively. As for the U and V channels, we use the bilateral filter and obtain $I_U^{\mathrm{large\ scale}}$ and $I_V^{\mathrm{large\ scale}}$. Then, we combine the filtered YUV images and obtain a large-scale image $I_c^{\mathrm{large\ scale}}$. By dividing the original color image by the large-scale image, we acquire a noisy detail image $I_c^{\mathrm{noisy\ detail}}$ as

$$ I_c^{\mathrm{noisy\ detail}}(p) = \frac{I_c(p) + \epsilon}{I_c^{\mathrm{large\ scale}}(p) + \epsilon}, \qquad (2) $$

where $\epsilon$ is a small constant for avoiding division by zero.
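A direct, unoptimized sketch of the dual bilateral filter of Eq. (1) and the detail decomposition of Eq. (2) follows. The window size, variances and epsilon mirror the values quoted in the experiments section, and the brute-force loops are only for clarity, not an efficient implementation.

```python
import numpy as np

def dual_bilateral_large_scale(Y, NIR, radius=3, sigma_d2=100.0,
                               sigma_nir2=87.6, sigma_y2=22.5):
    """Eq. (1): large-scale (low-frequency) Y image guided by the NIR flash image."""
    h, w = Y.shape
    out = np.zeros_like(Y, dtype=float)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            w_d = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_d2))
            w_nir = np.exp(-(NIR[y0:y1, x0:x1] - NIR[y, x]) ** 2 / (2 * sigma_nir2))
            w_y = np.exp(-(Y[y0:y1, x0:x1] - Y[y, x]) ** 2 / (2 * sigma_y2))
            weights = w_d * w_nir * w_y
            out[y, x] = np.sum(weights * Y[y0:y1, x0:x1]) / np.sum(weights)
    return out

def noisy_detail(I, I_large_scale, eps=0.02):
    """Eq. (2): detail layer as a ratio image (values roughly around one)."""
    return (I + eps) / (I_large_scale + eps)
```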
Fig. 2. Basic idea of our joint non-local mean algorithm. Pixel value I(p) is replaced by weighted average of I(q), I(r), I(s), and so on. Larger weights are assigned to I(q) and I(r) with similar local textures Q and R to P , whereas smaller weight is assigned to I(s) with dissimilar local texture S to P .
Denoising the detail image using the joint non-local mean algorithm. Second, we carefully denoise the noisy detail image by taking the difference in spectra into consideration. In contrast to the existing methods, which transfer the details of a scene from a flash image [1,2] or an NIR image [5], we denoise the noisy detail image by non-locally averaging it with weights computed from the NIR flash image. More specifically, we assume that the intensity of a certain pixel is similar to the intensity of another pixel if the appearance of the patches around the two pixels resembles each other (see Fig. 2). Then, a detail image $I_c^{\mathrm{detail}}$ is acquired as

$$ I_c^{\mathrm{detail}}(p) = \frac{1}{Z_N(p)} \sum_{q \in \Omega_N(p)} G(v(p) - v(q))\, I_c^{\mathrm{noisy\ detail}}(q). \qquad (3) $$
Here, $Z_N(p)$ is a normalization constant, and $\Omega_N(p)$ is a search area around the pixel p. We represent the appearance of a patch by concatenating the k × k pixel intensities around the pixel p into a vector v(p). This joint non-local mean algorithm is a multispectral extension of the non-local mean algorithm [4], which uses the appearance of patches in the visible spectrum for determining the weights. Since the NIR flash image is captured under a sufficient amount of light and is not noisy, our joint non-local mean algorithm works well even when the color image is captured under dim lighting and, as a result, is significantly contaminated by noise. The joint non-local mean algorithm might be expected to degrade because the NIR flash image captures the radiance of the scene in a different spectrum from the visible one; however, as far as we know from our experiments, our algorithm is insensitive to the difference in spectra and outperforms the most closely related method [5].
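The joint non-local mean of Eq. (3) can be sketched as below: the averaging weights are computed from NIR patches while the averaging itself is performed on the noisy color detail layer. The brute-force search window and the Gaussian on the patch distance follow the description above; treating one channel at a time and the default parameters are our own simplifications.

```python
import numpy as np

def joint_non_local_mean(noisy_detail, nir, search_radius=10, patch_radius=1,
                         sigma2=1.5):
    """Eq. (3): denoise one channel of the detail layer; weights come from the
    similarity of k x k NIR patches (k = 2 * patch_radius + 1)."""
    h, w = nir.shape
    pr, sr = patch_radius, search_radius
    nir_pad = np.pad(nir, pr, mode="edge")
    out = np.zeros_like(noisy_detail, dtype=float)
    for y in range(h):
        for x in range(w):
            vp = nir_pad[y:y + 2 * pr + 1, x:x + 2 * pr + 1].ravel()   # v(p)
            num, den = 0.0, 0.0
            for qy in range(max(0, y - sr), min(h, y + sr + 1)):
                for qx in range(max(0, x - sr), min(w, x + sr + 1)):
                    vq = nir_pad[qy:qy + 2 * pr + 1, qx:qx + 2 * pr + 1].ravel()
                    wgt = np.exp(-np.sum((vp - vq) ** 2) / (2 * sigma2))
                    num += wgt * noisy_detail[qy, qx]
                    den += wgt
            out[y, x] = num / den
    return out
```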
Fig. 3. Flow of our deblurring method with help of successive NIR flash images
Combining the large-scale and revised detail images. Finally, we recombine the large-scale and the revised detail images and obtain a denoised color image $I_c^{\mathrm{denoised}}$ as

$$ I_c^{\mathrm{denoised}}(p) = I_c^{\mathrm{large\ scale}}(p) \times I_c^{\mathrm{detail}}(p). \qquad (4) $$

3.2 Blur Removal by Using Sequence of NIR Flash Images
We explain how blur in a color image captured in low light with a long exposure time is removed by using NIR flash images, as shown in Fig. 3. First, we take a sequence of NIR flash images during the exposure time of a single color image. Then, we directly acquire a motion blur kernel based on optical flow, in a similar manner to Ben-Ezra's method [6]. Finally, we use the Richardson–Lucy deconvolution algorithm [7,8] to deblur the blurry color image.

Estimating the blur kernel from NIR flash images. We assume a spatially-uniform motion and estimate the blur kernel from the NIR flash images as follows. First, we compute the motion between successive frames of the NIR sequence by optical flow. Then, we join the successive motions and obtain the path of the motion during the exposure time of the single color image. Finally, we convert the motion path into the blur kernel, taking the energy conservation constraint into consideration (a rough sketch is given below). For the implementation details, see Ben-Ezra and Nayar [6].
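The kernel construction and the deconvolution can be sketched as follows. The per-frame flow vectors are assumed to be given by some optical-flow routine; the rasterization of the motion path and the plain Richardson–Lucy iteration below are standard textbook forms, not the authors' exact implementation.

```python
import numpy as np

def blur_kernel_from_flow(flow_vectors, kernel_size=31, steps_per_frame=10):
    """Rasterize the accumulated inter-frame motions into a blur kernel.

    flow_vectors : list of global (dx, dy) motions between successive NIR frames.
    Normalizing the kernel to sum to one expresses the energy conservation constraint.
    """
    k = np.zeros((kernel_size, kernel_size))
    c = kernel_size // 2
    x = y = 0.0
    for dx, dy in flow_vectors:
        for s in range(steps_per_frame):                 # sample the path densely
            px = int(round(c + x + dx * s / steps_per_frame))
            py = int(round(c + y + dy * s / steps_per_frame))
            if 0 <= px < kernel_size and 0 <= py < kernel_size:
                k[py, px] += 1.0
        x += dx
        y += dy
    return k / k.sum()

def conv2_same(img, ker):
    """'Same'-size 2D convolution with zero padding (brute force, for clarity)."""
    ker = ker[::-1, ::-1]                                # flip -> true convolution
    kh, kw = ker.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, kh - 1 - ph), (pw, kw - 1 - pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += ker[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def richardson_lucy(blurred, kernel, iterations=30):
    """Plain Richardson-Lucy deconvolution of a single channel."""
    estimate = np.full(blurred.shape, blurred.mean(), dtype=float)
    flipped = kernel[::-1, ::-1]
    for _ in range(iterations):
        denom = np.maximum(conv2_same(estimate, kernel), 1e-8)
        estimate = estimate * conv2_same(blurred / denom, flipped)
    return estimate
```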
4 Experiments
We implemented a multispectral imaging system composed of a 3CCD color camera and an NIR camera, as shown in Fig. 4. The image of a scene is split by a half mirror. We used a SONY XC-003 as the color camera, an XC-EI50 as the NIR
Fig. 4. Prototype of our multispectral imaging system (the NIR camera, the color camera, and the half mirror)
camera, and a white light source covered with an NIR pass filter. The image coordinates of the two cameras are calibrated based on a homography [14]. In the current implementation, we empirically set the parameters as follows. In eq. (1), the variances for the dual bilateral filter are σ_D² = 100, σ_NIR² = 87.6, and σ_Y² = 22.5. Ω_B(p) is an area of 7 × 7 pixels around the pixel p. In eq. (2), the parameter is set to 0.02. In eq. (3), Ω_N(p) is an area of 21 × 21 pixels around the pixel p and k = 3. The variance of the Gaussian is set to 1.5.

4.1 Denoising Results
First, we demonstrate that our denoising method is applicable to dynamic scenes. The images shown in the first row of Fig. 5 are the input no-flash color images. The dynamic range of the images is linearly expanded for display purposes only. The images in the second row are the images simultaneously captured with the NIR camera. The images in the third row are the close-ups of the bounding
Fig. 5. Results for a dynamic scene. The images in the first and second rows are the color and NIR flash images captured at the same time. The images in the third row are close-ups of the bounding boxes in the first-row images. The corresponding results of our method are shown in the fourth row.
Fig. 6. Results for a static scene. (a) The input no-flash image, (b) the result obtained using Bennett's method, (c) the result obtained using our method, and (d) the temporal average.
Fig. 7. Quantitative comparison between our method and Bennett's method. The plot shows the RMS errors of the no-flash image, Bennett's method, and our method for scenes I, II, and III.
boxes in the images in the first row, and the images in the fourth row are the corresponding results. One can see that our method significantly reduces noise in the images, even for a dynamic scene, by using a pair of color and NIR flash images. Second, we applied our method to a static scene, where the temporal average of the no-flash color images is considered to be the ground truth of the denoised image if we assume zero-mean image noise. Fig. 6 shows (a) an input no-flash color image, (b) the result obtained using Bennett's method, (c) the result obtained using our method, and (d) the temporal average. One can see that the result obtained from our method resembles the temporal average. On the other hand, one can see that Bennett's method causes artifacts such as color shifts and blurs. These results demonstrate that our method, which carefully revises the detail image by taking the difference in spectra into consideration, outperforms Bennett's method. Next, we quantitatively evaluated the performance of our method. Fig. 7 compares the root-mean-square (RMS) errors of the pixel values in the corresponding bounding boxes in the color image. We consider the temporal average image as the ground truth of the denoised image. One can see that our method decreases
Fig. 8. When the NIR image is saturated due to highlights caused by the NIR flash, our denoising method does not work well
the RMS errors compared with the input color image and outperforms Bennett's method. Finally, Fig. 8 demonstrates an example where our method does not work well. One can see that a portion of (b) the NIR flash image is saturated due to highlights caused by the NIR flash. In this case, the weights in eq. (3) are large for patches in the highlights because the textures have disappeared due to saturation. Thus, (d) the resulting image is blurry.

4.2 Deblurring Results
As shown in Fig.9, we captured (a) nine NIR flash images of a scene during the exposure time of (c) a single no-flash color image. We estimated (b) the spatially-uniform motion blur kernel from the sequence of NIR flash images and obtained (d) the deblurred image. One can see that the motion blur decreases although some artifacts are still visible.
Fig. 9. (a) NIR flash images for estimating motion blur kernel, (b) estimated blur kernel, (c) blurry input image, and (d) deblurred image
5 Conclusions and Future Work
We presented a novel technique for enhancing an image captured in low light by using a multispectral imaging system, which captures a color image and NIR flash images without causing any interference. The experimental results demonstrate that our denoising method using a pair of color and NIR flash
images is applicable to dynamic scenes and outperforms the existing method that is most closely related to ours. We demonstrated that combining images with different temporal resolutions is also effective for deblurring an image of a low-light scene. The directions of our future work include the enhancement of an image that is both noisy and blurry, since our current methods handle either a noisy image without blur or a blurry image without noise. In addition, we plan to remove the unpleasant effects of highlights caused by the NIR flash.
References
1. Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M., Hoppe, H., Toyama, K.: Digital photography with flash and no-flash image pairs. In: Proc. SIGGRAPH 2004, pp. 664–672 (2004)
2. Eisemann, E., Durand, F.: Flash photography enhancement via intrinsic relighting. In: Proc. SIGGRAPH 2004, pp. 673–678 (2004)
3. Agrawal, A., Raskar, R., Nayar, S., Li, Y.: Removing photography artifacts using gradient projection and flash-exposure sampling. In: Proc. SIGGRAPH 2005, pp. 828–835 (2005)
4. Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: Proc. CVPR 2005, pp. II-60–65 (2005)
5. Bennett, E.: Computational video enhancement. PhD thesis, The University of North Carolina at Chapel Hill (2007)
6. Ben-Ezra, M., Nayar, S.: Motion deblurring using hybrid imaging. In: Proc. CVPR 2003, pp. I-657–664 (2003)
7. Richardson, W.: Bayesian-based iterative method of image restoration. JOSA 62(1), 55–59 (1972)
8. Lucy, L.: An iterative technique for the rectification of observed distributions. Astronomical Journal 79(6), 745–754 (1974)
9. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Proc. ICCV 1998, pp. 839–846 (1998)
10. Zhang, X., Sim, T., Miao, X.: Enhancing photographs with near infra-red images. In: Proc. CVPR 2008, pp. 1–8 (2008)
11. Krishnan, D., Fergus, R.: Dark flash photography. In: Proc. SIGGRAPH 2009 (2009)
12. Yuan, L., Sun, J., Quan, L., Shum, H.-Y.: Image deblurring with blurred/noisy image pairs. In: Proc. SIGGRAPH 2007 (2007)
13. Tai, Y.W., Du, H., Brown, M., Lin, S.: Image/video deblurring using a hybrid camera. In: Proc. CVPR 2008, pp. 1–8 (2008)
14. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press, Cambridge (2004)
A Novel Hierarchical Model of Attention: Maximizing Information Acquisition

Yang Cao and Liqing Zhang

MOE-Microsoft Laboratory for Intelligent Computing and Intelligent Systems, and Department of Computer Science and Engineering, Shanghai Jiao Tong University, No. 800 Dongchuan Road, Shanghai, 200240, China
[email protected]
Abstract. A visual attention system should preferentially locate the most informative spots in complex environments. In this paper, we propose a novel attention model that produces saliency maps by generating information distributions on incoming images. Our model automatically marks spots carrying a large amount of information as salient, which ensures that the system gains the maximum information by attending to these spots. Building a biologically grounded computational framework, we use the neural coding length as an estimate of information, and introduce relative entropy to simplify this calculation. Additionally, a real attention system should be robust to scale. Inspired by the visual perception process, we design a hierarchical framework to handle multi-scale saliency. Experiments demonstrate that the proposed attention model is efficient and adaptive. In comparison to mainstream approaches, our model achieves better accuracy on fitting human fixations.

Keywords: Visual Saliency, Information Acquisition, Relative Entropy, Hierarchical Framework, ROC Area.
1 Introduction
Selective visual attention, as a fundamental mechanism in the human visual system, serves to allocate our limited perception and processing resources to the most informative stimuli embedded in natural scenes. Since it affects the way we observe the world, the potential applications of an attention model are quite broad, such as target detection, surveillance and robot control. In this paper, we mainly focus on bottom-up, saliency-driven attention. The major task of a computational attention model is to interpret and simulate how attention is deployed given a visual scene. In the early visual path, incoming sensory signals are encoded into primary features. Fundamental work by Koch and Ullman [1] hypothesized that various feature maps are integrated into a single topographic map called the saliency map (Fig. 1). An active location in this two-dimensional map suggests that the corresponding location in the visual scene is salient. Later research [2] showed that such a map is probably located in
Fig. 1. A natural image (left) and its saliency map (right) from human fixations
the visual cortex. In recent decades, a number of models (see [3,4] for a survey) have been developed for producing the saliency map. Since Bruce et al. [5] proposed a credible and practical approach to evaluate model performance by comparing predicted saliency maps with eye tracking data from human experiments, the major challenge has been how to accurately predict the spots that may attract fixations in a realistic natural scene. In the earliest visual stage, some simple features such as orientations, intensity and colors can trigger saliency [6]. Usually, it is difficult to decide whether an incoming stimulus is salient or not without considering its surrounding context. Inspired by the receptive field in the visual system, Itti et al. [2] developed a computational model to measure saliency by using a series of center-surround detectors, which calculate the local contrast of features. Moreover, recent researchers have defined saliency as the power of features to discriminate [7], the spectral residual [8], or the self-information [9]. Hou et al. [10] proposed the incremental coding length hypothesis to predict saliency, which shows good performance in dynamic situations. Many researchers regard saliency as irregular patterns, and their models mainly focus on comparing elementary features in images [11]. Actually, the origin of attention is the limited processing capacity of the visual system [12]. Under metabolic constraints, primates cannot perceive all incoming stimuli; instead, they select the most informative part [13], such as food and enemies, in order to survive in a complex and competitive environment. This fact is crucial in developing computational models of attention [14]. Inspired by [15], we propose a novel visual attention model based on maximizing information acquisition. The saliency map produced by our model reflects the information distribution of the natural scene, where the most salient spots contain the largest amount of information. By attending to these spots preferentially, the visual system achieves the maximum transmission efficiency of information. It is troublesome to directly calculate the information content of a visual region. A general method is to build a probability estimate from massive observation data, which is artificial and time-consuming. Based on sparse representation, our model represents natural images as the primary visual cortex does, and information is estimated by the neural coding length. Unlike other models, we introduce relative entropy to measure saliency, which is efficient and biologically plausible. Further discussion can be found in Sect. 4.
Owing to the limited resolution of early visual processing [16], visual perception is actually a progressive process. Humans cannot perceive all details in a complex natural scene without a slow, exhaustive scan. From [17] we know that, for different sizes of view, saliency focuses on different objects. Thus, scale does affect our attention behavior. Mainstream models usually adopt a fixed size of receptive field [5,10]. When facing a natural scene with multi-scale salient objects, they are not robust enough. Inspired by [18], we designed a hierarchical framework, which is parameter-free and adaptive to any scaling, to deal with the multi-scale problem of saliency. In Sect. 3, we make comparisons to mainstream approaches on static natural images, and the experimental results demonstrate that our model achieves the best accuracy on fitting human fixations.
2 The Model
The proposed model has close ties with the neurobiological mechanisms of attention. First, we decompose image patches into simple features through sparse representation. Then, after computing the neural firing rates of all features, we measure local saliency by relative entropy, which calculates the change of neural coding lengths. Finally, motivated by the visual perception process, we design a hierarchical framework to combine the temporary results into a single saliency map.

2.1 Sparse Representation for Natural Scenes
Existing evidence indicates that the primate visual system encodes information by establishing a sparse representation, that is, each kind of feature is represented by the response of only a small population of neurons at one time [19]. This sparse strategy is very economical, and it is believed to be a fundamental principle of sensory processing in the nervous system. Motivated by this common fact, we choose sparse features to represent natural scenes. With standard independent component analysis (ICA) [20], we learned a set of sparse bases using 120,000 RGB image patches of size 8 × 8, which are randomly sampled from natural images. There are in total 8 × 8 × 3 = N reconstruction bases A and corresponding N filtering bases W (Fig. 2). Let A = [a_1, a_2, . . . , a_N] and W = [w_1, w_2, . . . , w_N] be the vectorized sparse bases; the neuron response of a vectorized image patch x on feature i is:

s_i = x w_i^T    (1)
As pointed out by [21], the bases (Fig. 2) trained by ICA resemble wavelets or Gabor functions, which simulate the primary features encoded by simple cells in V1. Due to the functional similarity between the sparse bases in the computational model and the primary neurons in the early visual path, it is very convenient to perform the neural coding analysis in our next step.
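A compact way to reproduce this step is sketched below, using FastICA from scikit-learn as a stand-in for the ICA algorithm of [20]; the patch count, the random image source, and all variable names are illustrative assumptions rather than the authors' exact setup.

import numpy as np
from sklearn.decomposition import FastICA

def learn_sparse_basis(images, n_patches=20000, patch=8):
    # Sample random RGB patches and vectorize them (patch * patch * 3 = N dimensions).
    rng = np.random.default_rng(0)
    X = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch)
        x = rng.integers(img.shape[1] - patch)
        X.append(img[y:y + patch, x:x + patch, :].reshape(-1))
    X = np.asarray(X, dtype=float)
    X -= X.mean(axis=0)                                   # remove the DC component
    ica = FastICA(n_components=X.shape[1], whiten='unit-variance', max_iter=500)
    ica.fit(X)
    W = ica.components_                                   # filtering bases (rows are w_i)
    A = ica.mixing_                                       # reconstruction bases
    return A, W

def responses(x, W):
    # Eq. (1): s_i = x w_i^T for a vectorized patch x.
    return W @ x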
Fig. 2. A subset of the reconstruction bases A (left) and the corresponding filtering bases W (right)
2.2 Saliency from Coding Length
Aiming at maximizing information acquisition, our model generates the information distribution over the whole scene. Therefore, local saliency is easily measured according to the information content of each spot. Entropy, as a common method to estimate information, is not suitable because it prefers the location with the most complicated texture, which usually appears in the background. In fact, saliency only makes sense when given a concrete context. To measure the saliency of a specific spot, conditional entropy is a practicable tool. However, it is neither necessary nor efficient to estimate conditional probabilities using massive amounts of observed data, for our goal is to provide a fast and accurate solution. Based on sparse representation, our model perceives natural scenes as the primary visual cortex does. Thus we can use the neural coding length as an estimate of the information content. Since saliency is defined on contrast, given a patch in the image, it is only necessary to compare the coding length between the context of the patch and the whole image, which reflects the information difference before and after observing this patch. If this difference is large, it means this
Fig. 3. The work flow of our algorithm. This sample contains green grass with a conspicuous white object in the center. The algorithm analyzes the saliency of each patch by calculating the information difference between the image and its corresponding complement.
patch contains a lot of information, that is, saliency. We define the context as the complement image, which is obtained by directly eliminating the patch from the whole image. Based on information theory, we introduce relative entropy to calculate the change of coding length. Here we provide a detailed description of our algorithm. First, it serially scans the whole image I like a spotlight (Fig. 3). Let S = [s_1, s_2, . . . , s_n, . . .] be the patches to be analyzed. Given a specific s_i, C_i is the complement part of s_i in the image. Supposing X = [x_1, x_2, . . . , x_m, . . .] are the constituent patches of I, the neural firing rate of feature k on I is:

f_k = Σ_m (x_m w_k^T)² / Σ_{i=1}^N Σ_m (x_m w_i^T)²    (2)

F^I = [f_1, f_2, . . . , f_N] is the probability function of neuron activities on I. Similarly, the probability function on C_i is calculated in the same way, which is F_i^C = [f_{i,1}, f_{i,2}, . . . , f_{i,N}]. We measure the saliency of s_i by the information difference between F^I and F_i^C:

Saliency(s_i) = D_KL(F^I || F_i^C) = Σ_{j=1}^N f_j log(f_j / f_{i,j}) = −H(F^I) − Σ_{j=1}^N f_j log(f_{i,j})    (3)
D_KL(·) is the relative entropy operator and H(·) is the entropy operator. In the view of neural coding, the relative entropy in (3) means that, given the coding of C_i, we need an extra D_KL(F^I || F_i^C) coding length to describe I in the average case. From another point of view, saliency is based on uncertainty. If we already know that s_i has the same content as C_i, we do not need additional coding length to represent s_i, because the information from C_i is sufficient. To sum up, relative entropy is a good tool to calculate the change of the coding length. Our model not only provides a simple and fast algorithm to quantify saliency, but also has a neurobiological grounding in the visual system.
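To make eqs. (2) and (3) concrete, the sketch below scores each patch by the KL divergence between the firing rates of the whole image and of its complement. W is the filtering basis of Sect. 2.1; treating the image as a flat array of non-overlapping vectorized patches and the small epsilon for numerical stability are our own simplifications.

import numpy as np

def patch_saliency(patches, W, eps=1e-12):
    # patches: (m, d) vectorized image patches; W: (N, d) filtering bases.
    S2 = (patches @ W.T) ** 2             # squared responses, shape (m, N)
    total = S2.sum(axis=0)                # per-feature energy of the whole image I
    F_I = total / total.sum()             # eq. (2): firing rates on I
    sal = np.zeros(len(patches))
    for i in range(len(patches)):
        comp = total - S2[i]              # complement C_i: the image without patch i
        F_C = comp / comp.sum()
        sal[i] = np.sum(F_I * np.log((F_I + eps) / (F_C + eps)))   # eq. (3): D_KL(F_I || F_C)
    return sal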
2.3 A Hierarchical Framework for Multi-Scale Saliency
Inspired by the visual perception process, we design a hierarchical framework to progressively analyze saliency at different scales. Empirically, we set three layers in our model: global view, middle view and detailed view, which simulate three sizes of visual field from large to small. In the last two stages, we calculate saliency in sub-windows instead of the whole image, and then fuse these sub-maps of saliency into an integral map as the result of that layer. In the end, without weight tuning, the three temporary maps are combined into one final saliency map through addition.
Fig. 4. A hierarchical framework for producing multi-scale saliency. Our framework contains three stages with different window sizes (global view: 32 in 128, middle view: 16 in 64, detailed view: 8 in 32). The pair of numbers, 16 in 64 for example, means the side length of the scan windows is 16 and the side length of the corresponding context windows is 64 at this stage.
Take Fig. 4 for example. A green ball and three pink candleholders are salient according to the human records. Since their sizes are diverse, saliency detectors with fixed receptive fields are not robust enough. In our experiment, at the global view stage, the side length of the scan windows is 32 and that of the context windows is the maximum, 128. As a result, only the big green ball is popped out. Then at the middle view stage, the side lengths of the scan windows and context windows are reduced to half, 16 and 64 respectively. At this level, both the green ball and the pink candleholders are marked as salient, at the cost of some of the tablecloth coming out. At the last stage, the side lengths of the windows are further reduced, to 8 and 32. At this level, most details are captured. In the end, we add up the three temporary maps without weight tuning to form the final saliency map. This operation resembles a competition among the different layers, for only those spots containing a large total amount of information can be salient, which coincides with the objective of our model. As we can see, the major targets still remain (blue circles) but the noise decays (yellow circle). In fact, the sizes of the windows are set merely for computational convenience. For example, the ratio of the scan window side length to the context window side length at each stage is 1/4, which could be substituted by any practical factor. Besides, it is also a common strategy to reduce the window size by a factor of 2 between two adjacent levels. Thus we merely provide one instantiation of this hierarchical framework to adaptively deal with multi-scale saliency.
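The three-layer scheme can be sketched as follows: at each level a scan window slides over the image, its saliency is computed against a larger context window, and the per-level maps are summed without weights. The score_fn callback (which should measure the information a patch adds to its context, e.g., via the relative-entropy score above) is an assumed interface, and the non-overlapping stride is a simplification.

import numpy as np

def hierarchical_saliency(image, score_fn,
                          levels=((32, 128), (16, 64), (8, 32))):
    # levels: (scan window side, context window side) pairs, as in Fig. 4.
    h, w = image.shape[:2]
    final = np.zeros((h, w))
    for scan, ctx in levels:
        level_map = np.zeros((h, w))
        for y in range(0, h - scan + 1, scan):
            for x in range(0, w - scan + 1, scan):
                # Context window roughly centered on the scan window, clipped to the image.
                cy0 = max(0, y + scan // 2 - ctx // 2)
                cx0 = max(0, x + scan // 2 - ctx // 2)
                cy1 = min(h, cy0 + ctx)
                cx1 = min(w, cx0 + ctx)
                context = image[cy0:cy1, cx0:cx1]
                patch = image[y:y + scan, x:x + scan]
                level_map[y:y + scan, x:x + scan] = score_fn(patch, context)
        final += level_map            # combine the three temporary maps by addition
    return final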
3 Experimental Validation
According to the validation approach and database proposed by [5], we made comparisons between the saliency maps produced by our model and eye fixation records from humans. This database contains 120 natural images covering a wide range of situations (the images in our experiments are down-sampled to 128 × 170), and the eye tracking data is collected from 20 subjects. The quantitative evaluation of performance is the area under the ROC curve [17]. Given a specific saliency map, the work flow is shown in Fig. 5. The model performance is evaluated by the average ROC area over the whole database. We compared our model with the mainstream models (Table 1).

Table 1. Our model versus mainstream models

Attention Model     Average ROC area   Improvement
Our Model           0.8238             0%
Hou et al. [8]      0.7776             5.94%
Gao et al. [22]     0.7729             6.59%
Bruce et al. [5]    0.7697             7.02%
Itti et al. [23]    0.7271             13.3%
Fig. 5. Work flow of calculating the ROC area. The saliency map is regarded as a series of binary classifiers by setting different thresholds, from 0 to 1, 200 in total in our experiment. As the threshold declines, the binary map predicts more fixation points. Through comparing a binary map with the human fixation map, we obtain its hit rate and false alarm rate. In the end, every binary map is projected as a point on the ROC curve. The higher the ROC area is, the more accurately the model performs on the image.
Fig. 6. Results of the qualitative comparison. The first image in each row is the original input; the remaining saliency maps, from left to right, are produced by our model, human fixations, Gao et al. [22], and Bruce et al. [5].
All compared models are validated using the same benchmark and evaluation method. Due to a different sampling density when depicting the ROC curve, the reported performance could have a slight bias (±0.003); however, the relative relationship is unchanged. From the experimental results, our model achieves the best performance (0.8238), and the accuracy improvement is apparent (more than 5.9%). Qualitative illustrations are shown in Fig. 6.
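A minimal version of the evaluation in Fig. 5 is sketched below: the saliency map is thresholded at 200 levels, each binary map is compared with the binary human fixation map to obtain hit and false-alarm rates, and the area is integrated with the trapezoidal rule. This follows the described workflow rather than the authors' exact evaluation code.

import numpy as np

def roc_area(saliency, fixations, n_thresholds=200):
    # saliency: real-valued map; fixations: binary human fixation map of the same shape.
    rng = saliency.max() - saliency.min()
    sal = (saliency - saliency.min()) / (rng + 1e-12)
    fix = fixations.astype(bool)
    hit, fa = [], []
    for t in np.linspace(1.0, 0.0, n_thresholds):
        pred = sal >= t
        hit.append((pred & fix).sum() / max(fix.sum(), 1))        # hit rate
        fa.append((pred & ~fix).sum() / max((~fix).sum(), 1))     # false alarm rate
    fa, hit = np.asarray(fa), np.asarray(hit)
    order = np.argsort(fa)
    fa, hit = fa[order], hit[order]
    return float(np.sum(np.diff(fa) * (hit[1:] + hit[:-1]) / 2.0))   # trapezoidal AUC

Averaging roc_area over all 120 images of the database then gives the figures reported in Table 1.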
4 Discussion
In this section, we first analyze the computational complexity of our algorithm, then further discuss our model in terms of methods and biological principles. The computational complexity of the global stage of our model is O(M), where M is the number of scan windows (see Fig. 4 for a review). The other two stages (middle view and detailed view) need more calculation, for they both have two loops in the algorithm. In fact, if a trade-off between speed and accuracy is needed, the global view stage alone is enough, for it costs less than 0.06 s per image (Core Duo 2.2 GHz, 2 GB RAM) but gains more than 0.79 average ROC area in our experiments, which already exceeds the performance of the other models. Therefore, our model is definitely capable of dynamic prediction online. Many models following [1] measure saliency in separate feature channels, typically colors, intensity and orientations, and the saliency map is combined from these independent feature maps, the so-called "feature integration theory". In contrast, our model represents images on sparse bases, which naturally avoids
the integration problem. These bases describe all features in a unified space, which is actually more "integrative". Since they form a complete representation of natural images, the sparse features not only contain more information than artificial features, but also expose latent primary features that could contribute to saliency. Though relative entropy is non-commutative (we can define its operating direction), it is an ideal tool to measure information gain. In the view of probability theory, relative entropy is also called the Kullback-Leibler (K-L) divergence, which reflects how different two probability distributions are. Therefore, our model is compatible with models that are designed for finding irregular patterns. However, the relative entropy cannot be substituted by the direct entropy subtraction H(F^I) − H(F_i^C); otherwise the information amounts of the context would not be comparable in the two operators, because their codings are based on different distributions. Among previous research, Itti et al. [24] used the K-L divergence as a model verification tool, and Gao et al. [25] adopted it for feature selection in a recognition task. Thus both their goals and methods are quite different from ours.
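A two-line numeric check of this point, with made-up distributions, shows that the KL divergence and the direct entropy difference generally disagree and can even differ in sign:

import numpy as np

p = np.array([0.7, 0.2, 0.1])                             # F^I (illustrative)
q = np.array([0.4, 0.4, 0.2])                             # F_i^C (illustrative)
kl = np.sum(p * np.log(p / q))                            # D_KL(p || q) ~ 0.18
dh = -np.sum(p * np.log(p)) - (-np.sum(q * np.log(q)))    # H(p) - H(q) ~ -0.25
print(kl, dh)                                             # the two quantities differ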
5 Conclusion
Driven by neural principles of attention, we propose a novel hierarchical attention model for maximizing information acquisition. By representing natural images as the visual system does, this biologically plausible model measures local saliency from relative entropy, and uses a perception-inspired architecture to boost the accuracy of the saliency map. As demonstrated by the experiments, our model can be used as an efficient and adaptive calculation framework for saliency, and it achieves the best performance among state-of-the-art models.
Acknowledgments The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301), the Science and Technology Commission of Shanghai Municipality (Grant No. 08511501701), and the National Natural Science Foundation of China (Grant No. 60775007).
References
1. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology 4(4), 219–227 (1985)
2. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998)
3. Kadir, T., Brady, M.: Saliency, scale and image description. International Journal of Computer Vision 45(2), 83–105 (2001)
4. Itti, L., Koch, C.: Computational modelling of visual attention. Nat. Rev. Neurosci. 2(3), 194–203 (2001)
5. Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: Advances in Neural Information Processing Systems, vol. 18, pp. 155–162 (2006)
6. Maunsell, J.H., Treue, S.: Feature-based attention in visual cortex. Trends in Neurosciences 29(6), 317–322 (2006)
7. Gao, D., Vasconcelos, N.: Decision-theoretic saliency: Computational principles, biological plausibility, and implications for neurophysiology and psychophysics. Neural Computation 21(1), 239–271 (2009)
8. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach, pp. 1–8 (2007)
9. Bruce, N., Tsotsos, J.: Saliency, attention, and visual search: An information theoretic approach. Journal of Vision 9(3), 5 (2009)
10. Hou, X., Zhang, L.: Dynamic visual attention: searching for coding length increments. In: Advances in Neural Information Processing Systems, vol. 21, pp. 681–688 (2009)
11. Boiman, O.: Detecting irregularities in images and in video. International Journal of Computer Vision 74, 17–31 (2007)
12. Balasubramanian, V., Kimber, D., Berry II, M.J.: Metabolically efficient information processing. Neural Computation 13(4), 799–815 (2001)
13. Wainwright, M.J.: Visual adaptation as optimal information transmission. Vision Research 39(23), 3960–3974 (1999)
14. Zhang, L., Tong, M., Marks, T., Shan, H., Cottrell, G.: SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision 8(7), 32 (2008)
15. van Hateren, J.H.: Real and optimal neural images in early vision. Nature 360, 68–70 (1992)
16. Intriligator, J., Cavanagh, P.: The spatial resolution of visual attention. Cognitive Psychology 43(3), 171–216 (2001)
17. Tatler, B.W., Baddeley, R.J., Gilchrist, I.D.: Visual correlates of fixation selection: effects of scale and time. Vision Research 45(5), 643–659 (2005)
18. Deco, G., Schurmann, B.: A hierarchical neural system with attentional top-down enhancement of the spatial resolution for object recognition. Vision Research 40(20), 2845–2859 (2000)
19. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996)
20. Bell, A.J., Sejnowski, T.J.: The independent components of natural scenes are edge filters. Vision Research 37(23), 3327–3338 (1997)
21. Hyvarinen, A., Hoyer, P., Hurri, J., Gutmann, M.: Statistical models of images and early vision. In: Proceedings of the Int. Symposium on Adaptive Knowledge Representation and Reasoning, pp. 1–14 (2005)
22. Gao, D., Mahadevan, V., Vasconcelos, N.: The discriminant center-surround hypothesis for bottom-up saliency. In: Advances in Neural Information Processing Systems, vol. 20, pp. 497–504 (2008)
23. Itti, L., Koch, C.: A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research 40, 1489–1506 (2000)
24. Itti, L., Baldi, P.: Bayesian surprise attracts human attention. In: Advances in Neural Information Processing Systems, vol. 18, pp. 547–554 (2006)
25. Gao, D., Vasconcelos, N.: Discriminant saliency for visual recognition from cluttered scenes. In: Advances in Neural Information Processing Systems, vol. 17, pp. 481–488 (2005)
Interactive Shadow Removal from a Single Image Using Hierarchical Graph Cut

Daisuke Miyazaki¹, Yasuyuki Matsushita², and Katsushi Ikeuchi¹

¹ The University of Tokyo, Institute of Industrial Science, Komaba 4-6-1, Meguro-ku, Tokyo, 153-8505 Japan
{miyazaki,ki}@cvl.iis.u-tokyo.ac.jp
² Microsoft Research Asia, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing, 100190 P.R.C.
[email protected]
Abstract. We propose a method for extracting a shadow matte from a single image. The removal of shadows from a single image is a difficult problem to solve unless additional information is available. We use user-supplied hints to solve the problem. The proposed method estimates a fractional shadow matte using a graph cut energy minimization approach. We present a new hierarchical graph cut algorithm that efficiently solves the multi-labeling problems, allowing our approach to run at interactive speeds. The effectiveness of the proposed shadow removal method is demonstrated using various natural images, including aerial photographs.
1 Introduction

Shadows in an image reduce the reliability of many computer vision algorithms, such as shape-from-X, image segmentation, object recognition and tracking. Also, shadows often degrade the visual quality of the images, e.g., causing inconsistencies in a stitched aerial photograph. Shadow removal is therefore an important pre-processing step for computer vision algorithms and image enhancement. Decomposition of a single image into a shadow image and a shadow-free image is essentially a difficult problem to solve unless additional prior knowledge is available. Although various types of prior information have been used in previous approaches, the task of shadow removal remains challenging. Because the previous techniques do not use a feedback loop to control the output, it has not been possible to refine the output in the intended manner. As a result, it is still a time-consuming task to remove the shadows, especially from the more difficult examples. To address this problem, we developed an efficient computation method for the shadow removal task. Unlike the previous shadow removal methods, our method allows the user to interactively and incrementally refine the results. The interaction speed is achieved by using a new formulation for shadow removal in a discrete optimization framework and a solution method.
This work was done while the first author was a visiting researcher at Microsoft Research Asia. The first author is currently with Hiroshima City University, Japan.
The chief contributions of this paper are as follows:

MRF formulation for shadow removal. We, like Nielsen and Madsen [1], formulated the problem of shadow matte computation in a Markov random field (MRF) framework. Unlike their approach, we used the user-supplied hints fully as prior information, while discrete optimization techniques can find the best solution using the prior information.

Hierarchical graph cut. To achieve the interactive speed, we developed a hierarchical optimization method for the multi-labeling problem. The method produces a sub-optimal solution and has time complexity of order log n, while the α-expansion [2] and Ishikawa's graph cut [3] have time complexity of order n.

Interactive optimization. Our system interactively and incrementally updates the estimates of the shadow matte. The estimates are refined by the user via a stroke-based user interface.

We validated the effectiveness of our technique quantitatively and qualitatively using various different input images.

1.1 Prior Work

Shadow removal algorithms can be categorized into two classes: multiple-image and single-image methods. Weiss [4] proposed a multiple-image method for decomposing an input image sequence into multiple illumination images and a single reflectance image. This method was extended by Matsushita et al. [5] to produce multiple illumination images and multiple reflectance images. These methods require several images taken from a fixed viewpoint, which limits their application. Both automatic and interactive techniques have been proposed for the removal of shadows from a single image. Finlayson et al. [6] presented an automatic method that detects the shadow edge by entropy minimization. Fredembach and Finlayson [7] extended that method to improve the computational efficiency. These two methods aim to detect the shadow edges using physics-based methods, but they require the illumination chromaticity of the shadow area to be different from that of the non-shadow area. On the other hand, Tappen et al. [8] took a learning-based approach by creating a database of edge images to determine the shadow edges robustly. Instead of using the edge information, other works use the brightness information. Baba et al. [9] estimated gradually changing shadow opacities, assuming that the scene does not contain complex textures. Conversely, the method proposed by Arbel and Hel-Or [10] can handle scenes with complex textures, but it does not handle gradual changes in the shadow opacity. Nielsen and Madsen [1] proposed a method that can estimate gradually changing shadow opacities from complex textured images. However, their method remains limited due to the simple thresholding method used to detect the shadow edges. Recently, interactive methods have been gaining attention, enabling the user to supply hints to the system to remove shadows from difficult examples. Wu and Tang's method [11,12] removes shadows when given user-specified shadow and non-shadow regions. It adopts a continuous optimization method that requires many iterations to
converge. As a result, it is not straightforward to use their method in an interactive and incremental manner. Our method solves this problem by formulating the problem in an MRF framework.

1.2 Overview of Our Approach

The overview of our shadow removal method is illustrated in Fig. 1. First, we automatically over-segment the input image to produce a set of super-pixels. In the next stage, the region segmentation stage, the user specifies the shadow, non-shadow, and background regions using a stroke-based interface such as Lazy Snapping [13]. Using the likelihood of the non-shadow region as prior information, our method removes the shadows using a hierarchical graph cut algorithm at this initial removal stage. To further improve the results, the user can specify areas where the shadow was not perfectly removed. The parameters of the hierarchical graph cut algorithm are updated by additional user interaction at this interactive refinement stage, and the improved output is displayed to the user rapidly.
Fig. 1. Illustration of the shadow removal process. (a) The input is a single image. (b) The image is segmented into small super-pixels that are used in steps (c) and (d). (c) The user draws strokes to specify the shadow, non-shadow, and background regions. (d) Our graph cut shadow removal algorithm is applied to the image using default parameters. (e) The user specifies any defective areas in the results from step (d), and the graph cut shadow removal algorithm recalculates a shadow-free image using the updated parameters. (f) The resulting shadow-free image.
2 MRF Formulation of Shadow Removal

This section describes our graph cut shadow removal algorithm. We begin with the image formation model of Barrow and Tenenbaum [14]. The input image I can be expressed as a product of two intrinsic images, the reflectance image R and the illumination image L, as

I = RL.    (1)

The illumination image L encapsulates the effects of illumination, shading, and shadows. We can further decompose the illumination image into L = βL′, where β represents the opacity of the shadows, defined as a function of the shadow brightness, and L′ represents the other factors of L. Hence, Eq. (1) can be written as I = βRL′, or more simply,

I = βF,    (2)
as in Wu and Tang [11]. Here, F represents the shadow-free image. β and F are interdependent, i.e., if we know β, we also know F. Therefore, our problem is the estimation of β from the input image. We designed an energy function characterized by four properties: (1) the likelihood of the texture (D^t); (2) the likelihood of the umbra (D^u); (3) the smoothness of the shadow-free image F (D^f); and (4) the smoothness of the shadow image β (V^b):

E(β) = Σ_{p∈P} [ λ^t D^t_p(β_p) + λ^u D^u_p(β_p) + λ^f D^f_p(β_p) ] + Σ_{{p,q}∈N} λ^b V^b_{p,q}(β_p, β_q),    (3)
where E(β) is the total energy over all nodes P and edges N. The parameters λ^t, λ^u, λ^f, and λ^b are the weight factors of the corresponding cost terms, and p and q are node indices. The likelihood cost of the texture D^t is related to the probability density function (pdf) of the non-shadow region. Assuming that the likelihood P of the non-shadow region is the same as the likelihood of the shadow-free image F, the cost function can be formulated as

D^t_p(β_p) = − log P(I_p / β_p),  p ∈ P,    (4)

where I/β represents F. We represent the pdf P as a 1D histogram for each color channel. We do not estimate all three color channels jointly using a 3D pdf, since graph cuts cannot optimize a vector. Dividing the average intensity of the shadow region by the average intensity of the non-shadow region yields a good initial estimate for β, which we denote as β_0. At the initial removal stage, we, like Wu and Tang [11], use β_0 as the initial value for β. The inner part of the shadow region (the umbra) has a value close to β_0, while the shadow boundary (the penumbra) varies from β_0 to 1, with 1 representing a non-shadow region. In order to express these characteristics, we introduce the following cost function for D^u:

D^u_p(β_p) = |β_p − β_0|^0.7 + |β_p − 1|^0.7,  p ∈ P.    (5)
The L0.7-norm is useful for separating two types of information [15], and we also use it here to decompose the input image into the shadow and non-shadow images. We also employ a smoothness term for the shadow-free image F and the shadow image β. The hierarchical graph cut is based on the α-expansion [2], and the α-expansion requires the smoothness measure to be a metric [2]. The Euclidean distance is a metric, and thus we set the smoothness term of the shadow image β as follows:

V^b_{p,q}(β_p, β_q) = |β_p − β_q|,  {p, q} ∈ N.    (6)
Although the smoothness term defined in Eq. (6) can be solved using either the α-expansion or Ishikawa's method [3], we solve it using a hierarchical graph cut in order to reduce the computation time. We also set a smoothness term for the shadowless image, but because F = I/β, we cannot define a metric cost. We therefore calculate the smoothness term of the shadowless image as follows and add it to the data term:

D^f_p(β_p) = |∇(I_p / β_p)|²,  p ∈ P.    (7)
Fig. 2. Results of image segmentation. (a) Shadow, non-shadow, and background regions specified by red, blue, and green strokes drawn by the user. (b) The image segmented into shadow, non-shadow, and background regions represented as red, blue, and green regions. (c) The image segmented into small super-pixels.
The value F_p = I_p/β_p is calculated by fixing F_q, where q denotes the neighboring pixels of p. In order to construct the prior functions D^t and D^u in Eq. (3), we separate the image into three regions: shadow, non-shadow, and background (Fig. 2 (b)). Like the previous image segmentation methods [13,16], we ask the user to mark each region using a stroke-based interface, as shown in Fig. 2 (a). Also, to accelerate the region segmentation stage, we segment the image into small super-pixels in the over-segmentation stage [13,16], as shown in Fig. 2 (c).
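The per-pixel data costs of eqs. (4), (5), and (7) can be tabulated for a discrete set of β labels as sketched below. This is a gray-scale simplification with a 256-bin histogram standing in for the 1D per-channel pdf; the variable names and clipping are ours, and the pairwise term of eq. (6) is left to the graph cut solver.

import numpy as np

def data_costs(I, beta_labels, hist_nonshadow, beta0,
               lam_t=1.0, lam_u=1.0, lam_f=1.0):
    # I: gray-scale image in [0, 1]; beta_labels: discrete shadow opacities in (0, 1].
    # hist_nonshadow: 256-bin normalized histogram of the user-marked non-shadow region.
    h, w = I.shape
    costs = np.zeros((len(beta_labels), h, w))
    for li, b in enumerate(beta_labels):
        F = np.clip(I / b, 0.0, 1.0)                            # candidate shadow-free image
        bins = np.minimum((F * 255).astype(int), 255)
        Dt = -np.log(hist_nonshadow[bins] + 1e-6)               # eq. (4): texture likelihood
        Du = np.abs(b - beta0) ** 0.7 + np.abs(b - 1) ** 0.7    # eq. (5): umbra prior
        gy, gx = np.gradient(F)
        Df = gx ** 2 + gy ** 2                                  # eq. (7): smoothness of F
        costs[li] = lam_t * Dt + lam_u * Du + lam_f * Df
    return costs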
3 Interactive Parameter Optimization

In this section, we explain how to interactively update the weighting parameters λ introduced in Eq. (3). Note that, in the initial removal stage, we apply our graph cut shadow removal algorithm with the default weighting parameters. The interactive parameter optimization is described by the following formula:

Λ̂ = argmin_Λ Σ_{p∈Ω_0} |β̂_p − β_μ|²,  s.t. {β̂_p | p ∈ Ω_0} = graph cut(β_p | p ∈ Ω_0; Λ),    (8)
where Λ ≡ {λ^t, λ^u, λ^f, λ^b}. The system automatically updates the parameters Λ so that the shadow image β will be close to the ideal value β_μ. The ideal value is specified from the starting point of the stroke input by the user. The area to be examined is specified by the painted area Ω_0. Considering the trade-off between precision and computation speed, we limit the iterations of Eq. (8) to 4. Eq. (8) represents the case for the shadow image β; the case for the shadow-free image F is similar. The system increases λ^t and λ^f for smooth textures, and increases λ^u and λ^b for constant shadow areas (Fig. 3).
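Conceptually, eq. (8) can be realized by a small search over candidate weight settings, re-running the graph cut and keeping the weights whose result lies closest to the user-indicated value β_μ inside the painted area. The coordinate-wise multiplicative search and the solve_graph_cut callback below are illustrative assumptions, not the authors' implementation.

import numpy as np

def optimize_weights(solve_graph_cut, region_mask, beta_mu, max_iters=4):
    # solve_graph_cut(lams) -> beta map for weights lams = (lam_t, lam_u, lam_f, lam_b).
    lams = np.ones(4)                              # start from the default weights
    def err(l):
        beta = solve_graph_cut(tuple(l))
        return np.sum((beta[region_mask] - beta_mu) ** 2)   # objective of eq. (8) over Omega_0
    best = err(lams)
    for _ in range(max_iters):                     # eq. (8) is iterated a few times
        for i in range(4):                         # coordinate-wise multiplicative search
            for factor in (0.5, 2.0):
                trial = lams.copy()
                trial[i] *= factor
                e = err(trial)
                if e < best:
                    lams, best = trial, e
    return tuple(lams)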
Fig. 3. The image enhanced by user-specified strokes. According to the user’s strokes, the algorithm automatically finds the parameters which reflect the user’s intentions. The strokes are represented by magenta pixels. In this example, 1–3rd strokes are added to the shadow-free image and 4–6th strokes are added to the shadow image. (a) is the input image, (b) is the initial state, and (c)–(h) are the results after the 1–6th strokes.
4 Hierarchical Graph Cut

In order to improve the computation speed of the n-label graph cut, we propose a hierarchical graph cut. The algorithm uses a coarse-to-fine approach to run more quickly than both the α-expansion method [2] and Ishikawa's method [3]. We first explain the benefits of our approach as compared with the previous methods. A hierarchical approach to region segmentation or image restoration has been studied previously [17,18,19,20,21]; however, few methods have employed a hierarchical approach which can be applied to other applications. Juan et al. [22] use an initial value before solving a graph cut to increase the computation speed, but their method is only twice as fast as the α-expansion. A method called LogCut proposed by Lempitsky et al. [23] is much faster, but it requires a training stage before it can be applied. On the other hand, the method proposed by Komodakis et al. [24] does not need any training stage, but it only improves the computation time of the second and subsequent iterations, not the first iteration. We propose a hierarchical graph cut which is faster than the α-expansion when applied to a multi-label MRF. The pseudo-code of the hierarchical graph cut is given in Algorithm 1. In each iteration, the α-expansion solves a 2-label MRF problem, where one label is the current label and the other is denoted α. Our hierarchical graph cut uses multiple "α"s in each
Fig. 4. Graph construction for our graph cut. [x]+ = max(0, x). The nodes p and q are neighboring nodes. The node a is an auxiliary node added in order to set the weights properly. The nodes α and β are the sink and source nodes, respectively. The terminal edges e_pα and e_βp carry the same weights as in the α-expansion, and the edges around the auxiliary node are weighted by combinations of the pairwise cost V(·, ·).
Algorithm 1. Hierarchical graph cut
1:  B ≡ {β_p | p ∈ P} ⇐ initial value
2:  A ⇐ Eq. (9)
3:  success ⇐ 0
4:  for i = 0 to 2 log_2 n − 1 do
5:      g ⇐ null
6:      for all p ∈ P do
7:          α_p ⇐ argmin_{α∈A_i} |β_p − α|
8:          g ⇐ g ∪ graph(α_p, β_p)              // see Fig. 4
9:      end for
10:     for all {p, q} ∈ N do
11:         α_p ⇐ argmin_{α∈A_i} |β_p − α|
12:         α_q ⇐ argmin_{α∈A_i} |β_q − α|
13:         g ⇐ g ∪ graph(α_p, α_q, β_p, β_q)    // see Fig. 4
14:     end for
15:     B′ ⇐ max-flow(B, g)
16:     if E(B′) < E(B) then B ⇐ B′, success ⇐ 1
17: end for
18: if success = 1 then goto 3
iteration (lines 11 and 12 in Algorithm 1). The list of multiple αs is represented by A in line 2, and is defined as:

A_0 = {0},  A_1 = {n/2},  A_{2j} = { (3 + 4k) n / 2^{j+1} }_{k=0}^{2^{j−1}−1},  A_{2j+1} = { (1 + 4k) n / 2^{j+1} }_{k=0}^{2^{j−1}−1},    (9)

where j represents the level of the hierarchical structure, and n represents the number of labels. A represents the hierarchical structure, since the number of elements of A_i increases exponentially (i.e., |A_i| = max(1, 2^{⌊i/2⌋−1})). Our method only requires 2 log_2 n steps for each iteration (line 4 in Algorithm 1) thanks to the hierarchical approach, while the α-expansion needs n steps for each iteration. Although Ishikawa's method does not require any iterations, its computation time is almost the same as that of the α-expansion, since the number of nodes in the computation is n times larger than for the α-expansion.
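For reference, the label sets of eq. (9) can be generated programmatically as sketched below; labels are kept as real values here, and rounding to integer labels (exact when n is a power of two) is an implementation choice of ours.

import math

def hierarchical_label_sets(n):
    # Eq. (9): the candidate label sets A_0 ... A_{2*ceil(log2 n) - 1}.
    levels = math.ceil(math.log2(n))
    sets = [[0.0], [n / 2.0]]
    for j in range(1, levels):
        sets.append([(3 + 4 * k) * n / 2 ** (j + 1) for k in range(2 ** (j - 1))])   # A_{2j}
        sets.append([(1 + 4 * k) * n / 2 ** (j + 1) for k in range(2 ** (j - 1))])   # A_{2j+1}
    return sets

# For n = 8 this returns [[0.0], [4.0], [6.0], [2.0], [3.0, 7.0], [1.0, 5.0]]:
# together the sets cover every label once, and |A_i| grows as max(1, 2^(floor(i/2)-1)).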
5 Experiments

5.1 Hierarchical Graph Cut

First, we experimentally validate the performance of the hierarchical graph cut algorithm in three domains: shadow removal, image restoration, and stereo matching. We used the stereo data sets introduced in [25]. The results shown in Fig. 5 indicate that our algorithm produces similar results to the α-expansion [2] and to Ishikawa's method [3]. Table 1 shows that the hierarchical graph cut is 3 to 16 times (or 5 to 8 times) faster than the α-expansion (or Ishikawa's method). The change in the value of the cost function
Fig. 5. Results of our hierarchical graph cut for stereo matching 1, stereo matching 2, image restoration, and shadow removal (rows). (a), (b), (c), (d), and (e) show the input, the ground truth, the result for Ishikawa's method, the result for the α-expansion, and the result for the hierarchical graph cut, respectively.
Fig. 6. The overall cost of the "stereo matching 2" experiment plotted against computation time [sec] (a) and iteration number (b), for the α-expansion, Ishikawa's method, and the hierarchical graph cut.
Table 1. Computation speed of our h-cut. The first row specifies the experiments. The second and third rows show the number of labels used and the image size. The fourth and fifth rows show the ratio of computation times for x-cut vs. h-cut and a-cut vs. h-cut. The error differences for h-cut minus x-cut and h-cut minus a-cut are shown in the sixth and seventh rows, where a positive value means that the other methods outperform our method. The number of iterations required until convergence is shown in the eighth and ninth rows for a-cut and h-cut; x-cut does not need iterations. The memory size required is shown in the tenth, eleventh, and twelfth rows for x-cut, a-cut, and h-cut. [notations] x-cut: Ishikawa's exact optimization. a-cut: α-expansion. h-cut: Hierarchical graph cut.

Problem                        Stereo matching 1   Stereo matching 2   Image restoration   Shadow removal
Labels                         128                 128                 256                 64
Image size                     543 × 434           480 × 397           256 × 256           640 × 480
Speed-up vs. x-cut             ×5.2                ×5.9                ×8.0                ×6.4
Speed-up vs. a-cut             ×6.8                ×11.4               ×16.6               ×3.4
Error difference vs. x-cut     +5.0%               +3.2%               +0.6%               +0.0%
Error difference vs. a-cut     +4.6%               +3.0%               −0.4%               +0.0%
Iterations (a-cut)             8                   7                   10                  4
Iterations (h-cut)             10                  5                   7                   6
Allocated memory (x-cut)       6,748 MB            5,537 MB            5,306 MB            4,538 MB
Allocated memory (a-cut)       106 MB              111 MB              65 MB               482 MB
Allocated memory (h-cut)       106 MB              120 MB              64 MB               482 MB
at the first iteration is large, while it is negligible after the second iteration, as shown in Fig. 6. The disadvantage of the hierarchical graph cut is that the result depends on the initial value when there are many local minima, as shown in the stereo matching result. Due to its fast computation speed, we use the hierarchical graph cut in our shadow removal algorithm instead of the α-expansion and Ishikawa's method.

5.2 Shadow Removal

Natural images. Our shadow removal results are shown in Fig. 7. The shadows were removed effectively while the complex textures of the images were preserved. We express the shadow opacity using 64 discrete values, but these results show that this discretization does not cause any strong defects. The number of user interactions required for parameter optimization is listed in Table 2. The system displays the output image at a responsive speed.

Aerial images. In aerial images, the shadows of buildings fall both on the ground and on neighboring buildings. Neighboring aerial images are often taken at different times, so that when they are stitched together, there may be a seam where the different images meet. Thus, it is important to remove the shadows in aerial images, as shown in Fig. 8.

Evaluation. In Fig. 9, we show how our method benefits from the user interaction. The results are evaluated quantitatively using the ground truth. Our results improve
Fig. 7. Our shadow removal results. The first and fourth columns show the input images, the second and fifth columns show the shadow-free images, and the third and sixth columns show the shadow images.

Table 2. Computation time for our shadow removal. The first column gives the images from Fig. 7. The second column shows the size of each image. The third column shows the number of user interactions used for parameter optimization. The fourth column shows the computation time for each user interaction.

Image         Image size        Strokes†   Average time per stroke
(a) wall      640 × 480 [px]    8          1.5 [sec]
(b) man       320 × 240 [px]    4          1.2 [sec]
(c) grass     640 × 480 [px]    N/A        N/A
(d) family    640 × 480 [px]    1          1.4 [sec]
(e) women     640 × 480 [px]    12         2.0 [sec]
(f) horse     640 × 480 [px]    6          2.1 [sec]
(g) statue    320 × 240 [px]    15         1.9 [sec]
(h) wine      640 × 320 [px]    40         2.8 [sec]

† = Number of strokes for parameter optimization.
Fig. 8. Application to aerial images. The input image and the shadow-free image are shown.
Fig. 9. Comparison between our method, Finlayson's method, and Wu's method on two examples, (a) and (b). The root mean square error (RMSE) is plotted against the number of strokes used for parameter optimization and is calculated by comparison with the ground truth. The solid line represents our results and the dashed lines represent Finlayson's results and Wu's results.
gradually when the user interacts with the system. The computation times for the results shown in Fig. 9 (a) using a 3 GHz desktop computer are 21 [sec], 336 [sec], and 65 [sec] for the main part of the algorithm for Finlayson’s method [6], Wu’s method [12], and our method until convergence, respectively; while the computation times for Fig. 9 (b) are 8 [sec], 649 [sec], and 53 [sec].
6 Conclusions and Discussions

We present a method for user-assisted shadow removal from a single image. We have expressed the shadow opacity with a multi-label MRF and solved it using a hierarchical graph cut. Our hierarchical graph cut algorithm allows the system to run at interactive speeds. The weighting parameters for each cost term are automatically updated using an intuitive user interface. In order to robustly remove the shadows, we have defined several cost terms. Our system does not work well for images that deviate strongly from what these cost terms can represent. Using user-supplied hints, the coefficients of each cost term are adjusted, and the method can be applied to both hard shadows and soft shadows. The hierarchical graph cut solves multi-label MRF problems 3 to 16 times faster than the α-expansion [2] and Ishikawa's graph cut [3]. One limitation of this graph cut method is that it requires an initial value. A good initial value is usually available in most computer vision applications, so in most cases this is not a problem.

Acknowledgments. The authors thank Tai-Pang Wu, Jun Takamatsu, and Yusuke Sugano for useful discussions.
References
1. Nielsen, M., Madsen, C.B.: Graph cut based segmentation of soft shadows for seamless removal and augmentation. In: Proc. of Scandinavian Conf. on Image Anal. (SCIA), pp. 918–927 (2007)
2. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Patt. Anal. and Mach. Intell. 23(11), 1222–1239 (2001)
3. Ishikawa, H.: Exact optimization for markov random fields with convex priors. IEEE Trans. on Patt. Anal. and Mach. Intell. 25(10), 1333–1336 (2003)
4. Weiss, Y.: Deriving intrinsic images from image sequences. In: Proc. of Int'l Conf. on Comp. Vis. (ICCV), vol. 2, pp. 68–75 (2001)
5. Matsushita, Y., Nishino, K., Ikeuchi, K., Sakauchi, M.: Illumination normalization with time-dependent intrinsic images for video surveillance. IEEE Trans. on Patt. Anal. and Mach. Intell. 26(10), 1336–1347 (2004)
6. Finlayson, G.D., Drew, M.S., Lu, C.: Intrinsic images by entropy minimization. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 582–595. Springer, Heidelberg (2004)
7. Fredembach, C., Finlayson, G.: Simple shadow removal. In: Proc. of Int'l Conf. on Patt. Recog. (ICPR), pp. 832–835 (2006)
8. Tappen, M.F., Freeman, W.T., Adelson, E.H.: Recovering intrinsic images from a single image. IEEE Trans. on Patt. Anal. and Mach. Intell. 27(9), 1459–1472 (2005)
9. Baba, M., Mukunoki, M., Asada, N.: Shadow removal from a real image based on shadow density. In: ACM SIGGRAPH Posters, p. 60 (2004)
10. Arbel, E., Hel-Or, H.: Texture-preserving shadow removal in color images containing curved surfaces. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR) (2007)
11. Wu, T.P., Tang, C.K.: A bayesian approach for shadow extraction from a single image. In: Proc. of Int'l Conf. on Comp. Vis. (ICCV), vol. 1, pp. 480–487 (2005)
12. Wu, T.P., Tang, C.K., Brown, M.S., Shum, H.Y.: Natural shadow matting. ACM Transactions on Graphics 26(2), 8 (2007)
13. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. In: Proc. of ACM SIGGRAPH, pp. 303–308 (2004)
14. Barrow, H.G., Tenenbaum, J.M.: Recovering intrinsic scene characteristics from images. In: Computer Vision Systems, pp. 3–26 (1978)
15. Levin, A., Zomet, A., Weiss, Y.: Separating reflections from a single image using local features. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 306–313 (2004)
16. Rother, C., Kolmogorov, V., Blake, A.: Grabcut — interactive foreground extraction using iterated graph cuts. In: Proc. of ACM SIGGRAPH, pp. 309–314 (2004)
17. D'Elia, C., Poggi, G., Scarpa, G.: A tree-structured markov random field model for bayesian image segmentation. IEEE Trans. on Image Processing 12(10), 1259–1273 (2003)
18. Feng, W., Liu, Z.Q.: Self-validated and spatially coherent clustering with net-structured mrf and graph cuts. In: Proc. of Int'l Conf. on Patt. Recog. (ICPR), pp. 37–40 (2006)
19. Lombaert, H., Sun, Y., Grady, L., Xu, C.: A multilevel banded graph cuts method for fast image segmentation. In: Proc. of Int'l Conf. on Comp. Vis. (ICCV), vol. 1, pp. 259–265 (2005)
20. Nagahashi, T., Fujiyoshi, H., Kanade, T.: Image segmentation using iterated graph cuts based on multi-scale smoothing. In: Proc. of Asian Conf. on Comp. Vis. (ACCV), pp. 806–816 (2007)
21. Darbon, J., Sigelle, M.: Image restoration with discrete constrained total variation. J. Math. Imaging Vis. 26(3), 261–276 (2006)
22. Juan, O., Boykov, Y.: Active graph cuts. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 1023–1029 (2006)
23. Lempitsky, V., Rother, C., Blake, A.: LogCut - efficient graph cut optimization for markov random fields. In: Proc. of Int'l Conf. on Comp. Vis. (ICCV) (2007)
24. Komodakis, N., Tziritas, G., Paragios, N.: Fast, approximately optimal solutions for single and dynamic MRFs. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR) (2007)
25. Scharstein, D., Pal, C.: Learning conditional random fields for stereo. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR) (2007)
Visual Saliency Based on Conditional Entropy Yin Li, Yue Zhou, Junchi Yan, Zhibin Niu, and Jie Yang Institute of Image Processing and Pattern Recognition, Department of Automation, Shanghai Jiaotong University {happyharry,zhouyue,yanesta,zbniu,jieyang}@sjtu.edu.cn
Abstract. By the guidance of attention, human visual system is able to locate objects of interest in complex scene. In this paper, we propose a novel visual saliency detection method - the conditional saliency for both image and video. Inspired by biological vision, the definition of visual saliency follows a strictly local approach. Given the surrounding area, the saliency is defined as the minimum uncertainty of the local region, namely the minimum conditional entropy, when the perceptional distortion is considered. To simplify the problem, we approximate the conditional entropy by the lossy coding length of multivariate Gaussian data. The final saliency map is accumulated by pixels and further segmented to detect the proto-objects. Experiments are conducted on both image and video. And the results indicate a robust and reliable feature invariance saliency. Keywords: Saliency Detection, Conditional Entropy, Lossy Coding Length.
1
Introduction
In the real world, human visual system demonstrates remarkable ability in locating the objects of interest in cluttered background, whereby attention provides a mechanism to quickly identify subsets within the scene that contains important information. Besides the scientific goal of understanding this human behavior, a computational approach to visual attention also contributes to many applications in computer vision, such as scene understanding, image/video compression, or object recognition. Our ultimate goal is to develop a computational model to locate objects of interest in different scenes. The human visual system follows a center-surround approach in early visual cortex [1,2]. When the surrounding items resemble close to the central item, the neuron response to the central location is suppressive modulated; but when they differ grossly, the response is excitatory modulated [3]. Therefore, the regions in the visual field, which most differ from their surroundings in certain low level features would automatically pop up. The candidates popped up here are called the proto-object [4,5], which could be further grouped into coherent units to be perceived as real objects. How can a computer vision system benefit from this procedure? H. Zha, R.-i. Taniguchi, and S. Maybank (Eds.): ACCV 2009, Part I, LNCS 5994, pp. 246–257, 2010. c Springer-Verlag Berlin Heidelberg 2010
Visual Saliency Based on Conditional Entropy
247
Many efforts have been focused on this topic. Most of them follow a center surround and bottom up approach [6,7,8,9,10]. Typically, these methods are based on a set of biological motivated features, followed by center-surround operations that enhance the local stimuli, and finally a combination step leading into a saliency map. In [11], a model of overt attention with selection based on self information is proposed, where the patches of a image are decomposed into a set of pre-learned basis and kernel density estimation is used to approximate the self-information. In [12,13], saliency models based on spectral information are proposed and provide a novel solution. Inspired by the aforementioned research, we propose to measure the saliency as the the minimum conditional entropy [14], which represents the uncertainty of the center-surround local region, when the surrounding area is given and the perceptional distortion is considered. The minimum conditional entropy is further approximated by the lossy coding length of Gaussian data, which balances between complexity and robustness. The final saliency map accumulated by pixels is an explicit representation of the proto-objects. Thus, we simply segment the proto-objects by thresholding. Fig.1 demonstrates the procedure of our method.
Fig. 1. The procedure of our method: original image, saliency map and possible proto objects (from top to bottom)
Overall, we summarize our main contribution into two folds. First, the perceptional distortion is introduced. And the conditional entropy under the distortion is proposed to measure the visual saliency. Second, due to the strictly local approach, our method is insensitive to ego-motion or affine transform, applicable to both image and video, and free from prior knowledge or pre-training. The paper is organized as follows. A local saliency model based on conditional entropy is presented in Section 2. In Section 3, the saliency map is extended to the proto-object detection. Experimental evaluations and result analysis are performed in Section 4. Finally, Section 5 concludes the paper.
248
2 2.1
Y. Li et al.
Conditional Saliency Lossy Coding Length of Multivariate Gaussian Data
We first present a brief introduction to the lossy coding length of multivariate Gaussian data [15], which is used to calculate the visual saliency in the following section. Assume a set of vectors W = (w1 , w2 , ..., wm ) ∈ Rn×m is presented, a lossy coding scheme L(·) maps W to a sequence of binary bits. Thus, the original vectors can be recovered up to an allowable distortion E[wi − w˜i 2 ] ≤ ε2 . If the data are i.i.d. samples from a multivariate Gaussian distribution, the length of the encoded sequence is denoted by L(W ) n ¯ ¯T n μt μ . m+n Lε (W ) = log2 det(I + W W ) + log (1 + ) (1) 2 2 mε2 2 ε2 m 1 ¯ where μ = m 1 wi and W = [w1 − μ, w2 − μ, ..., wm − μ]. Note that the first term of the equation stands for the coding length required ¯ , and the second term of the equation stands for the additional coding to code W length of the mean vector. The equation also gives a good upper bound for degenerated Gaussian data or subspace-like data [15]. Moreover, the coding length has proven to be effective for clustering [16] and classification [15]. 2.2
Saliency as Conditional Entropy
One basic principle in visual system is to suppress the response to frequently occurring input patterns, while at the same time keeping sensitive to novel input patterns. Thus, we could define the saliency of the center area as the uncertainty of the center-surround local region when the surrounding are is provided. Such uncertainty is naturally measured by the conditional entropy in the information theory. Assume image I is divided into small spatial patches. Denote c ∈ I as a patch of image I. Let S(c) = [s1 , s2 , ...s m ] be the spatial or spatio-temporal surrounding patches of c and SC(c) = S(c) c be the center-surround area, overlapping is allowed for surroundings in order to capture the structure information of the image (Fig.2). We further assume a feature transform F is performed across the image. Let vector x = F c ∈ Rn identify the center patch by sticking its columns. Let S = F S = [F s1 , F s2 , ...F sm ] ∈ Rn×m and SC = F SC(c) identify the surroundings and center-surround area, respectively. ˜ S˜ Suppose distortion ε exists across the local region, i.e. what we perceive x, is not exact what what we see x,S. Thus, the input of our method is some latent ˜ rather than x, S and SC. Even if we do not know the ˜ S˜ and Sc variable x, latent variable, we can expect a certain constrain i.e. the tolerance of perception: ˜ 2 ] ≤ ε2 E[x − x E[F si − F˜si 2 ] ≤ ε2
(2)
The distortion ε stands for the tolerance of perception. Two reasons account for ε: 1)there exists distortion during the acquisition of the image, thus the
Visual Saliency Based on Conditional Entropy
249
Fig. 2. Our center-surround architecture: overlapping surroundings of the center patch
observed data contains noise; 2)there exists distortion during the perception of the human vision system: we are still able to recognize the objects of interest even if the distortion is severe [2]. We believe it is the first time that the perceptional tolerance of distortion is considered in visual saliency. Inspired by the center-surround modulation in human vision system [3], we ˜ under ˜ S) define the local saliency as the minimum conditional entropy H(x| distortion ε: inf
˜ ˜ S) Qx| ˜ (x| ˜ S
˜ s.t. E[x − x ˜ S) ˜ 2 ] ≤ ε2 H(x| (3) E[F si − F˜si 2 ] ≤ ε2
˜ is the conditional entropy by where H(˜ x|S) ∞ ∞ ˜ ˜ ˜ (S) ˜ log2 (Q ˜ (x| ˜ ˜ S˜ ˜ S) = − ˜ S)P H(x| Qx| ˜ (x| ˜S ˜ S ˜ S))dxd S x| −∞
−∞
(4)
˜ − H(S) ˜ = H(SC) ˜ is the conditional probability density function, H(·) is the information ˜ S) Qx| ˜ (x| ˜S entropy [14].The equation (3) is an extension to the concept of rate-distortion function [14] in information theory. ˜ bits of information: we Intuitively, the whole local region contains H(SC) ˜ need H(SC) bits of information to reconstruct its state. If we know the sur˜ bits of information, and the local region roundings S, we have gained H(S) ˜ ˜ S) bits remaining of uncertainty. The larger the uncertainty is, the still has H(x| ˜ = 0 if and only if the center ˜ S) more salient the center would be, vice verse. H(x| c can be completely determined by its surroundings S(c), which indicates high ˜ = H(x) ˜ S) ˜ if and only if x and S are drawn from similarity. Conversely, H(x| two independent random variables, which indicates high saliency.
250
2.3
Y. Li et al.
Gaussian Conditional Saliency
The conditional saliency model is simple in concept, intuitive, and highly flexible. However, the equation (3) is computational intractable. Instead of calculating (3) directly, we approximate the entropy by the lossy coding length. We could simplify the problem by making the following assumption: Assumption:
S and SC are both multivariate Gaussian data.
Extensive research on the statistics of natural images has shown that such a density can be approximated by Generalized Gaussian Distribution [17]. For simplicity, we use a Gaussian distribution instead. Although such a simplified assumption is hardly satisfied in real situations, experimental results indicate a robust solution to the visual saliency. In fact, the assumption can be extended to degenerated Gaussian or sub space like data [15]. By this assumption, the lossy coding length Lε (SC) and Lε (S) are reasonable estimations for H(˜(SC)) and H(˜(S)) under the distortion ε, since the optimal coding length is exactly the information entropy H(·) [14]. Thus, the minimum of the conditional entropy can be approximated by ˜ = H(SC) ˜ − H(S) ˜ ˜ S) H(x| . = Lε (SC) − Lε (S)
(5)
where Lε (S) and Lε (SC) are given by m+n n ¯ ¯T n μts μs log2 det(I + S S ) + log (1 + ) 2 2 mε2 2 ε2 t m+n+1 n ¯ SC ¯ T ) + n log (1 + μsc μsc ) Lε (SC) = log2 det(I + SC 2 2 (m + 1)ε2 2 ε2 (6) Lε (S) =
μs and μsc are the mean vector of S and SC respectively. And S¯ = [F s1 − ¯ = [x − μsc , F s1 − μsc , ..., F sm − μsc ]. μs , ..., F sm − μs ], SC The final saliency of the center can be calculated by sal(c, S(c)) = Lε (SC) − Lε (S)
(7)
The only user defined parameter of our model is the distortion ε, which is discussed in Section 4. Note that the equation (7) corresponds to the incremental coding length in [16], where the coding length is used for classification. Since overlapping is allowed in the surroundings, we always have n < m. Thus, the sal(c, S(c)) can be computed in O(n2 ) time for each patch c. The total time of our algorithm is O(Kn2 ), where K is the number of patches in the image. Moreover, the parallel nature of the our algorithm is similar to the early human visual system.
Visual Saliency Based on Conditional Entropy
3
251
Proto-Object Detection from Saliency Map
The saliency map is an explicit representation of the proto-object, we adopt a simple threshold to segment the proto-object. Given the saliency map sal(I) of image I, we get the proto object map P (I) by 1 if sal(I) ≥ threshold P (I) = (8) 0 otherwise We set threshold = 3 ∗ E(sal(I)) according to our pre-experiments, where E(sal(I)) is the average value of the saliency map. 3.1
The Role of Features
Traditional center surround saliency deals with a set of biology motivated features, such as intensity, orientation, color and motion [6,7,8,9,10]. In general, dozens of feature channels are extracted, and finally combined into a master saliency map. Nevertheless, during the experiment of our method, we find the method insensitive to the features. Though dozens of features may slightly improve the performance in some images, our method provides a fairly good solution with only the pixel value of the image/video. Fig.3 demonstrate a comparison of the features and raw data.
Fig. 3. Though feature integration may improve the performance(see the first row), it may introduce some new problems(see second and third rows)
Certain feature invariance may be explained by the invariance of the coding length under some transforms, e.g. the orthogonal transform will surely preserve the coding length [15]. It may indicate that the traditional feature integration introduce extra redundancy for visual saliency. Further experiment could be found in Section 4. And the reason behind this phenomenon is left as our future work.
252
4
Y. Li et al.
Experiment and Result Analysis
To evaluate the performance of our method, three different kinds of experiments including both image and video are conducted. And the results are compared to the main stream approaches in the field [6,11,12]. We set the saliency map of our method at 160 × 120 in all experiments. The center patch is set 4 × 4 pixel, where its surroundings are set 7 times the size of the center. For simplicity, we also normalize the pixel value into [0 1]. All the tests are executed in Matlab on the platform of linux. 4.1
Evaluation of the Results
One problem with visual saliency in general, is the difficulty in making impartial quantitative comparisons between methods. Since our goal is to locate the objects of interest, the key factor is the accuracy, which could be shown as the Hit Rate(HR) and False Alarm Rate(FAR). The candidate of correct object are labeled by affiliated volunteers from a voting strategy. For each input image I(x), we have a corresponding hand-labeled binary image O(x), in which 1 denotes target objects while 0 denotes background. Given the saliency map S(x), the Hit Rate(HR) and False Alarm Rate(FAR) could be obtained by HR = E(O(x) · S(x)) F AR = E((1 − O(x)) · S(x))
(9)
A good saliency detection should have a high HR together with a low FAR. However, if we adjust the saliency map sal(I) by C · sal(I), we would get C · HR and C · F AR consequently. Therefore, we evaluate the accuracy by the following accuracy rate(AR): HR AR = (10) F AR Higher AR indicates a high HR v.s a low FAR, vice verse. 4.2
Natural Images
In order to give a comprehensive comparison among the methods, 62 natural images are used as a test set, which are also used in [12,18]. First, we tested our method without features, namely using only the pixel values of the images. Next, we tested our method within the framework of feature integration in [6]. The only parameter of our method is the distortion ε. Fig.4 shows the effect of distortion in both situations. As we mentioned in Section 3, the features may even slightly decrease the performance of our method in general. The results also indicate that when ε is sufficiently large, it does little positive effect on the final results. Therefore, we set ε2 = 0.6n in the following experiments. Fig.5 presents some of our detection results.
Visual Saliency Based on Conditional Entropy
253
Fig. 4. Quantitative analysis of ε2 : When the distortion is around 0.6, the experiment shows a high accuracy rate. And it also indicates that feature integration would generally deteriorate the performance in our method.
We also compare our conditional saliency(CS) with Itti’s theory [6], self information [11] and spectral residues [12](see Fig.6). For Itti’s method, all the parameter are set as default [6]. For self information, patches with size 8 × 8 are used. Also 200 natural images are used for training. Furthermore, we set the resolution of its input at 320 × 240 and resize the output into 160 × 120, in order to avoid the boundary effect [11]. For spectral residues, we resize the input into 160 × 120, since its original input resolution is 64 × 64. From Fig. 5 and Fig. 5, we can see that our method generally produce a more acute and discriminant result than previous methods [6,11,12]. Moreover, the method tends to preserve the boundary information, which is beneficial in further perceptional grouping in the post attention stage. That is to say, the proto objects by our method is easier to be grouped into real objects. Table 1 compares the performance in a quantitative way as mentioned in Section 4.1, our method outperforms state-of-the-art methods in the field. The introduction of the perceptional distortion should be responsible for the performance. Table 1. Comparison between [6,11,12] and our method. Note that to achieve a high accuracy rate, the method in [11] is trained by 200 natural images in a higher resolution(320 × 240). Itti’s method Spectral residue Self information CS with features CS without features 1.7605 3.2092 3.5107 3.5224 3.5295
254
Y. Li et al.
Fig. 5. Detection results of our method: our method is able to capture the possible proto objects in high accuracy
4.3
Psychological Patterns
We also test our method on psychological patterns, which are adopted in a series of attention experiments to explore the mechanisms of pre-attentive visual search [19]. The preliminary results can be found in Fig.7. Our method is able to deal with simple psychological patterns. 4.4
Video Sequences
Two videos collected from the internet with the resolution 160 × 120 are used to test the performance of our method. One has a stationary background with moving pedestrians and vehicle; the other one has a highly dynamic background with surfers riding a wave. The center-surround architecture provides our method insensitivity to ego-motion. The parameter is similar to the natural image, except that extra 10 frames are used as the spatio-temporal surroundings of the center patch. Fig.8 presents the results of our methods. Our method shows pretty good performance.
Visual Saliency Based on Conditional Entropy
255
Fig. 6. Comparison of our method with [6,11,12],the first column is the original image, the second column is the hand-labeled image, the third to sixth column are results produced by [6,11,12] and our method, respectively.
256
Y. Li et al.
Fig. 7. Saliency map produced by our method on psychological patterns
Fig. 8. Saliency map produced by our method in videos with different scenes
5
Conclusion and Future Work
In this paper, we propose a novel saliency detection method based on conditional entropy under distortion. Though the visual saliency is still defined locally, we extend the concept of rate-distortion function to measure the saliency of the local region, where perceptional tolerance is considered. For the future work, we would investigate the roles of different features and try to find the reason behind the feature invariance in our method. Moreover, we would like to further explore the application of our method in computer vision.
Acknowledgement The authors would like to thank the reviewers for their valuable suggestions. This research is partly supported by national science foundation, China, No.60675023, No.60772097; China 863 High-Tech Plan, No.2007AA01Z164.
References 1. Cavanaugh, J.R., Bair, W., Movshon, J.A.: Nature and interaction of signals from the receptive field center and surround in macaque v1 neurons. Journal of Neurophysiol. 88, 2530–2546 (2002)
Visual Saliency Based on Conditional Entropy
257
2. Allman, J., Miezin, F., McGuinness, E.: Stimulus specific responses from beyond the classical receptive field: neurophysiological mechanisms for local-global comparisons in visual neurons. Annual Review of Neuroscience 8, 407–430 (1985) 3. Vinje, W.E., Gallant, J.L.: Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287(5456), 1273–1276 (2000) 4. Rensink, R.A.: Seeing, sensing, and scrutinizing. Vision Research 40(10-12), 1469– 1487 (2000) 5. Rensik, R.A., Enns, J.T.: Preemption effects in visual search: Evidence for low-level grouping. Psychological Review 102, 101–130 (1995) 6. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998) 7. Itti, L., Koch, C.: A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research 40(10-12), 1489–1506 (2000) 8. Itti, L., Baldi, P.: A principled approach to detecting surprising events in video. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2005, June 2005, vol. 1, pp. 631–637 (2005) 9. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural Information Processing Systems 19, pp. 545–552. MIT Press, Cambridge (2007) 10. Gao, D., Vasconcelos, N.: Bottom-up saliency is a discriminant process. In: ICCV, pp. 1–6 (2007) 11. Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: Advances in Neural Information Processing Systems 18, pp. 155–162. MIT Press, Cambridge (2006) 12. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2007, June 2007, pp. 1–8 (2007) 13. Mahadevan, V., Vasconcelos, N.: Background subtraction in highly dynamic scenes. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2008, June 2008, pp. 1–6 (2008) 14. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379C423, 623C656 (1948) 15. Ma, Y., Derksen, H., Hong, W., Wright, J.: Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(9), 1546–1562 (2007) 16. Wright, J., Tao, Y., Lin, Z., Ma, Y., Shum, H.Y.: Classification via minimum incremental coding length (micl). In: Advances in Neural Information Processing Systems, vol. 20, pp. 1633–1640. MIT Press, Cambridge (2008) 17. Mallat, S.: A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (1989) 18. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Networks 19(9), 1395–1407 (2006) 19. Wolfe, J.: Guided search 2.0, a revised model of guided search. Psychonomic Bulletin and Review 1(2), 202–238 (1994)
Evolving Mean Shift with Adaptive Bandwidth: A Fast and Noise Robust Approach Qi Zhao, Zhi Yang, Hai Tao, and Wentai Liu School of Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064 {zhaoqi,yangzhi,tao,wentai}@soe.ucsc.edu
Abstract. This paper presents a novel nonparametric clustering algorithm called evolving mean shift (EMS) algorithm. The algorithm iteratively shrinks a dataset and generates well formed clusters in just a couple of iterations. An energy function is defined to characterize the compactness of a dataset and we prove that the energy converges to zero at an exponential rate. The EMS is insensitive to noise as it automatically handles noisy data at an early stage. The single but critical user parameter, i.e., the kernel bandwidth, of the mean shift clustering family is adaptively updated to accommodate the evolving data density and alleviate the contradiction between global and local features. The algorithm has been applied and tested with image segmentation and neural spike sorting, where the improved accuracy can be obtained at a much faster performance, as demonstrated both qualitatively and quantitatively.
1
Introduction
Mean shift (MS) and blurring mean shift (BMS) are nonparametric density based clustering algorithms that have received recent attention [1,2]. Inspired by the Parzen window approach to nonparametric density estimation, both algorithms do not require prior knowledge of cluster numbers, and do not assume a prior model for the shape of the clusters. The bandwidth parameter, however, is the single and critical parameter that may significantly affect the clustering results. Several works [3,4] have recognized the sensitivity of the mean shift and blurring mean shift algorithms to the kernel bandwidth. When the local characteristics of the feature space differ across data, it is difficult to find an optimal global bandwidth [2]. Adopting locally estimated bandwidth has theoretical merits of improving the qualify of density estimate. These methods, however, are heavily relied on the training algorithms, and could result in poor performance if local bandwidths are inappropriately assigned. [4] calculates the bandwidth through a sample point estimate, and the algorithm works well with moderate training procedures. More sophisticated bandwidth estimation method incorporating the input data is reported in [5], with an increased computational complexity and manual efforts from domain experts. The speed of the mean shift algorithm is heavily dependent on the density gradient of the data. In case the feature space includes large portions of flat H. Zha, R.-i. Taniguchi, and S. Maybank (Eds.): ACCV 2009, Part I, LNCS 5994, pp. 258–268, 2010. c Springer-Verlag Berlin Heidelberg 2010
EMS with Adaptive Bandwidth: A Fast and Noise Robust Approach
259
plateaus where density gradient is small, the convergence rate of the mean shift procedure is low [1]. The problem is inherent, as the movements of data points are proportional to the density gradient. The blurring mean shift algorithm [6] was proposed to accelerate the convergence rate by moving all data points at each iteration. A notorious drawback of blurring mean shift is that a direction of larger variance converges more slowly rather than the reverse; as a result, blurring mean shift frequently collapses a cluster into a “line” by taking a number of iterations. After that, the blurring mean shift algorithm converges data slowly to the final state and may break the “line” into many segments. In this paper, we present a new clustering algorithm that incorporates the mean shift principle, but is inherently different from the existing mean shift based algorithms. The main novelties of our algorithm are described as follows. First, we use an energy function to describe the data points in terms of compactness. This offers a quantitative way to measure the clustering status. Second, unlike the mean shift algorithm [2] where the data points are static, or the blurring mean shift method [6,1] where all the data points are updated in each iteration, the evolving mean shift algorithm moves one selected point with the largest energy reduction at each iteration. As a result, the evolving mean shift procedure converges at an exponential rate as discussed in section 3.5. Third, the evolving mean shift algorithm automatically handles noisy data early to prevent them misleading the clustering process of other data. Lastly, the bandwidth estimation from the sample point estimators [3,4] is applied for initialization. Unlike blurring mean shift, the bandwidth estimation in the evolving mean shift algorithm is data-driven and adaptively updated.
2
Energy Function
An energy function is defined to evaluate the compactness of the underlying dataset. Formally, given a dataset X = {xi }i=1...N of N points, the energy of X is defined as the sum of energy from individual point xi |i=1...N as E(X) = where Exi =
N
N j=1,j =i
i=1
Exi ,
(Exj .xi + Exi .xj ).
(1)
(2)
In this equation, Exj .xi is the energy contributed by point xj to point xi with kernel K(x) and bandwidth hxi , Exj .xi = f (hxi )(K(0) − K(
xi − xj )), hxi
(3)
where K(x) is an arbitrary isotropic kernel with a convex profile k(x), i.e., it satisfies K(x) = k(|x|2 ) and k(x1 ) − k(x2 ) ≥ k (x2 )(x1 − x2 ). Without loss of x −x generality, we set k(0) = 1 and Eq. 3 reduces to Exj .xi = f (hxi )(1 − K( ihx j )). i f (hxi ) is a shrinking factor that is designed to be a monotonically increasing
260
Q. Zhao et al.
function of bandwidth hxi , as will be discussed in section 3.3. It is worthy mentioning that after assigning an initial global bandwidth h0 , bandwidth h becomes independent to the user and is trained by the evolving density estimates. Let f (0) = 0 and it is straightforward to verify that the energy definition satisfies (1) E(X) ≥ 0; (2) E(X) = 0 when fully clustered.
3
The Evolving Mean Shift (EMS) Clustering Algorithm
We outline the evolving mean shift clustering algorithm as follows: Algorithm 1. The EMS Clustering Algorithm Input: A set of data points X k , where k is the index and is initialized to 0 Output: A clustered set of data points XEMS – Select one data point xki ∈ X k whose movement could substantially reduce the energy as defined in Eq. 1. Point selection is discussed in section 3.1. −−−−→ – Move xki according to the EMS vector, specifically xk+1 = xki + EM Sxk . i – Compute the updated bandwidth hk+1 for point xk+1 according to Algoxi i rithm 2, and adjust the EMS vectors for all points using Eq. 4. – If E(X k ) satisfies the stopping criterion, stop; otherwise, set k ← k + 1 and go to the 1st step.
As will be proven in section 3.2, moving a data point according to the EMS vector lowers the total energy. After each movement, the bandwidth is updated as will be described in section 3.3. The iterative procedure stops when the underlying feature space satisfies the criterion given in section 3.4, and section 3.5 proves the exponentialconvergence N rate of the EMS algorithm. In this section, we use for a moment j for =i j=1,j =i to keep the formulations concise. 3.1
Point Selection
Selecting a point with the largest energy reduction for moving has several important benefits. First, it avoids operations of data that lead to small energy reduction (e.g. data points in plateau regions); therefore, requires less iterations. Second, it efficiently pushes loosely distributed points toward a localized peak, which prevents them being absorbed into nearby clusters with larger densities. Third, noisy data tend to be selected therefore processed at an early stage. To select a data point with the largest energy reduction, at the initialization stage, the EMS vector is computed for each data point. Each following iteration moves a selected point to a new location according to the EMS vector, updates its bandwidth according to Algorithm 2 (section 3.3) and adjusts the EMS vectors for all the data points. Based on the adjusted EMS vectors, a new point corresponding to the largest energy reduction is selected for the next iteration.
EMS with Adaptive Bandwidth: A Fast and Noise Robust Approach
261
Because of this point selection scheme and a bandwidth updating scheme as explained in section 3.3, EMS does not have convergence bias to directions and avoids over-partitioning a cluster into many small segments as blurring mean shift does. Consequentially, EMS can work well without a post combining procedure. A comparison of blurring mean shift and EMS is presented in Figure 1. 3.2
EMS Vector and Energy Convergence
Recall that the energy associated with a selected point xi is defined in Eq. 2 as E xi =
(Exj .xi + Exi .xj ) =
j=i
[f (hxi )(1 − K(
j=i
xi − xj xj − xi )) + f (hxj )(1 − K( ))]. hxi hxj
The gradient of Exi with respect to xi can be obtained by exploring the linearity xj −xi 2 xi −xj 2 xj f (hx )g(| | ) xj f (hx )g(| | ) j hx i hx j i ( + ) 2 h2 h xi xj j=i xj −xi xi −xj 2 2 f (hxi )g(| hxi | ) f (hxj )g(| hxj | ) ( + ) h2 h2 x xj j=i i
of kernel K(x) as ∇Exi = −2[
f (hxj ) f (hxi ) xi −xj 2 x −x g(| jhx i |2 )]. j =i [ h2x g(| hxi | )+ h2x j i
− xi ] ×
The first bracket contains the sample
j
evolving mean shift vector −−−−−→ EM Sxi =
(
xj f (hxi )g(| h2x
j =i
(
xi −xj 2 | ) hx i
h2x
i
xi −xj |2 ) hx i 2 hx i
f (hxi )g(|
j =i
+
xj f (hxj )g(|
+
f (hxj )g(|
xj −xi 2 | ) hx j
j
xj −xi |2 ) hx j
h2x
) − xi .
(4)
)
j
As will be proven in Theorem 1, moving the point along the EMS vector with length no larger than twice of the EMS vector magnitude, the energy reduces. Theorem 1. Energy is reduced by moving the selected point according to the EMS vector.
Proof. After the selected point xi moves to xi , the energy associated with xi is Ex = i
j =i
(Ex .xj + Exj .x ). i
i
(5)
In this proof, we assume that the bandwidths of all the data points remain static. The cases with adaptive bandwidth are validated in section 3.3. Without loss of generality, let xi = 0. Applying the energy definition (Eq. 2) for xi and xi , and considering the convexity of the kernel profile k(x), the energy change of the dataset X is E(X) = Ex − Exi ≤ i
f (hx ) f (hxj ) xj 2 xj 2 ( 2 i g(| | )+ g(| | ))(|xi |2 − 2xi xj ). 2 hxi hx i hxj hx j j =i
(6)
262
Q. Zhao et al.
Applying the definition of EMS vector for xi (Eq. 4) results E(X) = ( Since
j =i
(
f (hxj ) − f (hxi ) xj 2 xj 2 −−−−→ g(| | )+ g(| | )))(|xi |2 − 2xi EM Sxi ) h2xi hx i h2xj hx j
f (hxi ) xj 2 j =i ( h2x g(| hxi | ) i
f (hxj ) xj 2 h2xj g(| hxj | ))
+
is strictly positive, to guarantee
the energy reduction, it is required that − −−−−→ −−−−−→ −−−−−→ |xi |2 − 2xi EM Sxi = |xi − EM Sxi |2 − |EM Sxi |2 ≤ 0
(7)
− −−−−→ −−−−−→ Particularly, |xi |2 − 2xi EM Sxi achieves the minimal value of −|EM Sxi |2 when −−−−−→ xi = EM Sxi . This completes the proof.
3.3
Bandwidth Updating
To calculate the local bandwidth, a pilot density estimate is first calculated as p(xi ) =
1 xi − xj K( ), j =i h0 hd0
(8)
where h0 is a manually specified global bandwidth and d is the dimension of the data space. Based on Eq. 8, local bandwidths are updated as [3,4] hxi = h0 [
λ 0.5 ] , p(xi )
(9)
where p(xi ) is the estimated density at point xi , λ is a constant which is by default assigned to be geometric mean of {p(xi )}|i=1...N . In each iteration, the density estimate associated with the selected point is updated using a sample point density with bandwidth estimated from Eq. 9 as
p(xi ) =
j =i
1 x − xj K( i ). hdxj hx j
(10)
The procedure of updating the bandwidth is summarized as follows: Algorithm 2. Adaptive bandwidth updating using sample point estimator Input: The data point xki that is selected to move in the k th iteration and its corresponding bandwidth hkxi Output: An updated bandwidth hk+1 for the selected point xi – Calculate the updated density estimate p(xk+1 ) for the selected point aci cording to Eq. 10. – Calculate the updated bandwidth hk+1 for the selected point using Eq. 9 xi with the updated pilot density estimate p(xk+1 ). If hk+1 < hkxi , update the xi i k+1 k bandwidth with hk+1 ; otherwise, set h ← h . xi xi xi
EMS with Adaptive Bandwidth: A Fast and Noise Robust Approach
263
During iterations, the bandwidth of each data point adapts to the local density. Though Algorithm 2 only updates bandwidth when it becomes smaller, experiments show that the bandwidth associated with the selected point frequently reduces after the movement. This phenomenon is intuitive, as the EMS iteration compacts a dataset, which leads to a smaller bandwidth according to Eq. 9. To satisfy that Exj .xi is a monotonically increasing function of h(x), we ∂Ex
.x
have ∂hjx i ≥ 0. For both Gaussian and Epanechnikov kernels, the requirements i on f (hx ) are the same f (hx ) ∼ O(hα (11) x ), α ≥ 2 3.4
Stopping Criterion
In this work, we use a broad truncated kernel with an adaptive bandwidth, based on which a reliable stopping criterion using the EMS vector or the total energy can be given. A broad truncated kernel KB (x) is defined as K(x), x < M hx KB (x) = (12) 0, x ≥ M hx , where M is a positive constant satisfying that M hx can cover a large portion or the whole feature space. At an early stage of EMS iterations where a clear configuration of clusters has not been formed, the kernel KB is similar to a broad kernel. Through iterations, the bandwidth reduces and converges to zero. As a result, KB becomes a truncated kernel, which only covers a small region in the feature space and prevents the attraction of different clusters. 3.5
Convergence Rate of the EMS Algorithm with a Broad Kernel
In this section, we use the most widely used broad kernels, i.e., the Gaussian kernel and Epanechnikov kernel, with global bandwidth as examples to validate the fast convergence rate of the EMS algorithm. Theorem 2. The EMS algorithm with a broad kernel converges at an exponential rate. Proof. According to the energy definition (Eq. 3), the energy from point xj to xi using a broad kernel with bandwidth h is Exi .xj = f (h)(1 − K(
xi − xj )). h
(13)
The EMS vector for point xi in this case is calculated according to Eq. 4 as xj −xi 2 −−−−−→ j =i xj g(| h | ) EM Sxi = − xi . (14) xj −xi 2 j =i g(| h | )
264
Q. Zhao et al.
Theorem 2.1 Convergence rate under a broad Gaussian kernel. Begin with the Gaussian kernel, the corresponding energy reduction of moving a point from location xi to xi according to Eq. 6 is
(x −x )2 f (h) −−−−−→ − j h2 i exp (|xi − xi |2 − 2(xi − xi )EM Sxi ) 2 i j =i h (15) −−−−−→ Substituting xi = xi + EM Sxi into Eq. 15 yields
E(X) = Ex − Exi ≥
−−−−−→ E(X) ≥ |EM Sxi |2
j =i
(x −x )2 f (h) − j h2 i exp . h2
(16)
To obtain a convergence rate of the energy function, we project the points in the original d-dimensional space Rd onto an 1-dimensional space R1 , i.e., ∀xi ∈ X ⊂ Rd is projected to ui ∈ U ⊂ R1 . Denoting DX and DU as the maximal distance between points in Rd and its projected distance in R1 , we have DU ≤ DX . −−−−−→ −−−−−→ Further denote EM Sui as the projection of the EMS vector EM Sxi onto U ∈ R1 . Without loss of generality, assume ui − uj ≥ 0 for i ≥ j, we have −−−−−→ −−−−−→ |EM Sxi | > |EM Sui | = |
(xj −xi )2
− h2 j =i uj exp (x −x )2 − j h2 i exp j =i
− ui |.
(17) 2
DX −−−−−→ −−−−−→ −−−−−→ Particularly, for |EM Su1 | and |EM SuN |, we have |EM Su1 | > exp− h2 2 DX (uN −uj ) −−−−−→ |EM SuN | > exp− h2 j=NN −1 . Summing them up gives
j=1 (uj −u1 )
N −1
2
DX −−−−−→ −−−−−→ max(|EM Su1 |, |EM SuN |) ≥ exp− h2 DU /2.
(18)
Clearly DU can be chosen to be as large as DX . Combining Eq. 18 with Eq. 15, the energy reduction induced by moving the point corresponding to the largest energy reduction is max(E(X)) ≥ (N − 1)
D2 f (h) −2 hX 2 exp 2 4h2 DX
(19)
According to the definition of energy (Eq. 1 and Eq. 13), an upper bound of the required amount of iterations from E(X) to arbitrary small number ε is (1 − exp−
2 DX h2
)ln E(X) ε
2 DX −2 4h2 exp
D2 X h2
N ∼ O(N )
(20)
Theorem 2.2 Convergence rate under a broad Epanechnikov kernel. A broad Epanechnikov kernel can be approximated as a broad Gaussian kernel with large bandwidth h DX , x2 −h x2 2 , if |x| < h KE (x) = 1 − h2 ≈ exp (21) 0, otherwise.
,
EMS with Adaptive Bandwidth: A Fast and Noise Robust Approach
800
800
800
800
600
600
600
600
400
400
400
400
200
200
200
0
0
0
0
−200
−200
−200
−200
−400
−400
−400
−400
−600
−600
−600
−600
−400
−200
0
200
400
600
800
−400
−200
0
(a)
200
400
600
200
800
−400
−200
0
(b)
200
400
600
800
−400
(c)
−200
0
200
400
600
800
(d)
800
800
800
800
800
800
800
600
600
600
600
600
600
600
800 600
400
400
400
400
400
400
400
200
200
200
200
200
200
200
0
0
0
0
0
0
0
0
−200
−200
−200
−200
−200
−200
−200
−200
−400
−400
−400
−400
−400
−400
−400
−400
−600
−600 −400
−200
0
200
(e)
400
600
800
−600 −400
−200
0
200
400
600
800
−600 −400
(f)
−200
0
200
400
600
800
−600 −400
(g)
−200
0
200
400
600
800
−600 −400
(h)
265
−200
0
200
400
600
800
(i)
400 200
−600 −400
−200
0
200
400
600
800
−600 −400
−200
(j)
0
200
(k)
400
600
800
−400
−200
0
200
400
600
800
(l)
Fig. 1. Performance comparison of EMS and blurring mean shift. (a) - (d) display the snapshots of EMS at 0, 1, ..., 3 iterations per point. The grouping results shown in (a) are obtained through 5 isolated modes in (d). As a comparison, (e) - (t) display the snapshots of blurring mean shift at 0, 2, 4, ..., 14 iterations per point.
Applying Eq. 20 to the broad Epanechnikov kernel (Eq. 21) gives the upper bound of the number of iterations to converge as (1 − exp− 2 DX 4h2
2 DX h2
exp
)ln E(X) ε
D2 − hX 2
N |hDX ≈ 4ln
E(X) N. ε
(22)
This completes the proof. In practice, the total number of iterations is usually a couple of times the total number of points.
4 4.1
Experiments Experiments with Toy Dataset
In the first set of experiments we compare the EMS and BMS clustering methods using a toy dataset (4000 sample points) that delivers typical challenges of applications (irregular cluster geometry, density variation, sparse region, noise events, etc.). As shown in Figure 1, the first 4 - 6 iterations of BMS collapse the data into “lines”. Afterwards, the convergence speed dramatically reduces. Besides, collapsed “lines” are clearly broken into many segments. As a result, a post processing algorithm is critical. 4.2
Experiments with Image Segmentation
In the second set of experiments we apply the clustering algorithms to segment both grayscale and color images. Formally, each pixel in the image is represented by spatial and range features, i.e., (x, y, I) ∈ R3 for grayscale images and (x, y, R, G, B) ∈ R5 for color images where (x, y) denotes the image coordinate,
266
Q. Zhao et al.
60
60
60
60
60
60
50
50
50
50
50
50
40
40
40
40
40
40
30
30
30
30
30
30
20
20
20
20
20
10
10
10
10
10
10
20
30
40
50
(a)
60
10
20
30
40
60
10
(b)
20
30
40
50
60
10
(c)
20
30
40
50
60
20
10
10
(d)
20
30
40
50
60
10
20
(e)
30
40
h0 = 0.55
h0 = 0.30
h0 = 0.20
(h)
(i)
(j)
(k)
C=2
C=3
C=4
C=5
(m)
(n)
(o)
(p)
(l)
(r)
(s)
(t)
(u)
50
60
(f)
h0 = 0.85
(g)
(q)
50
(v)
Fig. 2. (a) - (f) EMS results for cameraman with initial bandwidth h0 equal to 50% of the data standard deviation, (a) - (f) correspond to 0, 0.5, 1, 1.5, 2, 3 iterations per point. (g) The original cameraman image (64× 64 pixels), (h) - (k) segmentation results using EMS with different h0 . (l) The original hand image (93×72 pixels), (m) (p) segmentation results with fixed h0 and different cluster numbers C. The clustering procedures are stopped after 5 iterations. (q) - (v) Segmentation results using MS ((q) - (s)) and BMS ((t) - (v)) on hand, copied from [6].
I and (R, G, B) represent the pixel value in a grayscale or color image respectively. Figure 2 (a) - (f) display EMS results for cameraman. When compared with the MS and BMS algorithms [6] (Figure 2 (q) - (v)), the EMS method (Figure 2 (m) - (p)) deals better with textured regions with noticeably less noise. For the experiments on the hand image, the EMS iterative procedure stops after 5 iterations (compared with 20 ∼ 100+ for MS and 10 ∼ 25 for BMS). 4.3
Experiments with Neural Spike Sorting
As a third set of experiment, we conduct quantitative comparisons of the EMS, MS, and BMS algorithms on neural spike sorting [7], using 12 sequences (500 spikes per sequence) from a public spike data set [8]. A performance summary
EMS with Adaptive Bandwidth: A Fast and Noise Robust Approach
267
Table 1. Quantitative results over 12 sequences from a public spike data base [8] Algorithm EMS MS BMS Accuracy 97% 89 % 94% α 0.99±0.11 0.16 ±0.13 0.3±0.23 Iterations 2 ∼ 6 15 ∼ 50+ 8 ∼ 18
is listed in Table 1. The clustering accuracy is calculated as the total error subtracting the same spike detection error [7]. We also study the convergence rates of the EMS, MS and BMS algorithms. Since one necessary condition of the stopping criteria for all algorithms is that the movements of the data points, i.e., the EMS vectors for the EMS algorithm and the root mean square of the MS vectors for the MS and BMS algorithms approach zero, we use the magnitude of the data movement as a quantitative measure for the compactness of the the clusters. In these experiments, the same initial bandwidth is assigned. To test the convergence speed of each algorithm, the EMS/MS vector magnitude is curve fitted with 10−α·N where N is the number of iterations and α is a parameter describing the convergence speed.
5
Conclusions
This paper presents an evolving mean shift algorithm. It defines an energy function to quantify the compactness of the dataset and iteratively collapses the data to isolated clusters in a couple of iterations. The single parameter of bandwidth is initialized based on sample point estimators and updated to accommodate the evolving procedure. The main theoretical contributions in this work are the validations of two theorems stating that the evolving mean shift procedure converges, and further, the energy reduces at an exponential rate which guarantees an extremely efficient convergence. Experiments with different data dimensionality demonstrate the advantage of EMS in terms of accuracy, robustness, speed, and ease of parameter selection. The low computational cost and superior performance makes it suitable to apply in many other practical tasks or subjects in additional to those mentioned in this paper.
References 1. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(8), 790–799 (1995) 2. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002) 3. Comaniciu, D., Ramesh, V., Meer, P.: The variable bandwidth mean shift and datadriven scale selection. In: IEEE International Conference on Computer Vision, vol. I, pp. 438–445 (2001)
268
Q. Zhao et al.
4. Hall, P., Hui, T., Marron, J.: Improved variable window kernel estimate of probability densities. The Annals of Statistics 23(1), 1–10 (1995) 5. Comaniciu, D.: An algorithm for data-driven bandwidth selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(2), 281–288 (2003) 6. Carreira-perpinan, M.: Fast nonparametric clustering with gaussian blurring meanshift. In: International Conference on Machine Learning, pp. 153–160 (2006) 7. Yang, Z., Zhao, Q., Liu, W.: Spike feature extraction using informative samples. In: Advances in Neural Information Processing Systems, pp. 1865–1872 (2009) 8. Quian Quiroga, R., Nadasdy, Z., Ben-Shaul, Y.: Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Computation 16(8), 1661–1687 (2004)
An Online Framework for Learning Novel Concepts over Multiple Cues Luo Jie1,2 , Francesco Orabona1, and Barbara Caputo1 2
1 Idiap Research Institute, Centre du Parc, Martigny, Switzerland ´ Ecole Polytechnique F´ed´erale de Lausanne (EPFL), Lausanne, Switzerland {jluo,forabona,bcaputo}@idiap.ch
Abstract. We propose an online learning algorithm to tackle the problem of learning under limited computational resources in a teacher-student scenario, over multiple visual cues. For each separate cue, we train an online learning algorithm that sacrifices performance in favor of bounded memory growth and fast update of the solution. We then recover back performance by using multiple cues in the online setting. To this end, we use a two-layers structure. In the first layer, we use a budget online learning algorithm for each single cue. Thus, each classifier provides confidence interpretations for target categories. On top of these classifiers, a linear online learning algorithm is added to learn the combination of these cues. As in standard online learning setups, the learning takes place in rounds. On each round, a new hypothesis is estimated as a function of the previous one. We test our algorithm on two student-teacher experimental scenarios and in both cases results show that the algorithm learns the new concepts in real time and generalizes well.
1
Introduction
There are many computer vision problems that are intrinsically sequential. In these problems the system starts learning from impoverished data sets and keeps updating its solution as more data is acquired. Therefore the system must be able to continuously learn new concepts, as they appear in the incoming data. This is a very frequent scenario for robots in home settings, where it is very likely to see something unknown [1] in a familiar scene. In such situations the robot cannot wait to collect enough data before building a model for the new concept, as it is expected to interact continuously with the environment. Limited space and computing power may also constrain the algorithm from being actually implemented, considering that the stream of training data can be theoretically infinite. Still, most of the used algorithms for computer vision are intrinsically batch, that is they produce a solution only after having seen enough training data. Moreover they are not designed to be updated often, because most of the time updating the solution is possible only through a complete re-training. A different approach is the online learning framework [2]. This framework is motivated by a teacher-student scenario, that is when a new concept is presented to the machine, the machine (student) can ask the user (teacher ) to provide a H. Zha, R.-i. Taniguchi, and S. Maybank (Eds.): ACCV 2009, Part I, LNCS 5994, pp. 269–280, 2010. c Springer-Verlag Berlin Heidelberg 2010
270
L. Jie, F. Orabona, and B. Caputo
label. This scenario would correspond to the case of a user explaining to the robot a detected source of novelty in a scene. The algorithms developed in the online learning framework are intrinsically designed to be updated after each sample is received. Hence the computational complexity per update is extremely low, but their performance is usually lower than similar batch algorithms. Ideally, we would like to have an online method with performance as high as batch based algorithms, with fast learning and bounded memory usage. Existing online algorithms (see for example [3,4]) fail to satisfy all these conditions. The mainstream approaches attempt to keep the same performance of batch based methods while retaining either fast learning or bounded memory growth but not both. On the other hand, multiple-cues/sources inputs guarantee diverse and information-rich sensory data. They make it possible to achieve higher and robust performance in varied, unconstrained settings. However, when using multiple inputs, the expansion of the input space and memory requirements is linearly proportional to the number of inputs as well as the computational time, for both the training and test phase. Some recent works in online learning applied to computer vision include: Monteleoni and K¨ aa¨ri¨ ainen [5] present two active learning algorithms in the online classification setting and test it on an OCR application; Fink et. al. [6] who describe a framework for online learning and present preliminary results on office images and face recognition. Grangier and Bengio [7] propose a PassiveAggressive algorithm [3] for Image Retrieval, which takes advantage of the efficient online learning algorithms. On the multi-cues literature, a recently proposed approach to combine cues in the batch setting is to learn the weights of the positive weighted sum of kernels [8]. Even if many attempts have been done to speed up the training process [9, and references therein], this approach is still slow and does not scale well to big datasets. Moreover these methods are intrinsically non-incremental, hence they cannot be used in a sequential setting. A theoretically motivated method for online learning over multiple cues has been proposed in [10], however they assume that all the cues live in the same space, meaning that the same kernel must be used on all the cues. In this work we tackle the problem of learning from data using an online learning algorithm over multiple visual cues. By combining online learning with multiple cues, we manage to get the best of both worlds, i.e. high performance, bounded memory growth and fast learning time. The proposed algorithm is tested on two experimental scenarios: the first is place recognition which simulates the student-teacher scenario where the robot is shown an indoor environment composed of several rooms (this is the kitchen, this is the corridor, etc), and later it is supposed to localize and navigate to perform assigned tasks. The second is object categorization which simulates the student-teacher scenario where the autonomous agent is presented a collection of new objects. For both scenarios, results show that the algorithm learns the new concepts in real time and generalizes well to new concepts. In the next section we describe the online learning framework and the building blocks that we will use in our online multi-cues architecture (Section 2-3).
An Online Framework for Learning Novel Concepts over Multiple Cues
271
Section 4 describes our experimental findings. Finally, we conclude the paper with a summary and a discussion on possible future research.
2
Online Learning
Online learning is a process of continuous updating and exploitation of the internal knowledge. It can also be thought of as learning in a teacher-student scenario. The teacher shows an instance to the student who predicts its label. Then the teacher gives feedback to the student. An example of this would be a robot which navigates in a closed environment, learning to recognize each room from its own sensory inputs. Moreover, to gain robustness and increase the classification performance, we argue for the need of learning using multiple cues. Hence our goal is to design an online learning algorithm for learning over multiple features from the same sensor, or data from multiple sensors, which is able to take advantage of the diverse and information-rich inputs and to achieve more robust results than systems using only a single cue. In the following we will introduce the online learning framework and we will explain how to extend it to multiple cues. Due to space limitations, this is a very quick account of the online learning framework — the interested readers are referred to [2] for a comprehensive introduction. 2.1
Starting from Kernel Perceptron
In online setting, the learning takes place in rounds. The online algorithm learns the mapping f : X → R based on a sequence of examples {xt , yt }lt=1 , with instance xt ∈ X and label yi ∈ {−1, 1}. We denote the hypothesis estimated after the t-th round by ft . At each round t, the algorithm receives a new instance xt , then it predicts a label yˆt by using the current function, yˆt = sign(ft (xt )), where we could interpret |f (x)| as the confidence in the prediction. Then, the correct label yt is revealed. The algorithm changes its internal model everytime it makes a mistake or the confidence on the prediction is too low. Here, we denote the set of all attainable hypotheses by H. In this paper we assume that H is a Reproducing Kernel Hilbert Space (RKHS) with a positive definite kernel function k : X × X → R implementing the inner product which satisfies the reproducing property, f (x) = k(x, ·), f (·). Perhaps the most well known online learning algorithm is Rosenblatt’s Perceptron algorithm [11]. On the t-th round the instance xt is given, and the algorithm makes a prediction yˆt . Then the true label is revealed: if there is a prediction mistake, i.e. yˆt = yt , it updates the hypothesis, ft = ft−1 + yt k(xt , ·), namely it stores xt in the solution. Otherwise the hypothesis is left intact, ft = ft−1 . Given the nature of the update, the hypothesis ft can be written as a kernel expansion [12], ft (x) = i∈St αi k(xi , x). The subset of instances used to construct the function is called the support set. Although the Perceptron is a very simple algorithm, it has been shown to produce very good results. Several other algorithms (see Passive-Aggressive [3] and the references therein) can be seen as
272
L. Jie, F. Orabona, and B. Caputo
belonging to the Perceptron algorithm family. However, given that they update each time there is an error, if the problem is not linearly separable, they will never stop adding new instances to the support set. This will eventually lead to a memory explosion. As we aim to use the algorithm in applications where data must be acquired continuously in time, a Perceptron algorithm cannot be used as it is. Hence we will use as a basic component of our architecture the Projectron++ algorithm [13]. 2.2
The Projectron++ Algorithm
The Projectron++ [13] algorithm is a Perceptron-like algorithm bounded in space and time complexity. It has a better mistake bound than Perceptron. The core idea of the algorithm comes from the work of Downs et. al. [14] on simplifying Support Vector Machine solutions. Hence, instead of updating the hypothesis every time a prediction mistake is made, or when the prediction is correct with low confidence1 , the Projectron++ first checks if the update can be expressed t−1 as a linear combination of vectors in the support set, i.e. k(xt , ·) = i=1 di k(xi , ·) = Pt−1 (k(xt , ·)), where Pt−1 (·) is the projection operator. The concept of linear independence can be approximated and tuned by a parameter η that measures the quality of the approximation. If the instance can be approximated within an error η, it is not added to the support set but the coefficients in the old hypothesis are changed to reflect the addition of the instance. If the instance and the support set are linearly independent, the instance is added to the set, as Perceptron. We refer the reader to [13] for a detailed analysis.
3 Online Multi-Cues Learning Algorithm
In this section we describe our algorithm for learning over multiple cues. We adapt the idea of high-level integration from the information fusion community (see [15] for a comprehensive survey), and design our algorithm with a two-layer structure. The first layer is composed of different Projectrons++, one for each cue. The second layer learns online a weighted combination of the classifiers of the first layer; hence we interpret the outputs of the Projectrons++ in the first layer as confidence measures of the different cues. A lot of work has been done on how to select the best algorithm from a pool of prediction algorithms, such as the Weighted Majority algorithm [16]. However, such methods usually assume black-box classifiers. Here we want to learn the best combination of classifiers, not just to pick the best one. Therefore we train both the single-cue classifiers and the weighted combination with online algorithms. In the rest of this section, we describe our algorithm in the binary setup. For multi-class problems, the algorithm is extended using the multi-class extension method presented in [17]; we omit the detailed derivation for lack of space.
¹ That is, when $0 < y_t f_{t-1}(x_t) < 1$.
Suppose we have N cues of the same data, $X = \{x^1, x^2, \ldots, x^N\}$. Each cue is described by a feature vector $x^i \in \mathbb{R}^{m_i}$, where $x^i$ could be the feature vector associated with one feature descriptor or one input sensor. Suppose also we are given a sequence of data $\{X_t, y_t\}_{t=1}^{l}$, where $y_t \in \{-1, 1\}$. On round t, in the first layer, N Projectrons++ [13] learn the mapping for each view, $f_t^i : x_t^i \to h_t^i$, where $h_t^i \in \mathbb{R}$, for $i = 1, 2, \ldots, N$. On top of the Projectron++ classifiers, a linear Passive-Aggressive algorithm [17] is used to learn a linear weighted combination of the confidence outputs of the classifiers, $\omega_t : h_t \to \mathbb{R}$. The prediction of the algorithm is $\hat{y}_t = \mathrm{sign}(\omega_t \cdot h_t)$. After each update of the first layer, we also update the confidences on the instances before passing them to the second layer. We denote by $\tilde{h}_t^i$ the confidence of the updated hypotheses and by $\hat{h}_t^i$ the confidence predictions of the Projectron++ classifiers before knowing the true label. Hence, instead of updating the second layer based on $\hat{h}_t^i$, the linear Passive-Aggressive algorithm on top considers the new updated confidence $\tilde{h}_t^i$. This modified updating rule prevents errors from propagating from the first layer to the second layer, and in preliminary experiments it was shown to be less prone to overfitting. We call this algorithm OMCL (Online Multi-Cue Learning, see Algorithm 1).
Algorithm 1. OMCL Algorithm
Input: Projectron++ parameter η ≥ 0; Passive-Aggressive parameter C > 0.
Initialize: $f_0^i = 0$, $S_0^i = \emptyset$, where $i = 1, 2, \ldots, N$; $\omega_0 = 0$.
for t = 1, 2, ..., T do
  Receive new instance $x_t^i$, where $i = 1, 2, \ldots, N$
  Predict $\hat{h}_t^i = f_{t-1}^i(x_t^i)$, where $i = 1, 2, \ldots, N$
  Predict $\hat{y}_t = \mathrm{sign}(\omega_t \cdot \hat{h}_t)$
  Receive label $y_t$
  for i = 1, 2, ..., N do
    Loss: $l1_t^i = \max(0, 1 - y_t \cdot \hat{h}_t^i)$
    if $l1_t^i > 0$ then
      Compute projection error Δ
      if $y_t = \mathrm{sign}(\hat{h}_t^i)$ or Δ ≤ η then
        Projection update: $f_t^i = f_{t-1}^i + \alpha_t^i y_t P_{t-1}(k(x_t^i, \cdot))$
      else
        Normal update: $f_t^i = f_{t-1}^i + y_t k(x_t^i, \cdot)$
      end if
      Update hypothesis: $\tilde{h}_t^i = f_t^i(x_t^i)$
    end if
  end for
  Loss: $l2_t = \max(0, 1 - y_t \omega_t \cdot \tilde{h}_t)$
  Set: $\tau_t = \min\left(C, \; l2_t / \|\tilde{h}_t\|^2\right)$
  Update: $\omega_{t+1} = \omega_t + \tau_t y_t \tilde{h}_t$
end for
3.1 Online to Batch Conversion
Online algorithms are meant to be used continuously in teacher-student scenarios, hence the update process never stops. However, it is possible to transform them into batch algorithms, that is, to stop the training and test the current hypothesis on a separate test set. It is known that when an online algorithm stops, the last hypothesis found can have an extremely high generalization error. This is due to the fact that online algorithms do not converge to a fixed solution, but constantly try to "track" the best solution. If the samples are Independent and Identically Distributed (IID), to obtain a good batch solution one can use the average of all the solutions found, instead of the last one. This choice also gives theoretical guarantees on the generalization error [18]. Our system produces two different hyperplanes at each round, one for each layer. In principle we could simply average each hyperplane separately, but this would break the IID assumption of the inputs for the second layer. So we propose an alternative method: given that the entire system is linear, it can be viewed as producing only one hyperplane at each round, that is, the product of the two hyperplanes. Hence we average this unique hyperplane and in the testing phase we predict with the formula $\mathrm{sign}\left(\frac{1}{T}\sum_{t=1}^{T} \omega_t \cdot \hat{h}_t\right)$. Note that, as $f_t$ can be written as a kernel expansion [12], the averaging does not imply any additional computational cost, but just an update of the coefficients of the expansion. We use this approach as it guarantees a theoretical bound and it was also found to perform better in practice.
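A minimal sketch of this online-to-batch conversion, written by us for illustration (the variable names and the way the per-round hypotheses are stored are our assumptions): the per-round scores of the combined linear system are averaged, which is equivalent to averaging the product hyperplane. In practice, as noted above, one would average the kernel-expansion coefficients instead of storing every round's hypothesis.

import numpy as np

def averaged_batch_predict(per_round_confidences, per_round_weights):
    # per_round_confidences[t]: vector of first-layer confidences for the TEST instance,
    #   computed with the cue classifiers stored after round t (one entry per cue).
    # per_round_weights[t]: second-layer weight vector omega_t stored after round t.
    # Implements sign( (1/T) * sum_t omega_t . h_hat_t ) from the text.
    T = len(per_round_weights)
    total = sum(np.dot(w, h) for w, h in zip(per_round_weights, per_round_confidences))
    return 1 if total / T >= 0 else -1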
4 Experiments and Results
In this section, we present an experimental evaluation of our approach in two different scenarios, corresponding to two publicly available databases. The first scenario is place recognition for a mobile robot, with experiments conducted on the IDOL2 dataset [19]. The second scenario is learning new object categories, with experiments conducted on the ETH-80 dataset [20]. Both experiments can be considered as teacher-student scenarios, where the system is taught to recognize rooms (or objects) by a human tutor. Therefore the robot has to learn the concepts in real time, and generalize well to new instances. For all experiments, we compared performance and memory requirements to the standard Perceptron algorithm, obtained by replacing the Projectron++ algorithm in our framework. We also compared our algorithm to two different cue-combination algorithms: the "flat" data structure and the majority voting algorithm. The "flat" structure is simply a concatenation of all the features of the different cues into a long feature vector, for which we trained a single Projectron++ classifier. The majority voting algorithm predicts the label by choosing the class which receives the highest number of votes. For a majority voting algorithm in the multi-class case, the number of experts (the number of cues in our experiments) required to guarantee a unique solution grows exponentially with the number of classes. Although this does not happen
very often in practice, we show that sometimes two or more classes receive an equal number of votes, especially when the number of cues is relatively small compared to the number of classes. We determined all of our online learning and kernel parameters via cross-validation. Our implementation of the proposed algorithm uses the DOGMA package [21]; the source code is available within the same package.
4.1 First Scenario: Place Recognition
We performed the first series of experiments on the IDOL2 database [19], which contains 24 image sequences acquired using a perspective camera mounted on two mobile robot platforms. These sequences were captured with the two robots moving in an indoor laboratory environment consisting of five different rooms (one-person office (OO), corridor (CR), two-person office (TO), kitchen (KT) and printer area (PA)); they were acquired under various weather and illumination conditions (sunny, cloudy, and night) and across a time span of six months. We considered the scenario where our algorithm has to incrementally update the model, so as to adapt to the variations captured in the dataset. For the experiments, we used the same setup described in the original paper [19] (Section V, Part B). We considered the 12 sequences acquired by the robot Dumbo, and divided them into training and testing sets, where each training sequence has a corresponding one in the test set captured under roughly similar conditions.
Fig. 1. Average online training error rate and recognition error rate for the test set on the IDOL2 dataset as a function of the number of training samples. Panels: (a) Perceptron, (b) Projectron++; curves: Color, CRFH, BOW, Laser, Combination, Flat.
Table 1. Place recognition error rate using different cues after the last training round. Each room is considered separately during testing, and each contributes equally to the overall average. The OMCL algorithm achieves better performance than any single cue. For the "Vote" algorithm, the percentage of test data for which two or more classes receive an equal number of votes, so that no definite prediction can be made (counted as a prediction error), is reported in brackets.

Cue       OO          CR         TO         KT         PA         ALL
Color     28.4        9.5        8.1        9.9        31.4       17.4
CRFH      14.2        4.1        15.4       15.4       11.1       12.0
BOW       21.5        6.9        17.5       11.4       8.4        13.1
Laser     7.6         3.7        8.5        10.7       12.9       8.7
OMCL      7.6         2.5        1.9        6.7        13.3       6.4
Optimal   4.9         3.0        3.7        4.0        3.9        3.9
Flat      5.7         2.7        7.5        8.5        11.8       7.2
Vote      11.7 (6%)   3.3 (2%)   5.3 (3%)   5.2 (3%)   9.1 (7%)   6.9 (4%)
Fig. 2. Average size of the support set for different algorithms (Perceptron vs. Projectron++; cues: Color, CRFH, BOW, Laser) on the IDOL2 dataset as a function of the number of training samples.
Similarly, each sequence was divided into five subsequences. Learning is done in chronological order, i.e. in the order in which the images were captured during acquisition of the dataset. During testing, all the sequences in the test set were taken into account. In total, we considered six different permutations of training and testing sets. The images were described using three image descriptors, namely CRFH [22] (Gaussian derivatives along the x and y directions), Bag-of-Words [23] (SIFT, 300 visual words) and an RGB color histogram. In addition, we also used a simple geometric feature from the laser scan sensor [24]. Fig. 1 reports the average online training and recognition error rate on the test set, where the average online training error rate is the number of prediction mistakes the algorithm makes on a given input sequence normalized by the length of the sequence. Fig. 2 shows the size of the support sets as a function of the number of training samples. Here, the size of the support set of the Projectron++ is close to that of the Perceptron. This is because the support set of the Perceptron is already
Fig. 3. Visualization of the average normalized weights for combining the confidence outputs of the Projectron++ classifiers (rows/columns indexed by cue – Color, CRFH, BOW, Laser – and room – OO, CR, TO, KT, PA): (a) the weights W_olmc obtained by our algorithm at the last training round; (b) the weights W_opt obtained by solving the optimization problem of Appendix A; (c) the angle between the two vectors W_opt and W_olmc as a function of the number of training samples.
very compact. Since the online training error rate is low, neither algorithm updates very frequently. In Table 1 we summarize the results using each cue after the last training round. We see that our algorithm outperforms both the "flat" data structure and the majority vote algorithm. The majority vote algorithm could not make a definite prediction on approximately 4% of the test data, because two or more classes received an equal number of votes. Moreover, we would like to see the difference in performance between the learned linear weights for combining the confidence outputs of the Projectron++ classifiers and the optimal linear weighted solution. In other words, what is the best performance a linear weighted combination rule can achieve? We obtained the optimal combination weights, denoted W_opt, by solving a convex optimization program (see Appendix A) on the confidence outputs of the Projectron++ classifiers on the test set. The result is reported in Table 1, which shows that our algorithm achieves performance similar to that of the optimal solution. We also visualize the average normalized weights for both the optimal solution and the weights obtained by our algorithm at the last learning round in Fig. 3a and b. From these figures, we can see that the weights on the diagonal of the matrix, which correspond to the multi-class classifiers' confidence for the same target category, have the highest values. Fig. 3c reports the average angle between the two vectors W_opt and W_olmc, the weights obtained by our algorithm during the online learning process. We can see that the angle between these two vectors gradually converges to a low value.
4.2 Second Scenario: Object Categorization
We tested the algorithm on the ETH-80 objects dataset [20]. The ETH-80 dataset consists of 80 objects from eight different categories (apple, tomato, pear, toy-cows, toy-horses, toy-dogs, toy-cars and cups). Each category contains 10 objects with 41 views per object, spaced equally over the viewing hemisphere, for a total of 3280 images. We use four image descriptors: one color feature (RGB color histogram), two texture descriptors (Composed Receptive Field Histogram
Fig. 4. Average recognition error rate for the categorization of never-seen objects on the ETH-80 dataset as a function of the number of training samples. Panels: Perceptron, Projectron++, Projectron++ (sparser); curves: Texture:LxLy, Texture:dirC, Color, Shape, Combination, Flat. All algorithms achieve roughly similar performance (Projectron++ is slightly better); the Perceptron converges earlier than the Projectron++ algorithms.
Table 2. Categorization error rate for different objects using different cues after the last training round. Our algorithm outperforms the "Flat" structure, the "Vote" algorithm and each cue used alone. It also shows that some cues are very descriptive of certain objects but not of others; for example, the color feature achieves almost perfect performance on tomato, while its performance on other objects is low. This supports our motivation for designing multi-cue algorithms. For the "Vote" algorithm, the percentage of test data for which two or more classes receive an equal number of votes is reported in brackets.

Cues     apple       car        cow         cup        dog         horse       pear       tomato     all
Lx Ly    26.3        4.1        52.4        29.3       29.2        49.4        7.0        9.6        25.9
DirC     25.0        21.9       60.7        43.3       65.9        55.7        29.8       3.2        38.2
Color    38.9        18.8       15.5        16.2       49.3        57.7        33.0       1.2        28.8
Shape    30.4        1.3        28.6        2.5        33.3        28.9        3.4        37.8       20.8
OMCL     11.1        1.5        17.4        13.7       34.1        44.2        5.0        1.1        16.0
Flat     29.3        1.5        28.2        2.0        32.1        28.0        3.3        35.1       20.0
Vote     24.2 (19%)  1.3 (1%)   30.1 (15%)  11.3 (9%)  34.0 (18%)  41.2 (20%)  3.6 (3%)   6.1 (5%)   19.0 (11%)
Fig. 5. Average size of the support set for different algorithms (Perceptron, Projectron++, Projectron++ (sparser); cues: Texture: LxLy, Texture: dirC, Color, Shape) on the ETH-80 dataset as a function of the number of training samples.
(CRFH) [22] with two different kinds of filters: Gaussian derivatives Lx Ly and gradient direction (DirC)), and a global shape feature (centered masks). We randomly selected 7 out of the 10 objects of each category as the training set, and the rest as the test set. All the experiments were performed over 20 different permutations of the training set. We first show the behavior of the algorithms over time. In Fig. 4, we show the average recognition error on never-seen objects as a function of the number of learned samples. In the experiments, we used two different settings of the η parameter, labeled Projectron++ and Projectron++ (sparser). The growth of the support set as a function of the number of samples is depicted in Fig. 5. We see that the Projectron++ algorithm obtains performance similar to that of the Perceptron with less than 3/4 (Projectron++) and 1/2 (Projectron++, sparser) of the support set size. Finally, in Table 2 we summarize the error rate using different cues for each category after the last training round (Projectron++).
5 Discussion and Conclusions
We presented an online method for learning from multi-cue/multi-source inputs. Through experiments on two image datasets, representative of two student-teacher scenarios for autonomous systems, we showed that our algorithm is able to learn a linear weighted combination of the marginal outputs of the classifiers on each source, and that this method outperforms the case where each cue is used alone. Moreover, it achieves performance comparable to the batch performance [19,20] at a much lower memory and computational cost. We also showed that the budget Projectron++ algorithm has the advantage of reducing the support set without removing or scaling instances in the set. This keeps performance high, while reducing the problem of the expansion of the input space
and memory requirements when using multiple inputs. Thanks to the robustness gained by using multiple cues, the algorithm can reduce the support set further (e.g. Projectron++ (sparser), see Fig. 4c and Fig. 5) without any significant loss in performance. This trade-off is potentially useful for applications working in a highly dynamic environment with limited memory resources, particularly for systems equipped with multiple sensors. Thanks to the efficiency of the learning algorithms, both learning and predicting can be done in real time with our Matlab implementation on most standard computer hardware. In the future, we would like to explore the theoretical properties of our algorithm. It is natural to extend our algorithm to the active learning setup [5] to reduce the effort of data labeling. It would also be interesting to explore co-training, where a classifier is updated using messages passed from other classifiers that predict certain cues with high confidence.
Acknowledgments. We thank Andrzej Pronobis for sharing the laser sensor features. This work was sponsored by the EU project DIRAC IST-027787.
References
1. Weinshall, D., Hermansky, H., Zweig, A., Jie, L., Jimison, H., Ohl, F., Pavel, M.: Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. In: Proc. NIPS 2008 (2008)
2. Cesa-Bianchi, N., Lugosi, G.: Prediction, learning, and games. Cambridge University Press, Cambridge (2006)
3. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. Journal of Machine Learning Research 7, 551–585 (2006)
4. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The Forgetron: A kernel-based Perceptron on a budget. SIAM Journal on Computing 37, 1342–1372 (2007)
5. Monteleoni, C., Kääriäinen, M.: Practical online active learning for classification. In: Proc. CVPR 2007, Online Learning for Classification Workshop (2007)
6. Fink, M., Shalev-Shwartz, S., Singer, Y., Ullman, S.: Online multiclass learning by interclass hypothesis sharing. In: Proc. ICML 2006 (2006)
7. Grangier, D., Bengio, S.: A discriminative kernel-based approach to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(8), 1371–1384 (2008)
8. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., El Ghaoui, L., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5, 27–72 (2004)
9. Rakotomamonjy, A., Bach, F., Grandvalet, Y., Canu, S.: SimpleMKL. Journal of Machine Learning Research 9, 2491–2521 (2008)
10. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Linear algorithms for online multitask classification. In: Proc. COLT 2009 (2009)
11. Rosenblatt, F.: The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–407 (1958)
12. Schölkopf, B., Herbrich, R., Smola, A., Williamson, R.: A generalized representer theorem. In: Proc. COLT 2000 (2000)
13. Orabona, F., Keshet, J., Caputo, B.: The Projectron: a bounded kernel-based Perceptron. In: Proc. ICML 2008 (2008)
14. Downs, T., Gates, K., Masters, A.: Exact simplification of support vector solutions. Journal of Machine Learning Research 2, 293–297 (2001)
15. Sanderson, C., Paliwal, K.K.: Identity verification using speech and face information. Digital Signal Processing 14(5), 449–480 (2004)
16. Littlestone, N., Warmuth, M.: Weighted majority algorithm. In: IEEE Symposium on Foundations of Computer Science (1989)
17. Crammer, K., Singer, Y.: Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research 3, 951–991 (2003)
18. Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. IEEE Trans. on Information Theory 50(9), 2050–2057 (2004)
19. Luo, J., Pronobis, A., Caputo, B., Jensfelt, P.: Incremental learning for place recognition in dynamic environments. In: Proc. IROS 2007 (2007)
20. Leibe, B., Schiele, B.: Analyzing appearance and contour based methods for object categorization. In: Proc. CVPR 2003 (2003)
21. Orabona, F.: DOGMA: a MATLAB toolbox for Online Learning (2009), http://dogma.sourceforge.net
22. Linde, O., Lindeberg, T.: Object recognition using composed receptive field histograms of higher dimensionality. In: Proc. ICPR 2004 (2004)
23. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: Proc. ICCV 2003 (2003)
24. Mozos, O.M., Stachniss, C., Burgard, W.: Supervised learning of places from range data using AdaBoost. In: Proc. ICRA 2005 (2005)
Appendix A
Let $\{\hat{h}_i, y_i\}_{i=1}^{l}$ be the confidence outputs of the Projectron++ classifiers on the test set of l instances, where each sample $\hat{h}_i$ is drawn from a domain $\mathbb{R}^m$ and each label $y_i$ is an integer from the set $Y = \{1, 2, \ldots, k\}$. In the multi-class setup, $m = n \times k$, where n is the number of cues (i.e. the number of Projectron++ classifiers) and k is the number of classes. Therefore, we obtained the optimal linear solution by solving the following convex optimization problem:

$\min_{W, \xi} \; \frac{1}{2}\|W\|^2 + C \sum_{i=1,\ldots,l, \; j \in y_i^C} \xi_{i,j}$

subject to $W^{y_i} \cdot \hat{h}_i + \xi_{i,j} > W^{j} \cdot \hat{h}_i \quad \forall i, \; j \in y_i^C$, and $\xi_{i,j} \geq 0$,

where W is the multi-class linear weighted combination matrix of size k × m, and $W^r$ is the r-th row of W. We denote the complement set of $y_i$ by $y_i^C = Y \setminus y_i$. This setting is a generalization of linear binary classifiers: the value $W^r \cdot \hat{h}_i$ denotes the confidence for the r-th class, and the classifier predicts the label using the function

$\hat{y} = \arg\max_{r=1,\ldots,k} \; W^r \cdot \hat{h}.$

The regularizer $\|W\|^2$ is introduced to prevent the slack variables $\xi_{i,j}$ from producing solutions close to $0^+$. The value of the cost C is determined through cross-validation.
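A small sketch of how this optimization could be set up numerically, assuming the CVXPY library; it is our own illustration and not the authors' code. Since convex solvers do not handle strict inequalities, the strict ">" is relaxed with a small margin eps, which is our assumption.

import cvxpy as cp
import numpy as np

def optimal_combination(H, y, k, C=1.0, eps=1e-6):
    # H: (l, m) matrix of test-set confidence outputs; y: length-l integer labels in {0, ..., k-1}
    # Returns the k x m combination matrix W of the convex program above.
    l, m = H.shape
    W = cp.Variable((k, m))
    xi = cp.Variable((l, k), nonneg=True)
    constraints = []
    for i in range(l):
        for j in range(k):
            if j == y[i]:
                continue
            # Strict inequality replaced by a small margin eps (our assumption)
            constraints.append(W[y[i], :] @ H[i] + xi[i, j] >= W[j, :] @ H[i] + eps)
    objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return W.value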
Efficient Partial Shape Matching of Outer Contours
Michael Donoser, Hayko Riemenschneider, and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology
{donoser,hayko,bischof}@icg.tugraz.at
Abstract. This paper introduces a novel efficient partial shape matching method named IS-Match. We use points sampled from the silhouette as the shape representation. The sampled points can be ordered, which in turn allows the matching step to be formulated as an order-preserving assignment problem. We propose an angle descriptor between shape chords combining the advantages of global and local shape description. An efficient integral image based implementation of the matching step is introduced which allows partial matches to be detected an order of magnitude faster than with comparable methods. We further show how the proposed algorithm is used to calculate a global optimal Pareto frontier to define a partial similarity measure between shapes. Shape retrieval experiments on standard shape databases like MPEG-7 prove that state-of-the-art results are achieved at reduced computational costs.
1 Introduction
Shape matching is a well investigated problem in computer vision and has versatile applications, e.g. in object detection [1,2,3] or image retrieval [4]. The most important part of designing a shape matcher is the choice of the shape representation, which has a significant effect on the matching step. Shapes have for example been represented by curves [5], medial axes [6], shock structures [7] or sampled points [8]. In general, current shape matching algorithms can be divided into two main categories: global and local approaches. Global matching methods compare the overall shapes of the input objects by defining a global matching cost and an optimization algorithm for finding the lowest cost match. One of the most popular methods for global shape matching is the shape context proposed by Belongie et al. [8]. Their algorithm uses randomly sampled points as shape representation and is based on a robust shape descriptor – the shape context – which allows the matching step to be formulated as a correspondence problem. The shape context is the basis for different extensions as proposed by Ling and Jacobs [9] or Scott and Nowak [10]. While such global matching methods work well on most of the standard shape retrieval databases, they cannot handle strong articulation, part deformations or occlusions. For example, shape context is a global descriptor and local articulations influence the description of every sampled point. To reduce this effect
Fig. 1. (a) Ref. 1, (b) Cand. 1, (c) Cand. 2, (d) Sim.: 44.19, (e) Sim.: 44.08. Global shape matching methods like COPAP [10] are not able to handle occlusion or strong articulation because they internally use global descriptors: almost the same similarity value is returned for the partially occluded shape and for a totally different one.
larger histogram bins are used further away from the point. Although this reduces the problem, occlusions, for example, still lead to matching errors, as illustrated in Figure 1 for the shape-context-based COPAP framework [10]. These problems are handled well by purely local matching methods, e.g. as proposed by Chen et al. [11], which allow local similarity to be measured accurately but in contrast fail to provide a strong global description for accurate shape alignment. In this work, we try to bridge the gap between the two worlds by combining their advantages. We propose a novel shape matching method denoted as IS-Match (Integral Shape Match). We use sampled points along the silhouette as representation and exploit the ordering of the points to formulate matching as an order-preserving assignment problem. We introduce a chord angle descriptor which combines local and global information and is invariant to similarity transformations. An integral image based matching algorithm detects partial matches with low computational complexity. The method returns a set of partial sub-matches and therefore also allows matching between differently articulated shapes. The main contributions of this paper are: (1) a chord angle based descriptor combining local and global information that is invariant to similarity transformations, (2) an efficient integral image based matching scheme where matching in practice takes only a few milliseconds, and (3) the calculation of a global optimal Pareto frontier for measuring partial similarity. The outline of the paper is as follows. Section 2 describes the partial shape matching concept named IS-Match in detail. Section 3 presents a comprehensive evaluation of the proposed algorithm in shape retrieval experiments on common databases like MPEG-7. All evaluations prove that state-of-the-art results are achieved at reduced computational costs.
2 Partial Shape Matching: IS-Match
Our shape matching algorithm named IS-Match takes two shapes as input and returns detected partial matches and a similarity score as result. Section 2.1 describes the shape representation used, which is a sequence of points sampled from the silhouette. Section 2.2 introduces a chord angle based descriptor invariant to similarity transformations. In Section 2.3 an efficient integral image based algorithm for matching the descriptor matrices to each other is outlined, which allows subparts of the contours that possess high similarity to be detected with low computational complexity. Section 2.4 describes how a global optimal Pareto frontier is calculated and how the corresponding Salukwadze distance is returned as a measure of partial similarity. Finally, Section 2.5 analyzes the required computational complexity for single shape matches.
2.1 Shape Representation
The first step of our method is to represent the shapes by a sequence of points sampled from the contour. There are two different variants of point sampling: (a) sampling the same number of points from the contours or (b) equidistant sampling, i.e. fixing the contour length between sampled points. The type of sampling significantly influences the invariance properties of our method. With equidistant sampling, occlusions as e.g. shown in Figure 1 can be handled, but then only shapes at the same scale are correctly matched. By sampling the same number of points our method becomes invariant to similarity transformations, but strong occlusions cannot be handled anymore. In this paper we always use equidistant sampling for the task of shape retrieval on single-scale data sets. Nevertheless, all subsequent parts of the method are defined independently of the sampling type; therefore, we can switch the sampling type without any modification of the method. Please note that in other applications, e.g. shape based tracking [12] or recognizing actions by matching to pose prototypes [13], equidistant sampling should be preferred. Because the proposed shape matching method focuses on analyzing silhouettes (as required in the areas of segmentation, detection or tracking), the sampled points can be ordered in a sequence, which is necessary for the subsequent descriptor calculation and the matching step. Thus, any input shape is represented by the sequence of points P1 ... PN, where N is the number of sampled points.
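For concreteness, here is a small sketch (our own, not the authors' implementation) of equidistant sampling of a closed contour: points are placed at equal arc-length spacing along the boundary polygon.

import numpy as np

def sample_contour_equidistant(contour, n_points):
    # contour: (K, 2) array of ordered boundary coordinates of a closed silhouette
    # Returns (n_points, 2) points spaced equally along the contour arc length.
    closed = np.vstack([contour, contour[:1]])          # close the polygon
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])       # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n_points, endpoint=False)
    samples = np.empty((n_points, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side='right') - 1   # segment containing t
        r = (t - cum[j]) / max(seg[j], 1e-12)           # relative position within segment
        samples[i] = (1 - r) * closed[j] + r * closed[j + 1]
    return samples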
2.2 Shape Descriptor
The descriptor constitutes the basis for matching a point Pi of the reference shape to a point Qj of the candidate shape. We formulate the subsequent matching step presented in Section 2.3 as an order-preserving assignment problem. Therefore, the descriptor should exploit the available point ordering information. In comparison, the shape context descriptor loses all the ordering information
due to the histogram binning and for that reason does not consider the influence of the local neighborhood on single point matches. We propose a descriptor inspired by chord distributions. A chord is a line joining two points of a region boundary, and the distribution of chord lengths and angles has been used as a shape descriptor before, e.g. by Cootes et al. [14]. Our descriptor uses such chords, but instead of building distribution histograms, we use the relative orientations between specifically chosen chords. Our descriptor is based on angles $\alpha_{ij}$ which describe the relative spatial arrangement of the sampled points. An angle $\alpha_{ij}$ is calculated between a chord $P_iP_j$ from a reference point $P_i$ to another sampled point $P_j$ and a chord $P_jP_{j-\Delta}$ from $P_j$ to $P_{j-\Delta}$ by

$\alpha_{ij} = \angle\left(P_iP_j, \; P_jP_{j-\Delta}\right), \qquad (1)$

where $\angle(\cdot, \cdot)$ denotes the angle between the two chords and $P_{j-\Delta}$ is the point that comes Δ positions before $P_j$ in the sequence, as illustrated in Figure 2. Since angles are preserved by a similarity transformation, this descriptor is invariant to translation, rotation and scale.
Fig. 2. Our shape descriptor is based on calculating N angles for each sampled point of the shape. In this case Pi is the reference point and the calculation of the angle αij to the point Pj with Δ = 3 is shown.
In the same manner, N different angles $\alpha_{i1}, \ldots, \alpha_{iN}$ can be calculated for one selected reference point $P_i$. Additionally, each of the sampled points can be chosen as reference point, and therefore an N × N matrix A defined as

$A = \begin{pmatrix} \alpha_{11} & \cdots & \alpha_{1N} \\ \vdots & \ddots & \vdots \\ \alpha_{N1} & \cdots & \alpha_{NN} \end{pmatrix} \qquad (2)$

can be used to redundantly describe the entire shape. Obviously, elements on the main diagonal $\alpha_{11}, \cdots, \alpha_{NN}$ are always zero. This descriptor matrix is not symmetric because it considers relative orientations. Please note that such a
Fig. 3. Visualizations of distinct chord angle based shape descriptors for three shapes ((a) Shape 01, (b) Shape 02, (c) Shape 03). Bright areas indicate parts of the silhouettes which significantly deviate from straight lines.
shape descriptor implicitly includes local information (close to the main diagonal) and global information (further away from the diagonal). Figure 3 shows different descriptor matrices for selected shapes. The descriptor depends on which point is chosen as the first point of the sequence. For example, the descriptor matrix A shown before changes to

$A^{(k)} = \begin{pmatrix} \alpha_{kk} & \cdots & \alpha_{k1} & \cdots & \alpha_{k(k-1)} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \alpha_{1k} & \cdots & \alpha_{11} & \cdots & \alpha_{1(k-1)} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \alpha_{(k-1)k} & \cdots & \alpha_{(k-1)1} & \cdots & \alpha_{(k-1)(k-1)} \end{pmatrix} \qquad (3)$

if the k-th point is set as the first point of the sequence. Because only closed boundaries are considered, these two matrices $A^{(k)}$ and A are directly related by a circular shift: matrix A can be obtained by shifting $A^{(k)}$ by k − 1 rows up and k − 1 columns to the left. This is an important property for the efficient descriptor matching algorithm presented in the next section.
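The descriptor matrix can be computed directly from the sampled point sequence. The following sketch is our own illustration of Eqs. (1)–(2); the use of a signed relative orientation (and its range) is an assumption on our part, as the exact angle convention is not spelled out here.

import numpy as np

def chord_angle_descriptor(points, delta=5):
    # points: (N, 2) array of contour points sampled in order (closed boundary)
    # Returns the N x N matrix A with A[i, j] = angle between chord P_i P_j
    # and chord P_j P_{j-delta} (Eq. 1); indices are taken modulo N.
    n = len(points)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                                  # diagonal entries stay zero
            v1 = points[j] - points[i]                    # chord P_i P_j
            v2 = points[(j - delta) % n] - points[j]      # chord P_j P_{j-delta}
            cross = v1[0] * v2[1] - v1[1] * v2[0]         # signed area term
            A[i, j] = np.arctan2(cross, np.dot(v1, v2))   # signed relative orientation
    return A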
2.3 Matching Algorithm
To find a partial match between two given shape contours R1 and R2, the corresponding descriptor matrices A1 of size M × M and A2 of size N × N are compared. For notational simplicity we assume that M ≤ N. The aim of shape matching is to identify parts of the two shapes that are similar to each other. In terms of comparing the two descriptor matrices, this amounts to finding r × r sized blocks, starting at the main diagonal elements A1(s, s) and
$A_2(m, m)$ of the two descriptor matrices, which yield a small angular difference value $D_\alpha(s, m, r)$, defined by

$D_\alpha(s, m, r) = \frac{1}{r^2} \sum_{i=0}^{r-1} \sum_{j=0}^{r-1} \left[ A_1(s+i, s+j) - A_2(m+i, m+j) \right]^2, \qquad (4)$

between them. This equation is valid due to the previously explained property that a different starting point just leads to a circular shift of the descriptor matrix, as illustrated in Equation 3. To find such blocks, all different matching possibilities and chain lengths have to be considered, and the brute-force method becomes inefficient for larger numbers of points. Therefore, different authors, e.g. [15], proposed approximations where for example only every n-th point is considered as starting point. We propose an algorithmic optimization to overcome the limitations of the brute-force approach, which is based on an adaptation of the Summed-Area-Table (SAT) approach to calculate all the descriptor differences $D_\alpha(s, m, r)$. The SAT concept was originally proposed for texture mapping and brought back to the computer vision community by Viola and Jones [16] as the integral image. The integral image concept allows rectangle image features, like the sum of all pixel values, to be calculated for any scale and any location in constant time. To calculate the similarity scores for all possible configuration triplets {s, m, r} in the most efficient way, N integral images $Int_1, \ldots, Int_N$, each of size M × M, are built for the N descriptor difference matrices $M_D^n$ defined by

$M_D^n = A_1(1{:}M, 1{:}M) - A_2(n{:}n{+}M{-}1, \; n{:}n{+}M{-}1). \qquad (5)$
The difference matrices $M_D^n$ represent the N possibilities to match the point sequences onto each other. Based on these N integral images $Int_1, \ldots, Int_N$, the difference values $D_\alpha(s, m, r)$ can be calculated for every block of any size starting at any point on the diagonal in constant time. As a final result, all matching triplets {s, m, r} which provide a difference value $D_\alpha(s, m, r)$ below a fixed threshold are returned. Obviously, the detected matches may overlap. Thus, the final result is obtained by merging the different matches, providing a set of connected point correspondences. This ability to match sub-fragments between two input shapes allows articulations to be handled, as illustrated in Figure 4, where for two scissors four different partially matched sub-fragments are returned.
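The block differences of Eq. (4) can be obtained from an integral image built over the squared element-wise differences of Eq. (5). The sketch below is our own illustration of this idea (threshold handling and overlap merging are omitted; the 0-based offset and the circular wrap of the candidate descriptor are our assumptions); it is not the authors' implementation.

import numpy as np

def block_differences(A1, A2, n):
    # A1: M x M descriptor of the reference shape, A2: N x N descriptor of the candidate,
    # n: offset selecting one of the N circular alignments (0-based here).
    M = A1.shape[0]
    idx = (np.arange(M) + n) % A2.shape[0]
    D = (A1 - A2[np.ix_(idx, idx)]) ** 2            # squared differences M_D^n, wrapped circularly
    S = D.cumsum(axis=0).cumsum(axis=1)             # summed-area table (integral image)
    S = np.pad(S, ((1, 0), (1, 0)))                 # pad so S[i, j] = sum of D[:i, :j]
    diffs = {}
    for s in range(M):                              # block start on the diagonal
        for r in range(2, M - s + 1):               # block size
            total = S[s + r, s + r] - S[s, s + r] - S[s + r, s] + S[s, s]
            diffs[(s, r)] = total / (r * r)         # D_alpha(s, m, r), with m determined by n
    return diffs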
2.4 Shape Similarity
It is important to provide a reasonable similarity measure in addition to the identified matching point sequences, e. g. for tasks like shape retrieval. Commonly, a combination of descriptor difference, matched shape distances like the Procrustes distance and bending energy of an estimated transformation like a Thin Plate Spline is used. Since we focus on partial similarity evaluation we adapt a measure described by Bronstein et al. [17]. They proposed to use the
Efficient Partial Shape Matching of Outer Contours
Fig. 4. Articulation invariance is handled by returning a set of partially matching boundary fragments; corresponding fragments are depicted in the same color. Panels: (a) Closed, (b) Open, (c) Desc. Closed, (d) Desc. Open.
Pareto framework for quantitative interpretation of partial similarity. They define two quantities: partiality λ(X, Y), which describes the length of the parts (the higher the value, the smaller the part), and dissimilarity ε(X, Y), which measures the dissimilarity between the parts, where X and Y are two contour parts of the shape. A pair Φ(X*, Y*) = (λ(X*, Y*), ε(X*, Y*)) of partiality and dissimilarity values fulfilling the criterion of lowest dissimilarity for the given partiality defines a Pareto optimum. All Pareto optima can be visualized as a curve, referred to as the set-valued Pareto frontier. Since finding the Pareto frontier is a combinatorial problem in the discrete case, mostly rough approximations are used as the final distance measure. Our matching algorithm automatically evaluates all possible matches for all possible lengths. Therefore, by focusing on the discretization defined by our point sampling, we can calculate a global optimal Pareto frontier by returning the minimum descriptor difference for all partialities. Finally, to get a single partial similarity value, the so-called Salukwadze distance is calculated based on the Pareto frontier by

$d_s(X, Y) = \inf_{(X^*, Y^*)} \left| \Phi(X^*, Y^*) \right|_1, \qquad (6)$
where $|\cdot|_1$ is the L1-norm of the vector. Therefore, $d_s(X, Y)$ measures the distance from the utopia point (0, 0) to the closest point on the Pareto frontier. The Salukwadze distance is then returned as the shape matching similarity score.
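A small sketch (ours, for illustration) of turning the per-partiality match qualities into the Pareto frontier and the Salukwadze distance: for every match length the minimum dissimilarity found is kept, and the L1 distance to the utopia point (0, 0) is minimized over the frontier. Normalizing partiality by the number of sampled points is an assumption.

def salukwadze_distance(match_costs, n_points):
    # match_costs: dict mapping match length r (number of matched points)
    #              to the best (minimum) descriptor difference found for that length.
    # n_points: number of sampled points, used to normalize partiality to [0, 1].
    frontier = []
    for r, dissimilarity in sorted(match_costs.items()):
        partiality = 1.0 - r / float(n_points)   # higher value = smaller matched part
        frontier.append((partiality, dissimilarity))
    # Salukwadze distance: smallest L1 norm of (partiality, dissimilarity) on the frontier
    return min(p + d for p, d in frontier)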
Fig. 5. IS-Match returns similarities for all possible matches and lengths which allows calculating a global optimal Pareto frontier. The Salukwadze distance is returned as partial similarity score.
Figure 5 illustrates the calculation of the global optimal Pareto frontier and the Salukwadze distance.
2.5 Computational Complexity
An exhaustive search over all possible matches for all possible lengths has a complexity of $O(2^{n+m})$. Our proposed approach based on integral image analysis enables matching in O(nm) time, where n and m are the numbers of sampled points on the two input shapes. We implemented our method in C, which enables shape matching on a desktop PC within milliseconds. For comparison, Table 1 summarizes complexities and runtimes of current state-of-the-art shape matching methods. As is shown in Section 3, only 30
Table 1. Comparison of computational complexity and runtime in milliseconds for a single match. Please note that, as shown in Figure 6, our algorithm only requires 30 points to achieve competitive results on reference data sets.

Method              N     Complexity       Runtime
Felzenszwalb [18]   100   O(m^3 k^3)       500 ms
Scott [10]          100   O(mnl)           450 ms
IDSC [9]            100   O(m^2 n)         310 ms
SC [8]              100   O(m^2 n)         200 ms
Schmidt [19]        200   O(m^2 log(m))    X
IS-Match            30    O(nm)            3 ms
sampled points are required to provide close to state-of-the-art shape retrieval results, which is possible within only 3 ms. Please note that the runtimes may vary due to differences in implementations and machine configurations. But, as can be seen, in general IS-Match outperforms the state-of-the-art concerning computational complexity and actual runtime. To the best of our knowledge this constitutes the fastest method for combinatorial matching of 2D shapes published so far.
3 Experimental Evaluation
To evaluate the overall quality of IS-Match, we first analyze the influence of the number of sampled points and of different parametrizations on shape retrieval performance on a common database in Section 3.1. The evaluation shows that only approximately 30 sampled points are required to achieve promising results, where a single match requires only 3 ms of computation time, outperforming all other shape matching algorithms by an order of magnitude. Section 3.2 shows results on the largest and currently most important benchmark for evaluating shape matching algorithms, the MPEG-7 database.
3.1 Performance Analysis
To evaluate the influence of the number of sampled points and of different parameterizations, we applied IS-Match to the task of shape retrieval on the common database of Sharvit et al. [20]. This database consists of 25 images of 6 different classes. Each shape of the database was matched against every other shape of the database, and the global optimal Salukwadze distance as described in Section 2.4 was calculated for every comparison. Then, for every reference image, all the other shapes were ranked by increasing similarity value. To evaluate the retrieval performance, the number of correct first-, second- and third-ranked matches that belong to the right class was counted. In all the experiments Δ was set to 5, but experimental evaluations with different parameterizations revealed that changing Δ has only a small effect on shape retrieval performance. Figure 6 illustrates the performance of our algorithm on this database, where the sum over all correct first-, second- and third-ranked matches is shown. Therefore, the best achievable performance value is 75. We present results of IS-Match in dependence of the number of sampled points. As can be seen, by sampling 30 points we achieve the highest score of 25/25, 25/25, 24/25, which represents the state-of-the-art for this database, as shown in Table 2.
3.2 Shape Retrieval on MPEG-7 Database
We further applied IS-Match to the MPEG-7 silhouette database [21] which is currently the most popular database for shape matching evaluation. The database consists of 70 shape categories, where each category is represented by 20 different images with high intra-class variability. The parametrization of
Fig. 6. Retrieval results as a function of the number of sampled points on the database of [20], consisting of 25 shapes of 6 different classes. The maximum achievable score is 75.
Table 2. Comparison of retrieval rates on the database of [20], consisting of 25 shapes of 6 different classes. The number of correct first-, second- and third-ranked matches is shown.

Algorithm               Top 1    Top 2    Top 3    Sum
Sharvit et al. [20]     23/25    21/25    20/25    64
Belongie et al. [8]     25/25    24/25    22/25    71
Scott and Nowak [10]    25/25    24/25    23/25    72
Ling and Jacobs [9]     25/25    24/25    25/25    74
IS-Match                25/25    25/25    24/25    74
our algorithm is based on the results shown in the previous section. The overall shape matching performance was evaluated by calculating the so-called bullseye rating, in which each image is used as a reference and compared to all of the other images. The mean percentage of correct images in the top 40 matches (the 40 images with the lowest shape similarity values) is taken as the bullseye rating. The measured bullseye rating for IS-Match was 84.79%, which is compared to state-of-the-art algorithms in Table 3. As can be seen, the score is close to the best ever achieved, 87.70% by Felzenszwalb et al. [18]. But please note that [18] uses a much more complex descriptor and requires about 500 ms per match. Therefore, analyzing the entire database takes approximately 136 hours for [18], while with IS-Match all similarity scores are provided within a single hour (!).
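For reference, the bullseye rating described above can be computed as follows. This sketch is our own and assumes a precomputed distance matrix where smaller values mean more similar shapes and integer class labels for balanced classes.

import numpy as np

def bullseye_rating(dist, labels, top=40):
    # dist: (n, n) matrix of pairwise shape distances (smaller = more similar)
    # labels: length-n integer array of class ids (e.g. 70 classes x 20 shapes for MPEG-7)
    n = len(labels)
    hits = 0
    for q in range(n):
        ranked = np.argsort(dist[q])[:top]          # the 'top' most similar shapes (query included)
        hits += np.sum(labels[ranked] == labels[q])
    per_class = np.bincount(labels).max()           # images per class (20 for MPEG-7, assumed balanced)
    return hits / float(n * per_class)              # mean fraction of correct shapes in the top matches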
Table 3. Comparison of retrieval rates and estimated overall runtimes in hours (!) for calculating the full N × N similarity matrix on the MPEG-7 database, consisting of 1400 images showing 70 different classes.

Algorithm   Mokht. [22]   Belongie [8]   Scott [10]   Ling [23]   Felz. [18]   IS-Match
Score       75.44%        76.51%         82.46%       86.56%      87.70%       84.79%
Runtime     X             54 h           122 h        84 h        136 h        1 h

4 Conclusion
This paper introduced a partial shape matching method denoted IS-Match. A chord angle based descriptor is presented which, in combination with an efficient matching step, allows subparts of two shapes that possess high similarity to be detected. We proposed a fast integral image based implementation which enables matching two shapes within a few milliseconds. Shape retrieval experiments on common databases like the MPEG-7 silhouette database proved that promising results are achieved at reduced computational costs. Due to the efficiency of the proposed algorithm it is also suited for real-time applications, e.g. in action recognition by matching human silhouettes to reference prototypes, or for tracking applications, which will be the focus of future work.
References
1. Opelt, A., Pinz, A., Zisserman, A.: A boundary-fragment-model for object detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 575–588. Springer, Heidelberg (2006)
2. Felzenszwalb, P.: Representation and detection of deformable shapes. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 102–108 (2003)
3. Ferrari, V., Tuytelaars, T., Gool, L.V.: Object detection by contour segment networks. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 14–28. Springer, Heidelberg (2006)
4. Mori, G., Belongie, S., Malik, J.: Shape contexts enable efficient retrieval of similar shapes. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), pp. 723–730 (2001)
5. Sebastian, T.B., Klein, P.N., Kimia, B.B.: On aligning curves. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 25(1), 116–125 (2003)
6. Kuijper, A., Olsen, O.F.: Describing and matching 2D shapes by their points of mutual symmetry. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 213–225. Springer, Heidelberg (2006)
7. Sebastian, T.B., Klein, P.N., Kimia, B.B.: Recognition of shapes by editing shock graphs. In: Proceedings of International Conference on Computer Vision (ICCV), vol. 1, pp. 755–762 (2001)
8. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24(4), 509–522 (2002)
9. Ling, H., Jacobs, D.W.: Using the inner-distance for classification of articulated shapes. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 719–726 (2005)
10. Scott, C., Nowak, R.: Robust contour matching via the order-preserving assignment problem. IEEE Transactions on Image Processing 15(7), 1831–1838 (2006)
11. Chen, L., Feris, R., Turk, M.: Efficient partial shape matching using Smith-Waterman algorithm. In: Proceedings of NORDIA Workshop at CVPR (2008)
12. Donoser, M., Bischof, H.: Fast non-rigid object boundary tracking. In: Proceedings of British Machine Vision Conference (BMVC) (2008)
13. Weinland, D., Boyer, E.: Action recognition using exemplar-based embedding. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR) (2008)
14. Cootes, T., Cooper, D., Taylor, C., Graham, J.: Trainable method of parametric shape description. Journal of Image Vision Computing 10(5), 289–294 (1992)
15. Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM Transactions on Graphics 21, 807–832 (2002)
16. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), pp. 511–518 (2001)
17. Bronstein, A., Bronstein, M., Bruckstein, A., Kimmel, R.: Partial similarity of objects, or how to compare a centaur to a horse. International Journal of Computer Vision (2008)
18. Felzenszwalb, P.F., Schwartz, J.D.: Hierarchical matching of deformable shapes. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR) (2007)
19. Schmidt, F.R., Farin, D., Cremers, D.: Fast matching of planar shapes in sub-cubic runtime. In: Proceedings of International Conference on Computer Vision (ICCV) (2007)
20. Sharvit, D., Chan, J., Tek, H., Kimia, B.: Symmetry-based indexing of image database. Journal of Visual Communication and Image Representation 9(4) (1998)
21. Latecki, L.J., Lakämper, R., Eckhardt, U.: Shape descriptors for non-rigid shapes with a single closed contour. In: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), pp. 424–429 (2000)
22. Mokhtarian, F., Abbasi, S., Kittler, J.: Efficient and robust retrieval by shape content through curvature scale space. In: Proc. of International Workshop on Image Databases and Multimedia Search, pp. 35–42 (1996)
23. Ling, H., Okada, K.: An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 29, 840–853 (2007)
Level Set Segmentation Based on Local Gaussian Distribution Fitting
Li Wang¹, Jim Macione², Quansen Sun¹, Deshen Xia¹, and Chunming Li³
¹ School of Computer Science & Technology, Nanjing University of Science and Technology, Nanjing, 210094, China
² Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269, USA
³ Institute of Imaging Science, Vanderbilt University, Nashville, TN 37232-2310, USA
[email protected]
Abstract. In this paper, we present a novel level set method for image segmentation. The proposed method models the local image intensities by Gaussian distributions with different means and variances. Based on the maximum a posteriori probability (MAP) rule, we define a local Gaussian distribution fitting energy with level set functions and local means and variances as variables. The means and variances of local intensities are considered as spatially varying functions. Therefore, our method is able to deal with intensity inhomogeneity. In addition, our model can be applied to some texture images in which the texture patterns of different regions can be distinguished from the local intensity variance. Our method has been validated for images of various modalities, as well as on 3D data, with promising results.
1 Introduction
Image segmentation plays an important role in computer vision. Due to the presence of noise, low contrast, and intensity inhomogeneity, it is still a difficult problem in the majority of applications. In particular, intensity inhomogeneity is a significant challenge to classical segmentation techniques such as edge detection and thresholding. In fact, problems related to image segmentation have been extensively studied and many different segmentation methods have been developed over the last twenty years. A well-known and successful class of methods is that of active contour models, which have been widely used in various image processing applications, such as extracting anatomical structures in medical imaging. The existing active contour models can be broadly categorized into two classes: edge-based models [1,2,3,4,5,6] and region-based models [7,8,9,10,11]. Edge-based models utilize the image gradient to identify object boundaries, and are usually sensitive to noise and weak edges [7]. Region-based models do not utilize the image gradient but aim to identify each region of interest by using a certain region descriptor, such as
This work is sponsored by the Graduate Student Research and Innovation Program of Jiangsu Province. Corresponding author.
intensity, color, texture or motion, to guide the motion of the contour [12]. Therefore, region-based models generally have better performance in the presence of image noise and weak object boundaries. However, most of the region-based models [7,9,10] rely on global information to guide contour evolution. For example, the well-known piecewise constant (PC) models are based on the assumption that image intensities are statistically homogeneous in each region, and thus fail to segment images with intensity inhomogeneity. To segment images with intensity inhomogeneity, two similar active contour models [10,8] were proposed under the framework of minimization of the Mumford–Shah functional [13]. These models, widely known as piecewise smooth (PS) models, can handle intensity inhomogeneity to some extent. However, the computational cost of the PS models is rather expensive due to the complicated procedures involved [14,15]. In practice, intensity inhomogeneity often occurs in many real-world images acquired with different modalities [15]. In particular, it is often seen in medical images, such as microscopy, computed tomography (CT), ultrasound, and magnetic resonance imaging (MRI). For example, MR images are often corrupted by a smoothly varying intensity inhomogeneity or bias field [16] due to a non-uniform magnetic field or susceptibility effects. A number of recent works have incorporated local intensity information into active contour models [14,15,17,18] for more accurate segmentation, especially in the presence of intensity inhomogeneity. These methods draw upon local intensity means, which enable them to cope with intensity inhomogeneity. In [19,20], both local intensity means and variances are used to characterize the local intensity distribution in the proposed active contour models. However, the local intensity means and variances are defined empirically in these models. Similar forms of local intensity means and variances were also introduced in [21] for a statistical interpretation of the Mumford–Shah functional. These methods were evaluated on a few images to show a certain capability of handling intensity inhomogeneity. Some other methods rely on a nonparametric statistical approach in the Markov random field framework [22]; though different, they enjoy many of the same advantages as our model.
2 Proposed Method: Local Gaussian Distribution Fitting In this section, we propose an implicit active contour model based on the local intensity distribution. To effectively exploit information on local intensities, we need to characterize the distribution of local intensities via a partition of a neighborhood [11]. For each point x in the image domain Ω, we consider a circular neighborhood with a small radius ρ, defined as O_x = {y : |x − y| ≤ ρ}. Let {Ω_i}_{i=1}^N denote a set of disjoint
image regions, such that Ω = ∪_{i=1}^N Ω_i and Ω_i ∩ Ω_j = ∅ for all i ≠ j, where N refers to the number of regions. The regions {Ω_i}_{i=1}^N produce a partition of the neighborhood O_x, i.e., {Ω_i ∩ O_x}_{i=1}^N. Now, we focus on segmenting this circular neighborhood O_x based on the framework of maximum a posteriori probability (MAP). Let p(y ∈ Ω_i ∩ O_x | I(y)) be the a posteriori probability of the subregions Ω_i ∩ O_x given the neighborhood gray values I(y). According to the Bayes rule:

p(y ∈ Ω_i ∩ O_x | I(y)) = p(I(y) | y ∈ Ω_i ∩ O_x) p(y ∈ Ω_i ∩ O_x) / p(I(y))     (1)
where p(I(y) | y ∈ Ω_i ∩ O_x), denoted by p_{i,x}(I(y)), is the probability density in the region Ω_i ∩ O_x, i.e., the gray value distribution within this region; p(y ∈ Ω_i ∩ O_x) is the a priori probability of the partition Ω_i ∩ O_x among all possible partitions of O_x; and p(I(y)) is the a priori probability of the gray value I(y), which is independent of the choice of the region and can therefore be neglected. Given that all partitions are a priori equally possible [9], i.e., p(y ∈ Ω_i ∩ O_x) = 1/N, the term p(y ∈ Ω_i ∩ O_x) can be ignored. Assuming that the pixels within each region are independent [9], the MAP is achieved only if the product of p_{i,x}(I(y)) across the regions of O_x is maximized: ∏_{i=1}^N ∏_{y ∈ Ω_i ∩ O_x} p_{i,x}(I(y)). Taking a logarithm, the maximization can be converted to the minimization of the following energy:

E_x^LGDF = ∑_{i=1}^N ∫_{Ω_i ∩ O_x} −log p_{i,x}(I(y)) dy     (2)
There are various approaches to model the probability densities p_{i,x}(I(y)), including a Gaussian density with fixed standard deviation [7], a full Gaussian density [23], or a nonparametric Parzen estimator [24,22]. Most image segmentation methods assume a global model for the probability of each region. Consequently, these methods have difficulties in the presence of intensity inhomogeneity. In this paper, we assume the mean and variance of the local Gaussian distribution are spatially varying parameters, i.e.,

p_{i,x}(I(y)) = (1 / (√(2π) σ_i(x))) exp( −(u_i(x) − I(y))² / (2σ_i(x)²) )     (3)

where u_i(x) and σ_i(x) are local intensity means and standard deviations, respectively. By introducing a weighting function into the functional (2), we define the following objective functional

E_x^LGDF = ∑_{i=1}^N ∫_{Ω_i ∩ O_x} −ω(x − y) log p_{i,x}(I(y)) dy     (4)
where ω(x − y) is a non-negative weighting function such that ω(x − y) = 0 for |x − y| > ρ and ∫_{O_x} ω(x − y) dy = 1. As in [11,15], we choose the weighting function ω as a truncated Gaussian kernel,

ω(d) = (1/a) exp( −|d|² / (2σ²) )  if |d| ≤ ρ;   ω(d) = 0  if |d| > ρ,

where a is a normalizing constant.
With such a truncated Gaussian kernel ω(x − y), the above objective function E_x^LGDF in Eq. (4) can be rewritten as

E_x^LGDF = ∑_{i=1}^N ∫_{Ω_i} −ω(x − y) log p_{i,x}(I(y)) dy     (5)
as ω(x − y) = 0 for y ∉ O_x. The ultimate goal is to minimize E_x^LGDF for all the center points x in the image domain Ω, which directs us to define the following double integral energy functional

E^LGDF = ∫_Ω ( ∑_{i=1}^N ∫_{Ω_i} −ω(x − y) log p_{i,x}(I(y)) dy ) dx     (6)
We will use one or multiple level set functions to represent a partition {Ω_i}_{i=1}^N. This energy can thus be converted to an equivalent level set formulation.

Remark 1. The proposed method is essentially different from the methods in [19,20,21] for the following reasons. First, we introduce a localized energy functional using a truncated Gaussian kernel. Second, the energy in our method is a double integral, whereas the energy in [19,20,21] is a single integral. They simply used spatially varying means and variances to replace the constant means and variances in a globally defined energy as in [23,9], without using a Gaussian kernel or any localized energy. However, in their implementation, the Gaussian kernel is introduced to compute spatially varying means and variances, which is not consistent with the theory.

2.1 Level Set Formulation

We can use multiple level set functions φ_1, ..., φ_n to represent regions {Ω_i}_{i=1}^N with N = 2^n as in [10]. For simplicity of notation, we denote Φ = (φ_1, ..., φ_n) and Θ = (u_1, ..., u_N, σ_1², ..., σ_N²). The energy of our method can be defined as

F(Φ, Θ) = ∫ ( ∑_{i=1}^N ∫ −ω(x − y) log p_{i,x}(I(y)) M_i(Φ(y)) dy ) dx + ν ∑_{i=1}^n L(φ_i) + μ ∑_{i=1}^n P(φ_i)     (7)

where ν, μ > 0 are weighting constants, H is the Heaviside function [10], L(φ_i) = ∫ |∇H(φ_i(x))| dx is the length term to derive a smooth contour during evolution [7], P(φ_i) = ∫ (1/2)(|∇φ_i(x)| − 1)² dx is the level set regularization term [6], and the M_i(Φ) are N functions of Φ which are designed such that ∑_{i=1}^N M_i(Φ) = 1. For the two-phase case (N = 2) and a level set function φ, we define M_1(φ(y)) = H(φ(y)) and M_2(φ(y)) = 1 − H(φ(y)). For the multiphase case, such as N = 3 and two level set functions φ_1 and φ_2, we can define M_1(φ_1, φ_2) = H(φ_1)H(φ_2), M_2(φ_1, φ_2) = H(φ_1)(1 − H(φ_2)), and M_3(φ_1, φ_2) = 1 − H(φ_1) to obtain a three-phase formulation as in [11]. In practice, the Heaviside function H is approximated by a smooth function H_ε defined by H_ε(x) = (1/2)[1 + (2/π) arctan(x/ε)]. The derivative of H_ε is the smoothed Dirac delta function δ_ε(x) = H_ε′(x) = (1/π) ε/(ε² + x²). The parameter ε in H_ε and δ_ε is set to 1.0 as in [7,10].
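For concreteness, the smoothed Heaviside and Dirac functions and the membership functions above can be written down directly. The following Python sketch (NumPy assumed, function names illustrative) mirrors the definitions of H_ε, δ_ε, and M_i given in this section; it is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def heaviside(x, eps=1.0):
    # Smoothed Heaviside: H_eps(x) = 0.5 * (1 + (2/pi) * arctan(x / eps))
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(x / eps))

def dirac(x, eps=1.0):
    # Smoothed Dirac delta, the derivative of H_eps: (1/pi) * eps / (eps^2 + x^2)
    return (eps / np.pi) / (eps * eps + x * x)

def memberships_three_phase(phi1, phi2, eps=1.0):
    # Three-phase membership functions: M1 = H(phi1) H(phi2),
    # M2 = H(phi1)(1 - H(phi2)), M3 = 1 - H(phi1); they sum to one.
    h1, h2 = heaviside(phi1, eps), heaviside(phi2, eps)
    return h1 * h2, h1 * (1.0 - h2), 1.0 - h1
```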
2.2 Gradient Descent Flow We describe the energy minimization for the three-phase case in this paper (for the two-phase case, please refer to our recent paper [25]). By calculus of variations, it can be shown that the parameters u_i and σ_i² that minimize the energy functional in Eq. (7) for a fixed Φ = (φ_1, φ_2) are given by

u_i(x) = ∫ ω(y − x) I(y) M_i(Φ(y)) dy / ∫ ω(y − x) M_i(Φ(y)) dy     (8)

and

σ_i(x)² = ∫ ω(y − x) (u_i(x) − I(y))² M_i(Φ(y)) dy / ∫ ω(y − x) M_i(Φ(y)) dy     (9)

Note that the formulas of u_i(x) and σ_i(x)² in Eqs. (8) and (9) are also presented in [19,20,21], which are, however, empirically defined there instead of being derived from a variational principle. Minimization of the energy functional F in Eq. (7) with respect to φ_1 and φ_2 can be achieved by solving the gradient descent flow equations:

∂φ_1/∂t = −δ(φ_1)H(φ_2)(e_1 − e_2) − δ(φ_1)(e_2 − e_3) + ν δ(φ_1) div(∇φ_1/|∇φ_1|) + μ (∇²φ_1 − div(∇φ_1/|∇φ_1|))     (10)

∂φ_2/∂t = −δ(φ_2)H(φ_1)(e_1 − e_2) + ν δ(φ_2) div(∇φ_2/|∇φ_2|) + μ (∇²φ_2 − div(∇φ_2/|∇φ_2|))     (11)

where e_i(x) = ∫ ω(y − x) [log(σ_i(y)) + (u_i(y) − I(x))²/(2σ_i(y)²)] dy. As proposed in [25,19], the image-based term (e_1 − e_2) or (e_2 − e_3) is less dependent on image contrast, which can be decreased by intensity inhomogeneity. Therefore, the proposed model can also deal with low contrast, which is often observed in ultra-high-field MR images, such as 7T MR images.
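All the integrals in Eqs. (8)-(11) are Gaussian-weighted local averages, so in practice they can be evaluated with convolutions. The sketch below (Python with NumPy/SciPy; the convolution-based expansion is our reading of the formulas, not code from the paper) computes the local means and variances of Eqs. (8)-(9) and the data term e_i used in the flows (10)-(11), with `gaussian_filter` approximating the truncated kernel ω.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_gaussian_stats(I, M, sigma=3.0, eps=1e-8):
    # Eqs. (8)-(9): u_i = K*(I M_i) / K*(M_i),
    # var_i = K*(I^2 M_i) / K*(M_i) - u_i^2, which equals the weighted (u_i - I)^2 average.
    den = gaussian_filter(M, sigma) + eps
    u = gaussian_filter(I * M, sigma) / den
    var = gaussian_filter(I * I * M, sigma) / den - u * u
    return u, np.maximum(var, eps)

def data_term(I, u, var, sigma=3.0):
    # e_i(x) = integral of w(y-x) [log sigma_i(y) + (u_i(y) - I(x))^2 / (2 var_i(y))] dy,
    # expanded so that every integral over y becomes a convolution.
    g = lambda f: gaussian_filter(f, sigma)
    inv2v = 1.0 / (2.0 * var)
    return (g(0.5 * np.log(var)) + g(u * u * inv2v)
            - 2.0 * I * g(u * inv2v) + I * I * g(inv2v))
```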
3 Results For the experiments in this paper, we use the following default setting of the parameters: σ = 3, time step Δt = 0.1, μ = 1, and ν = 0.0001 × 255². We first demonstrate our method in the two-phase case, i.e., the case of N = 2. Fig. 1 shows a comparison of our method with the PC model [7] and the LBF model [26,15]. Fig. 1(a) is a synthetic image containing a star-shaped object and background with the same intensity means but different variances. Intensity inhomogeneity was added in Fig. 1(b). Using the initial contour in Fig. 1(b), the results of the PC model, the LBF model and our method are shown in
Figs. 1(c), 1(d) and 1(e), respectively. It can be seen that the PC model fails to extract the object boundaries. With the LBF model, the local intensity means are rather close to each other and the local intensity variance information is not taken into account, which causes errors in the location of object boundaries. By contrast, our method, utilizing local intensity means and variances efficiently, achieves a satisfactory result.
Fig. 1. Experiments for a synthetic image: (a) original image with the same intensity means but different variances for the object and background; (b) synthetic image obtained by adding intensity inhomogeneity to (a) and initial contour; (c) result of the PC model; (d) result of the LBF model; (e) result of our method.
We evaluated the performance of the algorithm on a set of in vivo medical images. The first row of Fig. 2 shows a vessel image with intensity inhomogeneity. In this vessel image, the vessel boundaries are quite weak. The second and third rows show the results for CT images, which include some rather weak boundaries. Moreover, significant intensity variations exist in these images, which makes it a difficult task to recover the whole object boundary relying on local intensity means alone. In all these images, our model accurately recovers the object shapes. Fig. 3 shows the results of our multi-phase model for three 3T MR brain images (first three images of the upper row), which have obvious intensity inhomogeneity. The
Fig. 2. Results of our method for medical images. The curve evolution process is shown in every row for the corresponding image.
segmentation results are shown in the lower row. It can be observed that our method recovers the boundaries of white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) accurately. Our method has also been tested on 7T MR images, such as the image shown in the last column. The image exhibits not only intensity inhomogeneity but also low contrast. For example, in the middle of the image, the gray values of GM are nearly the same as those of CSF, which makes it a non-trivial task to recover the object boundaries. Fortunately, our model is less dependent on image contrast, as discussed in Section 2.2. Therefore, our method can successfully extract the object boundaries, as shown in the last column.
Fig. 3. Application of our method to 3T and 7T brain MR images. Upper row: original images; Lower row: results of our method. The red and blue curves are zero level sets of φ1 and φ2 .
Fig. 4 shows the comparison of the proposed method with the methods of Wells et al. [27] and Leemput et al. [28] on simulated images, as shown in the first column of Fig. 4. These images were obtained from the McGill Brain Web (http://www.bic.mni.mcgill.ca/brainweb/) with noise level 1% and intensity non-uniformity (INU) 80%. The ground truth and the results of Wells et al., Leemput et al., and the proposed method are shown in columns 2, 3, 4 and 5, respectively. Note that the segmentation results are visualized by displaying ∑_{i=1}^{3} c_i M_i, where c_i is the average gray value of M_i. To evaluate the performance of segmentation algorithms, we use the Jaccard similarity (JS) [16]. The closer the JS values are to 1, the better the segmentation. The JS values for WM and GM obtained by the three methods are listed in Table 1. It can be seen that the JS values of our method are higher than those of Wells's and Leemput's methods. Our method has also been applied to the 3D case. For example, Fig. 5 shows the desirable surfaces of WM and GM obtained by our model for this McGill Brain data. In addition, our model can be applied to some natural texture images in which the texture patterns of different regions can be distinguished from the local intensity variance. For example, the upper row of Fig. 6 shows the segmentation result of a tiger in the water. The lower row shows the segmentation result of two zebras. There are black areas from the shadow in the background which could be incorrectly segmented as the object. In fact, the local variances of the zebras and the black shadow are distinct. Therefore, our method can successfully extract the objects from the background.
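As a reference for the evaluation protocol, the Jaccard similarity of a segmented tissue against the ground truth is simply the ratio of the intersection to the union of the two label masks. A minimal sketch, assuming label-map inputs (not the authors' evaluation code), is:

```python
import numpy as np

def jaccard_similarity(result, ground_truth, label):
    # JS = |A intersect B| / |A union B| for the binary masks of one tissue label (e.g. WM or GM)
    a = (result == label)
    b = (ground_truth == label)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 1.0
```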
Fig. 4. Comparison with the methods of Wells et al. and Leemput et al. using simulated data obtained from McGill Brain Web. Column 1: original images; Column 2: ground truth segmentations; Column 3: Wells et al. ; Column 4: Leemput et al. ; Column 5: our method.
Table 1. Jaccard similarity coefficients for the results in Fig. 4

Tissue         Wells    Leemput  Our method
Axial WM       83.39%   82.06%   91.03%
Axial GM       84.57%   83.53%   85.92%
Coronal WM     81.68%   85.02%   92.86%
Coronal GM     81.12%   83.37%   88.57%
Sagittal WM    85.11%   88.20%   94.20%
Sagittal GM    83.20%   84.13%   87.24%
Fig. 5. WM and GM surfaces obtained by our model for the McGill Brain data
Fig. 6. Results of our method for texture images. The curve evolution process is shown in every row for the corresponding image.
4 Discussion and Conclusion In this paper, we have proposed a novel region-based active contour model for image segmentation in a variational level set framework. Our model efficiently utilizes the local image intensities, which are described by Gaussian distributions with different means and variances. Therefore, our method is able to deal with intensity inhomogeneity. Experimental results also demonstrate the desirable performance of our method for images with weak object boundaries and low contrast. However, the proposed method does have some limitations, such as its high computational cost. Improving the computational scheme is an important area of future work.
References 1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int’l J. Comp. Vis. 1(4), 321–331 (1987) 2. Kimmel, R., Amir, A., Bruckstein, A.: Finding shortest paths on surfaces using level set propagation. IEEE Trans. Patt. Anal. Mach. Intell. 17(6), 635–640 (1995) 3. Malladi, R., Sethian, J.A., Vemuri, B.C.: Shape modeling with front propagation: a level set approach. IEEE Trans. Patt. Anal. Mach. Intell. 17(2), 158–175 (1995) 4. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int’l J. Comp. Vis. 22(1), 61–79 (1997) 5. Xu, C., Prince, J.: Snakes, shapes, and gradient vector flow. IEEE Trans. Imag. Proc. 7(3), 359–369 (1998) 6. Li, C., Xu, C., Gui, C., Fox, M.D.: Level set evolution without re-initialization: A new variational formulation. In: CVPR, vol. 1, pp. 430–436 (2005) 7. Chan, T., Vese, L.: Active contours without edges. IEEE Trans. Imag. Proc. 10(2), 266–277 (2001) 8. Tsai, A., Yezzi, A., Willsky, A.S.: Curve evolution implementation of the Mumford-Shah functional for image segmentation, denoising, interpolation, and magnification. IEEE Trans. Imag. Proc. 10(8), 1169–1186 (2001) 9. Paragios, N., Deriche, R.: Geodesic active regions and level set methods for supervised texture segmentation. Int’l J. Comp. Vis. 46(3), 223–247 (2002) 10. Vese, L., Chan, T.: A multiphase level set framework for image segmentation using the Mumford and Shah model. Int’l J. Comp. Vis. 50(3), 271–293 (2002)
11. Li, C., Huang, R., Ding, Z., Gatenby, C., Metaxas, D., Gore, J.: A variational level set approach to segmentation and bias correction of medical images with intensity inhomogeneity. In: Metaxas, D., Axel, L., Fichtinger, G., Sz´ekely, G. (eds.) MICCAI 2008, Part II. LNCS, vol. 5242, pp. 1083–1091. Springer, Heidelberg (2008) 12. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: Integrating color, texture, motion and shape. Int’l J. Comp. Vis. 72(2), 195–215 (2007) 13. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math. 42(5), 577–685 (1989) 14. Piovano, J., Rousson, M., Papadopoulo, T.: Efficient segmentation of piecewise smooth images. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 709– 720. Springer, Heidelberg (2007) 15. Li, C., Kao, C., Gore, J., Ding, Z.: Implicit active contours driven by local binary fitting energy. In: CVPR, pp. 1–7 (2007) 16. Vovk, U., Pernus, F., Likar, B.: A review of methods for correction of intensity inhomogeneity in MRI. IEEE Trans. Med. Imag. 26(3), 405–421 (2007) 17. An, J., Rousson, M., Xu, C.: Γ -convergence approximation to piecewise smooth medical image segmentation. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 495–502. Springer, Heidelberg (2007) 18. Lankton, S., Melonakos, J., Malcolm, J., Dambreville, S., Tannenbaum, A.: Localized statistics for DW-MRI fiber bundle segmentation. In: Computer Vision and Pattern Recognition Workshops, pp. 1–8 (2008) 19. Brox, T.: From pixels to regions: Partial differential equations in image analysis. PhD thesis, Saarland University, Germany (2005) 20. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. Int’l J. Comp. Vis. 73(3), 242–262 (2007) 21. Brox, T., Cremers, D.: On the statistical interpretation of the piecewise smooth MumfordShah functional. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 203–213. Springer, Heidelberg (2007) 22. Awate, S.P., Zhang, H., Gee, J.C.: A fuzzy, nonparametric segmentation framework for DTI and MRI analysis: With applications to DTI-tract extraction. IEEE Trans. Med. Imag. 26(11), 1525–1536 (2007) 23. Zhu, S., Yuille, A.: Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 18(9), 884–900 (1996) 24. Kim, J., Fisher, J., Yezzi, A., Cetin, M., Willsky, A.: Nonparametric methods for image segmentation using information theory and curve evolution. In: ICIP, vol. 3, pp. 797–800 (2002) 25. Wang, L., He, L., Mishra, A., Li, C.: Active contours driven by local Gaussian distribution fitting energy. Signal Processing 89(12), 2435–2447 (2009) 26. Li, C.: Active contours with local binary fitting energy. In: IMA Workshop on New Mathematics and Algorithms for 3-D Image Analysis (2006) 27. Wells, W., Grimson, E., Kikinis, R., Jolesz, F.: Adaptive segmentation of MRI data. IEEE Trans. Med. Imag. 15(4), 429–442 (1996) 28. Leemput, V., Maes, K., Vandermeulen, D., Suetens, P.: Automated model-based bias field correction of MR images of the brain. IEEE Trans. Med. Imag. 18(10), 885–896 (1999)
Categorization of Multiple Objects in a Scene without Semantic Segmentation

Lei Yang¹, Nanning Zheng¹, Mei Chen², Yang Yang¹, and Jie Yang³

¹ Xi'an Jiaotong University, Shaanxi 710049, China
² Intel Labs Pittsburgh, Pittsburgh 15213, USA
³ Carnegie Mellon University, Pittsburgh 15213, USA
Abstract. In this paper, we present a novel approach for multi-object categorization within the Bag-of-Features (BoF) framework. We integrate a biased sampling component with a multi-instance multi-label learning and classification algorithm into the categorization system. With the proposed approach, we address two issues in BoF-related methods simultaneously: how to avoid scene modeling and how to predict the labels of an image without explicit semantic segmentation when multiple categories of objects co-exist. The experimental results on the VOC2007 dataset show that the proposed method outperforms others in the challenge's classification task and achieves good performance in multi-object categorization tasks.
1 Introduction
Global statistical bag-of-features (BoF) representations have shown good discriminability and robustness in the application of object categorization [1,2,3]. However, in real-world applications, especially when there is background clutter or multiple co-existing categories in the scene, current BoF methods often fail. To avoid scene modeling and solve the multi-category co-existence problem, a straightforward solution is to seek help from detection or localization methods [3,4]. However, this kind of solution is computationally expensive and sensitive to model parameter changes and environmental variations, and is thus difficult to utilize for categorization directly. Some image annotation methods [5,6,7] aim to solve a similar multi-category labeling problem. However, these methods are commonly based on region reasoning and labeling, whose credibility to a great extent depends on the accuracy of another open problem, semantic segmentation. In this paper, we pursue the categorization of multiple objects in a scene without explicit detection and segmentation. Firstly, a biased sampling strategy is utilized to produce a group of different sets of sampled patches for different categories of objects in an image. The basic idea of the biased sampling has been introduced in our previous research [8]. Then we regard the BoF representation of the biased sampling results for each category as an instance and an image as a bag of instances. We utilize the recent multi-instance multi-label (MIML)
learning technique [5] to perform multi-object categorization. By such an integration, our method avoids the hard problem of semantic segmentation, keeps the robustness of the BoF representation, and at the same time effectively realizes multi-category labeling. We evaluate the proposed algorithm on the VOC2007 dataset [9]. Our algorithm shows superior performance to the challenge's best results. Further experiments on a specifically designed subset of VOC2007, which is composed only of images with co-existing categories, demonstrate that the proposed method effectively solves the multi-labeling problem without explicit object segmentation. Classical BoF methods cannot be utilized to solve the multi-object categorization problem directly. The main reason is that they apply a globally uniform sampling strategy to the entire scene, i.e., the sampling results and the corresponding BoF representations are class independent. Typical sampling methods along this line of attack include dense sampling [10], random sampling [11], and interest point sampling [2,12,13]. To make class-independent sampling results efficiently adapt to the multiple-category label learning problem, an instance differentiation method was proposed in [14] and attempted in natural scene classification. Researchers have also tried to improve the patch sampling or feature selection component in the BoF framework to realize class-dependent statistical representations, such as [8,15,16,17]. These class-dependent sampling or feature selection methods can be utilized directly for the efficient categorization of multiple objects in a scene. However, they often lead to a separate hard problem of finding a general algorithm to build an accurate and robust object (part) detector. Most recently, L. Yang et al. proposed a biased sampling strategy to realize more efficient object modeling [8]. By combining information from two directions, a loose top-down class prior and bottom-up image-wise saliency, this biased sampling method returns class-dependent sampling results while at the same time striking a good balance between accuracy and robustness. Thus we chose this sampling method as a component of our algorithm for the categorization of multiple co-existing objects. The proposed method is also related to multi-class classification. In the classical multi-class setting, a group of binary classifiers is trained in either a one-vs-all or one-vs-one manner [2,11]. However, in real-world applications, the predefined categories are usually not mutually exclusive and can co-exist in the same scene. In such applications, the multi-label learning framework has been built to reduce the ambiguity of the output space and encode the relationships between categories [14,18]. However, multi-label learning approaches usually neglect the fact that each individual label of the image is actually more closely related to one or more regions instead of the entire image; thus, in substance, they are based on scene modeling. To address the scene modeling problem, multi-instance learning has been utilized [4,19], in which an image is viewed as a bag of instances corresponding to the regions from segmentations or annotations. If any of these instances is related to a label, the image will be associated with that label. However, current multi-instance learning methods mainly focus on the single-label scenario and depend greatly on semantic segmentation, which is a hard problem by itself.
Z. Zhou et al. proposed a multi-instance multi-label learning (MIML) framework [5], which is more reasonable for real-world applications. However, in their experiments, they still combined segmentation results with the MIML learning framework and only applied their algorithm to scene classification. In this paper, by integrating the proposed biased sampling strategy and the MIML learning method [5] into the BoF framework, we aim to realize multi-category labeling without explicit segmentation, which makes the algorithm more robust.
2 Proposed Framework for Multi-object Categorization
In this section, we give an overview of the proposed approach for multi-object categorization under the statistics-based BoF framework. Figure 1 shows the flowchart of the proposed algorithm. Firstly, we generate the biased sampling strategy, which uses prior probabilistic distributions over the patches of different locations and scales to guide the sampling process. The bias toward a certain category of objects and a certain image comes from a combination of top-down, loose class priors and a bottom-up, biologically inspired saliency measure, as introduced in [8]. The loose top-down class priors are learnt from the training data. Since the proposed biased sampling strategy helps the BoF representation toward class modeling rather than whole-image modeling, we can apply it to the multi-object categorization problem. In particular, we combine the biased sampling strategy with the MIMLSVM algorithm [5], which can handle the ambiguity in the input space (multi-instance) and the output space (multi-label) simultaneously. An image that contains multiple categories of objects can be described by multiple instances and associated with multiple class labels. In our framework, for a predefined list of categories, we utilize the biased sampling strategy to obtain an unordered group of sampling patch sets, then build a set of corresponding histogram-based representations based on the sampling results, and consider each of the representations as an instance in the bag (image). By integrating the biased sampling results into the multi-instance multi-label learning framework, we can get the labels of the co-existing categories in a scene simultaneously without explicit object segmentation, while at the same time keeping the robustness of statistics-based BoF methods.

Fig. 1. An overview of the proposed framework
3 Bias Generation
In order to realize the categorization of multiple categories in a scene and to strike a good balance between accuracy and robustness, we utilize a biased sampling strategy [8] in generating the class-dependent BoF representation. A probabilistic distribution over the patches of different locations and scales is computed to guide the biased sampling on objects of an interested category in a certain image. Loose class priors from top-down analysis are introduced and combined optimally with low-level information from bottom-up saliency measurements to conduct the sampling process. We give a brief introduction below on how to generate the class- and image-dependent sampling bias. Note that all images are preprocessed by segmenting them into smaller regions by simple over-segmentation (not by explicit segmentation here). The top-down loose class priors are learnt from images with category labels. Simple region-based features, such as color, texture, and shape, are utilized to characterize the basic analysis unit, the segment. We build vocabularies for each feature channel separately. Then we learn a score for each region-based feature which indicates how well it discriminates between the object and background, based on the image labels. Let O_c (Ō_c) indicate the presence (absence) of the object class c in an image, and let F_i be one of the entries in a certain vocabulary. Then the score can be computed by:

R_c(F_i) = exp{R̃_c(F_i) − 0.5},     (1)

where

R̃_c(F_i) = P(O_c | F_i) = P(F_i | O_c) / (P(F_i | O_c) + P(F_i | Ō_c)) ∈ [0, 1].     (2)
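The paper does not spell out how the conditional probabilities are estimated; a minimal Python sketch, assuming they are taken as relative occurrence frequencies of each vocabulary entry in training images with and without class c, is:

```python
import numpy as np

def feature_scores(pos_counts, neg_counts):
    # Eqs. (1)-(2): for every vocabulary entry F_i, estimate
    # R_tilde = P(F_i|O_c) / (P(F_i|O_c) + P(F_i|not O_c)) from occurrence
    # frequencies (an assumption), then R_c = exp(R_tilde - 0.5).
    p_pos = pos_counts / max(pos_counts.sum(), 1)   # estimate of P(F_i | O_c)
    p_neg = neg_counts / max(neg_counts.sum(), 1)   # estimate of P(F_i | not O_c)
    r_tilde = p_pos / (p_pos + p_neg + 1e-12)
    return np.exp(r_tilde - 0.5)
```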
A higher value of R_c indicates that the feature is more discriminative for identifying a category c. After the learning process, we obtain three feature vocabularies and their corresponding score tables for each class. Then, given a new image, we compute features for each region, assign them to their nearest neighbors in the color C_i, texture T_i, and shape SH_i vocabularies, and for a certain category c compute three corresponding scores for each region: R_c(C_i), R_c(T_i), and R_c(SH_i). The loose top-down object class prior map for class c can be computed, assuming the three features are independent, as:

O_td^(c)(x, y) = N(R_c(C_i) R_c(T_i) R_c(SH_i)),  ∀(x, y) ∈ Region_i,     (3)
where N (·) is a normalization function. The bottom-up saliency map Sbu is commonly used to assist localization of objects. Here a method introduced by Itti and Koch [20] is utilized to obtain a
biologically inspired saliency map. The saliency map is also combined with the segments Region_i to preserve weak spatial coherency:

S_bu(x, y) = max_{(x,y) ∈ Region_i} S(x, y),  ∀(x, y) ∈ Region_i.     (4)
Then we optimally combine both top-down and bottom-up information to formulate the final sampling bias. For class c, the probability distribution T^(c) over regions of different locations can be represented as:

T^(c) = α^(c) S_bu + (1 − α^(c)) O_td^(c).     (5)
The optimal linear combination weight α^(c) is chosen to make the salient regions in the final map more compact. It is an image- and class-dependent parameter and can be computed automatically. The compactness of a region is evaluated by formulating T^(c) as a probability density function and computing its variance. T^(c)(x, y) denotes the likelihood of the existence of objects of class c at location (x, y). The 2D object location probability distribution is further converted into a 3D representation P(x, y, scale), with x position, y position, and scale as the three dimensions. Using the bias generation method introduced above, given a certain image and a predefined category list, we can generate a set of probability distributions as well as a group of corresponding sampling sets.
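To make Eqs. (3)-(5) concrete, the Python sketch below builds the top-down prior map from the per-region scores and combines it with the bottom-up saliency map. The input names, the sum-to-one stand-in for N(·), and the grid search for α^(c) are our own assumptions; the paper only states that α^(c) is computed automatically from the compactness criterion.

```python
import numpy as np

def class_prior_map(region_ids, color_ids, texture_ids, shape_ids,
                    R_color, R_texture, R_shape):
    # Eq. (3): O_td(x, y) = N(R_c(C_i) R_c(T_i) R_c(SH_i)) for all (x, y) in Region_i.
    n_regions = region_ids.max() + 1
    score = np.array([R_color[color_ids[r]] * R_texture[texture_ids[r]]
                      * R_shape[shape_ids[r]] for r in range(n_regions)])
    prior = score[region_ids]               # spread each region's score to its pixels
    return prior / (prior.sum() + 1e-12)    # sum-to-one normalization stands in for N(.)

def combine_bias(S_bu, O_td, alphas=np.linspace(0.0, 1.0, 21)):
    # Eq. (5): T = alpha * S_bu + (1 - alpha) * O_td, with alpha chosen so that T,
    # viewed as a 2-D density, has minimal spatial variance (most compact salient region).
    ys, xs = np.mgrid[0:S_bu.shape[0], 0:S_bu.shape[1]]
    best_T, best_var = None, np.inf
    for a in alphas:
        T = a * S_bu + (1.0 - a) * O_td
        p = T / (T.sum() + 1e-12)
        mx, my = (p * xs).sum(), (p * ys).sum()
        var = (p * ((xs - mx) ** 2 + (ys - my) ** 2)).sum()
        if var < best_var:
            best_var, best_T = var, T
    return best_T
```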
4 Multi-instance Multi-label Learning Using the Biased Sampling Results
Typical BoF classification algorithms cannot solve the multi-category co-existence problem. Since the biased sampling strategy helps the BoF representation toward class modeling rather than whole-image modeling, we can perform multi-object categorization using the multi-instance multi-label classifier adopted from [5].

4.1 Problem Formulation
An image is described by multiple instances and associated with multiple class labels. In our framework, instead of using a carefully segmented region as an instance as in [5], we consider the BoF (histogram) representation of the biased sampling results for each category as an instance. In this way, we keep the robustness of the multi-instance learning framework and the BoF representation, while at the same time avoiding the hard segmentation problem. Each image, i.e., X_u, can be represented as a multi-instance bag:

X_u = {x_u^(1), x_u^(2), ..., x_u^(n)}.     (6)

Let χ denote the instance space and ζ the set of class labels. There are n predefined categories. X_u ⊆ χ, and Y_u ⊆ ζ is a set of labels {y_u^(1), y_u^(2), ..., y_u^(l)}.
Here l denotes the number of labels in Y_u. The task is then to learn a function f_MIML : 2^χ → 2^ζ from a given training set of m images {(X_u, Y_u)}, u = 1, 2, ..., m. There are two points we need to clarify here: (a) We ignore the relationship between an instance (a set of biased sampling results) and its category label introduced in Section 3; that is to say, we obtain an unordered group of sampling result sets. By relaxing this constraint, we further reduce the risk of introducing inaccurate class priors and thus increase the robustness of the categorization algorithm. (b) By utilizing the MIML learning method [5], to some extent the latent relationships between co-existing categories have been considered and encoded in the final classification models.

4.2 Multi-instance Multi-label Learning and Classification
Let Γ = {X_u | u = 1, 2, ..., m} indicate the data obtained from the training images after carrying out the algorithm in Section 3. Then, k-medoids clustering is performed on Γ. Since each data item in Γ is a multi-instance bag instead of a single instance, we employ the Hausdorff distance to measure the distance. In detail, given two bags X_i = {x_i^(1), x_i^(2), ..., x_i^(n)} and X_j = {x_j^(1), x_j^(2), ..., x_j^(n)}, the Hausdorff distance between X_i and X_j is defined as

d_H(X_i, X_j) = max{ max_{x_i ∈ X_i} min_{x_j ∈ X_j} ||x_i − x_j||, max_{x_j ∈ X_j} min_{x_i ∈ X_i} ||x_j − x_i|| },     (7)
where ||x_i − x_j|| measures the Euclidean distance between x_i and x_j. After the clustering process, we divide the data set Γ into k partitions whose medoids are M_t (t = 1, 2, ..., k), respectively. With the help of these medoids, we transform the original multi-instance example X_u into a k-dimensional numerical vector Z_u, of which the i-th (i = 1, 2, ..., k) component, Z_ui, is the distance d_H(X_u, M_i) between X_u and M_i. In other words, Z_ui encodes some structural information of the data, that is, the relationship between X_u and the i-th partition of Γ. Thus, the original MIML examples (X_u, Y_u) (u = 1, 2, ..., m) have been transformed into multi-label examples (Z_u, Y_u) (u = 1, 2, ..., m). Then, for each y ∈ ζ, we derive a data set D_y = {(Z_u, φ(Z_u, y)) | u = 1, 2, ..., m}, where φ(Z_u, y) = +1 if y ∈ Y_u and −1 otherwise, and an SVM classifier h_y is trained with D_y. In classification, a new test example X∗ is labeled with all the class labels with positive SVM scores, except that when all the SVM scores are negative, the test example is labeled with the class label that has the top (least negative) score. This can be represented as:

Y∗ = {arg max_{y ∈ ζ} h_y(Z∗)} ∪ {y | h_y(Z∗) ≥ 0, y ∈ ζ},     (8)

where Z∗ = (d_H(X∗, M_1), d_H(X∗, M_2), ..., d_H(X∗, M_k)). By integrating the biased sampling results into the MIML learning framework, we can get the labels of the categories in a scene simultaneously without explicit object segmentation, and thus eschew the errors introduced by semantic region division and decrease the computation cost by considering fewer instances. In the experiments, we set k = 500.
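The bag-level distance of Eq. (7) and the constructed k-dimensional representation Z_u can be sketched as follows (Python/NumPy; each bag is assumed to be an array of BoF histograms and `medoids` are the k bags returned by k-medoids clustering; this is an illustration, not the MIMLSVM authors' code):

```python
import numpy as np

def hausdorff(Xi, Xj):
    # Eq. (7): symmetric Hausdorff distance between two bags of instances.
    D = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def miml_features(bags, medoids):
    # Z_u = (d_H(X_u, M_1), ..., d_H(X_u, M_k)); one binary SVM per label is then
    # trained on these vectors, as in the MIMLSVM algorithm described above.
    return np.array([[hausdorff(X, M) for M in medoids] for X in bags])
```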
5 Experiments
We carried out experiments on the VOC2007 classification dataset [9], which contains 20 classes. We chose this dataset for three reasons: (a) VOC2007 is one of the most popular datasets for object categorization; (b) it is the latest dataset in the Pascal Challenge series whose full annotations are publicly available, which makes it possible for us to design specific experiments for evaluation; (c) about 1760 images in its test set are associated with more than one category label, so there are enough data for evaluating our proposed multi-object categorization method. We designed two experiments based on this dataset. Firstly, we evaluate the performance of the proposed method on the general task of object categorization on the whole VOC2007 classification dataset. In total, there are 5011 images for training and 4952 images for testing. We used the published competition results on this dataset in the Pascal Challenge as a baseline. Our own system is similar to the LEAR-flat system [21] in the Pascal challenges except that we use our proposed biased sampling strategy. To make the experiments comparable, we kept most of the components in the framework and the parameter settings the same as reported by [21]. However, for the patch sampling component, the proposed biased sampling strategy (BS for abbreviation in this section) is utilized instead, and for the classifier component, both the 1vsAll SVM (as in LEAR-flat) and MIMLSVM have been tested in our system. For each image, we fixed the number of sampled patches to 2000. Note that this number is still smaller than that of the original LEAR-flat system. Average precisions by category and the median average precision for the 20 categories are computed. Results of our system (BS+LEAR-flat and BS+LEAR-flat+MIMLSVM), LEAR-flat [21], and the challenge winner [9] are compared in Table 1. Our system achieves a higher median average precision than LEAR-flat and is even superior to the winner of the Pascal Challenge with either of the classifiers. This is evidence that the proposed sampling strategy leads the BoF-based methods toward superior performance in object categorization on VOC2007 even with fewer sampled patches. The second experiment was specifically designed for evaluating the performance of the proposed method in multi-object categorization. A subset of 1760 images was extracted from the VOC2007 test set for testing, composed only of images with multiple co-existing categories. The average number of categories per image in this subset is 2.17. In this paper, a main contribution is to perform multi-object categorization within the global statistics-based framework and without semantic segmentation. To demonstrate the superiority of the proposed algorithm, algorithms which could achieve a similar goal are compared. There are two main lines of attack. The first kind is designed and developed under the BoF framework. We fixed the other components as in the first experiment except for the sampling component and the classifier. Here, three state-of-the-art sampling algorithms are compared: the Harris-Laplace detector, random sampling [11], and Parikh et al.'s class context saliency based sampling method [17] (HL, RS, and CS for abbreviation in this section). As for the classifier, the 1vsAll SVM (BSVM for short) and the MIMLSVM classifier
Table 1. Comparisons of average precision by method and by class (we only list the first 8 classes due to the space limit) on VOC2007. See text for details of each method.

Methods                           aeroplane bicycle bird   boat   bottle bus    car    cat    MAP
challenge winner [9]              0.775     0.636   0.561  0.719  0.331  0.606  0.780  0.588  0.594
LEAR-flat [21]                    0.748     0.625   0.512  0.694  0.292  0.604  0.763  0.576  0.575
BS + LEAR-flat                    0.789     0.651   0.554  0.728  0.322  0.627  0.795  0.588  0.607
BS + LEAR-flat + MIMLSVM (ours)   0.784     0.664   0.551  0.733  0.329  0.634  0.806  0.586  0.621
Table 2. Comparisons of average precision by method and by class (first 8 classes) on the subset for multi-object categorization. See text for details of each method.

Methods            aeroplane bicycle bird   boat   bottle bus    car    cat    MAP
Chance             0.018     0.090   0.014  0.029  0.089  0.052  0.170  0.039  0.109
HL+BSVM            0.521     0.454   0.189  0.574  0.211  0.504  0.577  0.299  0.417
HL+MIMLSVM         0.595     0.603   0.223  0.637  0.267  0.597  0.699  0.355  0.497
RS[11]+BSVM        0.523     0.432   0.176  0.583  0.174  0.518  0.537  0.298  0.423
RS[11]+MIMLSVM     0.610     0.609   0.218  0.665  0.239  0.581  0.592  0.387  0.488
CS[17]+BSVM        0.577     0.532   0.312  0.670  0.296  0.589  0.619  0.440  0.514
CS[17]+MIMLSVM     0.621     0.621   0.364  0.687  0.329  0.593  0.695  0.480  0.537
Seg+MIMLSVM[5]     0.628     0.589   0.338  0.686  0.340  0.597  0.671  0.511  0.529
BS+BSVM            0.637     0.604   0.298  0.696  0.301  0.602  0.703  0.456  0.541
BS+MIMLSVM         0.642     0.625   0.337  0.699  0.331  0.606  0.734  0.502  0.559
are chosen again for comparison. Note that we utilized the instance differentiation method in [14] to make the class-independent sampling results from these samplers adapt to the MIMLSVM learning setting. Thus, by different combinations, we have 7 baseline systems in this line of attack (HL+BSVM, HL+MIMLSVM, RS+BSVM, RS+MIMLSVM, CS+BSVM, CS+MIMLSVM, and BS+BSVM). The second kind of method is based on careful segmentations. To illustrate the potential of the statistics-based method in multi-object categorization, we also chose a representative segment-reasoning based multi-category labeling algorithm [5] (Seg+MIMLSVM for short) for comparison. In summary, there are 8 baseline systems in total. Results of the proposed system, these baseline systems, and also chance are compared in Table 2. For all the baseline techniques, we fixed the number of sampled patches to 2000, except that we fixed the cornerness threshold to zero for the Harris-Laplace detector. Since the proposed method puts more emphasis on objects from the class of interest and returns class-dependent sampling results by a biased sampling strategy, even with the setting for single-category classification (referred to as BSVM), the system evidently outperforms the other state-of-the-art systems in multi-object categorization. Some sampling results of the proposed biased sampling strategy are illustrated in Fig. 2 (300 patches are sampled per image
Fig. 2. Some sampling results by the proposed biased sampling strategy. Sampling results for a certain category are labeled with the category's name.
here). The results in Table 2 also show that, by incorporating the MIML learning algorithm, the accuracy of multi-object categorization can be further improved. The further comparison between our system and the explicit segmentation-based reasoning method [5] illustrates the potential of improving the global statistics-based framework on the task of multi-object categorization.
6 Conclusions
In this paper, we have addressed two problems in statistics-based BoF methods simultaneously: scene modeling and the co-existence of multiple categories. We have utilized a biased sampling strategy which combines both bottom-up saliency information and a top-down loose class prior to robustly put more focus on objects of interest. We then integrated this biased sampling component with a multi-instance multi-label learning and classification algorithm. With the proposed algorithm, we can perform the categorization of multiple objects in a scene without explicit semantic segmentation. The experimental results on VOC2007 demonstrate that the proposed algorithm can effectively solve the above-mentioned problems, while at the same time keeping the robustness of BoF methods. Acknowledgments. This research was partially supported by the China Scholarship Council, the State Key Program of National Natural Science of China (Grant No. 60635050), and NIH under the Genes, Environment and Health Initiative of the US (Grant No. U01HL91736). The last author was partially supported by the National Science Foundation of the USA.
References 1. Dance, C., Williamowski, J., Fan, L., Bray, C., Csurka, G.: Visual categorization with bags of keypoints. In: Proc. ECCV 2004 Intl. Workshop on Statistical Learning in Computer Vision, pp. 59–74 (2004)
2. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. IJCV 73(2), 213–238 (2007) 3. Shotton, J., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proc. CVPR 2008 (2008) 4. Galleguillos, C., Babenko, B., Rabinovich, A., Belongie, S.: Weakly supervised object localization with stable segmentations. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 193–207. Springer, Heidelberg (2008) 5. Zhou, Z., Zhang, M.: Multi-instance multi-label learning with application to scene classification. In: Proc. NIPS 2006, pp. 1609–1616 (2006) 6. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context. In: Proc. ICCV 2007 (2007) 7. Li, J., Wang, J.Z.: Real-time computerized annotation of pictures. IEEE Trans. on PAMI 30(6), 985–1002 (2008) 8. Yang, L., Zheng, N., Yang, J., Chen, M., Cheng, H.: A biased sampling strategy for object categorization. In: Proc. ICCV 2009 (2009) 9. Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results (2007), http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html 10. Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV 43(1) (2001) 11. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features images classification. In: Proc. ECCV 2006, vol. 4, pp. 490–503 (2006) 12. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004) 13. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60(1), 63–86 (2004) 14. Zhang, M., Zhou, Z.: Multi-label learning by instance differentiation. In: Proc. AAAI 2007, pp. 669–674 (2007) 15. Mairal, J., Leordeanu, M., Bach, F., Ponce, J., Hebert, M.: Discriminative sparse image models for class-specific edge detection and image interpretation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 43–56. Springer, Heidelberg (2008) 16. Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image classification. IEEE Trans. on PAMI 30(9), 1632–1646 (2008) 17. Parikh, D., Zitnick, L., Chen, T.: Determining patch saliency using low-level context. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 446–459. Springer, Heidelberg (2008) 18. Kang, F., Jin, R., Sukthankar, R.: Correlated label propagation with application to multi-label learning. In: Proc. CVPR 2006, pp. 1719–1726 (2006) 19. Chen, Y., Bi, J., Wang, J.Z.: Miles: Multiple-instance learning via embedded instance selection. IEEE Trans. on PAMI 28(12), 1931–1947 (2006) 20. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on PAMI 20, 1254–1259 (1998) 21. Marszalek, M., Schmid, C., Harzallah, H., van de Weijer, J.: Learning object representations for visual object class recognition. In: Visual Recognition Challenge workshop, in conjunction with ICCV 2007 (2007)
Distance-Based Multiple Paths Quantization of Vocabulary Tree for Object and Scene Retrieval

Heng Yang¹, Qing Wang¹, and Ellen Yi-Luen Do²

¹ School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, P.R. China
² College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
[email protected]
Abstract. The state of the art in image retrieval on large scale databases is achieved by the work inspired by the text retrieval approaches. A key step of these methods is the quantization stage which maps the high-dimensional feature vectors to discriminatory visual words. This paper mainly proposes a distance-based multiple paths quantization (DMPQ) algorithm to reduce the quantization loss of the vocabulary tree based methods. In addition, a more efficient way to build a vocabulary tree is presented by using sub-vectors of features. The algorithm is evaluated on both the standard object recognition and the location recognition databases. The experimental results have demonstrated that the proposed algorithm can effectively improve image retrieval performance of the vocabulary tree based methods on both the databases.
1 Introduction We are interested in the issue of image retrieval in large-scale databases. Given a query image which contains either a particular object or a scene from a place of interest, our goal is to return from the large database a set of highly content-related images in which that object or scene appears. However, this problem becomes difficult when one requires a search over a very large database in acceptable time. In general, the standard approach to this problem is first to represent the images by high-dimensional local features and then to match images by dealing with millions of feature vectors. Several successful image and video retrieval systems have recently been reported [1-8]. These methods mimic text-retrieval approaches using the analogy of "visual words", which are actually defined by specific feature vectors and can be considered as a division of the high-dimensional Euclidean space. Sivic and Zisserman [1] first employed a text retrieval approach for video object recognition. The feature vectors are quantized into bags of visual words, which are defined by the k-means clustering method on feature vectors from the training frames. Then the standard TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme, which downweights the contribution of commonly occurring words, is used for scoring the relevance of an image to the query one. Nistér and Stewenius [2] further developed
the method of [1]. They designed a hierarchical vector quantization method based on a vocabulary tree, so that a much larger and more discriminatory vocabulary can be used efficiently, which improves image search quality dramatically. Similarly, Moosmann et al. [4] employed a forest of random trees to rapidly and distinctively assign descriptors to clusters. Schindler et al. [3] used the same data structure as the vocabulary tree for a large-scale location recognition application. In particular, they presented a Greedy N-Best Paths (GNP) algorithm to improve the retrieval performance of the traditional vocabulary tree algorithm by considering more candidates instead of one at each level of the tree. Philbin et al. [8] introduced the soft-assignment technique so that a high-dimensional descriptor can be mapped to a weighted combination of visual words, rather than hard-assigned to a single word. This method improves retrieval performance, since it allows the inclusion of features which would be lost in the quantization step of hard-assignment methods. Chum et al. [5] brought the query expansion technique, which is a standard method in text retrieval systems, into the visual domain to improve retrieval performance. They utilized the spatial constraints between the query image and each returned image and used these verified images to learn a latent feature model which controlled the construction of expanded queries. In summary, there are four key stages in current successful image retrieval systems based on the bag-of-visual-words model: 1) build a large and discriminatory visual vocabulary using the training data; 2) quantize the feature vectors of the database into their corresponding visual words; 3) use the TF-IDF scheme to score the similarity between images; 4) employ well-known technologies to further refine the retrieval results, such as spatial verification and query expansion. In this paper, we only focus on the first two stages and improve the image retrieval quality via two contributions in two fundamentally different ways: a more sophisticated quantization method for higher retrieval performance, and a way to build the vocabulary tree from sub-vectors of features for more efficiency. The remainder of this paper is organized as follows. Section 2 briefly reviews the traditional vocabulary tree algorithm and Section 3 presents our approach in detail. The experimental results and related discussions are given in Section 4. Finally, the conclusion is summarized in Section 5.
2 Traditional Vocabulary Tree Algorithm The traditional vocabulary tree [2] is built by hierarchical k-means clustering on SIFT [9] feature vectors from the training data. A vocabulary tree is a k-way tree of depth L; therefore, there are k^L leaf nodes (visual words) at the bottom of the tree. A SIFT feature is propagated down from the root node to a leaf node by comparing, at each level, the feature vector to the k candidate cluster centers and choosing the closest one. The path down the tree is then encoded by an integer which represents a specific visual word. Finally, the relevance score between a query image and a database image is computed using the standard TF-IDF scheme.
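For reference, this traditional quantization step can be sketched as a greedy descent of the k-way tree. The node layout below (an implicit array of per-node child centers) is a hypothetical implementation detail for illustration, not the data structure of [2].

```python
import numpy as np

def quantize(feature, child_centers, k, L):
    # Descend the k-way tree for L levels, always taking the closest child center;
    # the concatenated choices encode the path as one integer visual word.
    # child_centers[node] is assumed to hold a k x D array of that node's child centers.
    node, word = 0, 0
    for _ in range(L):
        best = int(np.argmin(np.linalg.norm(child_centers[node] - feature, axis=1)))
        word = word * k + best
        node = node * k + best + 1    # implicit k-ary indexing (an assumption)
    return word
```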
Fig. 1. Illustration of quantization loss in traditional vocabulary tree algorithm. Points A to D represent cluster centers and points 1, 2, 3 are feature vectors, respectively. The cross lines are the bounds defined by cluster centers A to D together. Points 1, 2, 3 will be assigned to different visual words despite them being close to each other.
However, there exist two drawbacks in the traditional vocabulary tree. The first and main drawback is the inaccurate quantization scheme which decreases the retrieval quality. At every level of the tree, the quantization loss will be inevitable when the feature vectors are located on the boundaries that are defined by the cluster centers (see Figure 1 for illustration). In addition, the memory cost for loading a large vocabulary tree is too high. In [2], loading a vocabulary tree with 10 branches and 6 levels occupies as much as 143MB of memory.
3 Our Approach Based on the analysis of the drawbacks of the traditional vocabulary tree, our approach is designed in this section to address the two issues of quantization loss and low efficiency. The proposed approach has three steps. Firstly, a large vocabulary tree is built based on sub-vectors. Then, feature vectors are assigned to visual words by the proposed quantization algorithm. Finally, the standard TF-IDF scheme is employed to give the relevance score between each database image and the query one. 3.1 Building Vocabulary Tree Using Sub-Vectors of the Features There is an intuitive assumption in our approach. Assumption: If some of the corresponding sub-vectors of the features are respectively close, the whole feature vectors should be close to each other. This assumption has been employed in nearest neighbor searching in high-dimensional space [10]. Intuitively, the probability that this assumption holds increases with the dimensionality of the sub-vector (denoted as Dsub). The experimental result (see Figure 2 in Section 4.1 for details) has validated this assumption, and the gain in retrieval performance is negligible when Dsub ≥ 60. That means that a vocabulary built from the sub-vectors is as discriminatory as one built from the whole vectors. The procedure for building a vocabulary tree is described in Algorithm 1. It is worth mentioning that a bisecting k-means algorithm [11] is applied in our method instead of the regular k-means algorithm. Bisecting k-means has some advantages over k-means, such as being more efficient and producing clusters with smaller
entropy. Furthermore, the most important merit is that it tends to produce clusters of similar sizes, while k-means is known to produce clusters of widely different sizes. This merit is very important for training a large vocabulary tree. One purpose of building such a tree is to create as many leaf nodes (visual words) as possible, whereas k-means often produces null clusters when k is large and the number of features in the current node is small. The total memory cost of a vocabulary tree is linear in the dimensionality of the feature vector [2]. Therefore, building a vocabulary tree based on sub-vectors (Dsub = 60) can save more than half of the memory cost of the traditional method (D = 128).

Algorithm 1. Building Vocabulary Tree Based on Sub-Vectors
Initialization: All the training features are loaded in the root node; set current level l = 0, start position of sub-vector sp = 0, Dsub = 60 and SIFT vector length D = 128.
① Use the bisecting k-means approach [11] to partition the features in the current node into k clusters according to the sub-vector at position [sp, sp + Dsub − 1] of each feature;
② Assign the features of the k clusters to the k children of the current node, respectively;
③ l = l + 1;
④ sp = (sp + Dsub) % (D − Dsub + 1);
⑤ Apply the same process to each child of the current node recursively; it terminates when l equals the maximum number of levels L.
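A minimal Python sketch of Algorithm 1 is given below. The bisecting k-means is realized by repeatedly splitting the largest cluster with a 2-way k-means (scikit-learn assumed), and the dictionary-based node layout and the leaf-id assignment are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    # Repeatedly bisect the largest cluster until k clusters are obtained.
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        if len(idx) < 2:                  # cannot split further
            clusters.append(idx)
            break
        labels = KMeans(n_clusters=2, n_init=4).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

def build_tree(features, k, L, D_sub=60, D=128, level=0, sp=0):
    # Algorithm 1: cluster on the sub-vector [sp, sp + D_sub - 1], cycling sp per level.
    node = {"sp": sp, "centers": [], "children": [], "word": None}
    if level == L or len(features) < k:
        return node                       # leaf node: becomes a visual word
    sub = features[:, sp:sp + D_sub]
    next_sp = (sp + D_sub) % (D - D_sub + 1)
    for idx in bisecting_kmeans(sub, k):
        node["centers"].append(sub[idx].mean(axis=0))
        node["children"].append(build_tree(features[idx], k, L,
                                           D_sub, D, level + 1, next_sp))
    return node

def assign_word_ids(node, start=0):
    # Give every leaf a unique integer visual-word id; returns the next free id.
    if not node["children"]:
        node["word"] = start
        return start + 1
    for child in node["children"]:
        start = assign_word_ids(child, start)
    return start
```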
3.2 Distance-Based Multiple Paths Quantization The reason for the quantization loss in the traditional vocabulary tree algorithm is that it only chooses the closest candidate at each level. An improved method is the Greedy N-Best Paths (GNP) algorithm [3], which chooses the closest N nodes at each level by comparing k×N candidates. The GNP algorithm can reduce the quantization loss because it considers more nodes in traversing the tree, which decreases the risk that feature vectors which are near the bounds but close to each other are quantized to different visual words. However, there still exist two problems in the GNP algorithm. On one hand, the number of candidate paths N is a constant, which is not flexible. For example, in the case that a feature vector is much nearer to its nearest cluster center than to its second nearest one, it is only necessary to consider the nearest center as the candidate instead of N candidates, for the nearest one is discriminative enough. On the other hand, there still exists a risk of assigning close feature vectors to different visual words at the last level of the tree, since the GNP algorithm finally returns the single closest candidate at the leaf level. To address the quantization issues mentioned above, we propose the distance-based multiple paths quantization (DMPQ) algorithm. Our approach is described in Algorithm 2, where d^l_{q,nn_m} denotes the distance from the corresponding sub-vector of the feature q to its m-th nearest neighbor cluster center at level l. The DMPQ algorithm dynamically chooses the discriminative candidates at each level according to the distance ratio, which makes DMPQ more efficient. Furthermore, DMPQ discards the ambiguous features which are located around the bounds (measured by the distance ratio between d^L_{q,nn_1} and d^L_{q,nn_2}) at the leaf level of the tree. In this way,
DMPQ can even act as a filter, retaining only the unambiguous feature vectors that have a low risk of quantization loss. Note that the traditional vocabulary tree is just the special case of our approach in which tc = 1 (so that m = 1) and td = 1, and GNP is the case in which tc = 0 and td = 1. The parameters tc and td of DMPQ are discussed in detail in Section 4.1.

Algorithm 2. Distance-Based Multiple Paths Quantization (DMPQ)
Initialization: A given feature q, level l = 1, maximum number of paths M, threshold tc for choosing candidates, and threshold td for discarding ambiguous words.
① Compute the distances from the corresponding sub-vector of q to all children of the root;
② While (l < L) {
③   Find the m (1 ≤ m ≤ M) closest candidates at level l that satisfy
      d^l_{q,nn_1} / d^l_{q,nn_m} ≥ tc > d^l_{q,nn_1} / d^l_{q,nn_m+1};
④   l = l + 1;
⑤   Compute the distances from the corresponding sub-vector of q to all k×m candidates at level l;
⑥ }
⑦ If (d^L_{q,nn_1} / d^L_{q,nn_2} > td) discard q;
⑧ Else q is quantized to the closest candidate at level L.
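Below is a rough Python sketch of the DMPQ traversal of Algorithm 2. It assumes the node layout of the vocabulary-tree sketch above (internal nodes store "sp", "centers" and "children"; leaves are visual-word indices); the helper names and the handling of degenerate cases are ours.

```python
import numpy as np

def dmpq_quantize(q, root, L=6, M=10, t_c=0.6, t_d=0.9, D_sub=60):
    """Quantize descriptor q with the DMPQ traversal; return a visual-word index,
    or None when q is judged ambiguous at the leaf level."""
    candidates = [root]
    scored = []
    for _ in range(L):
        scored = []
        for node in candidates:
            if not isinstance(node, dict):            # an early leaf; skipped for simplicity
                continue
            sub = q[node["sp"]:node["sp"] + D_sub]
            dists = np.linalg.norm(node["centers"] - sub, axis=1)
            scored.extend(zip(dists, node["children"]))
        scored.sort(key=lambda pair: pair[0])
        d1 = scored[0][0]
        # keep the m closest candidates (m <= M) while d1 / d_m >= t_c
        m = 1
        while m < min(M, len(scored)) and d1 / max(scored[m][0], 1e-12) >= t_c:
            m += 1
        candidates = [child for _, child in scored[:m]]
    # leaf level: discard q when its two nearest visual words are nearly equidistant
    if len(scored) > 1 and scored[0][0] / max(scored[1][0], 1e-12) > t_d:
        return None
    return scored[0][1]
```

With t_c = 1 the loop keeps a single path (the traditional tree), and with t_c = 0 it keeps M paths at every level (GNP), which mirrors the special cases noted above.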
3.3 Scoring Scheme

The standard TF-IDF scheme has been successfully applied to infer the relevance score between images. We also follow the TF-IDF scheme; the j-th elements of the query visual word vector q and the database visual word vector d are given as follows:
q_j = \frac{n_{jq}}{n_q} \log \frac{N}{N_j}    (1)

d_j = \frac{n_{jd}}{n_d} \log \frac{N}{N_j}    (2)
where n_{jq} and n_{jd} denote the number of occurrences of the j-th visual word in the query image and the database image, respectively; n_q and n_d are the numbers of features in the query image and the database image, respectively; N_j is the number of database images in which the j-th visual word occurs; and N is the total number of database images. As given in Eq. (3), the relevance score measuring the similarity between query image I_q and database image I_d is the scalar product of the normalized vectors \bar{q} and \bar{d}:

Score(I_q, I_d) = \bar{q} \cdot \bar{d} = \sum_{j \,|\, q_j \neq 0,\, d_j \neq 0} \bar{q}_j \bar{d}_j    (3)
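A small sketch of this scoring, in Python, is given below. The inputs are plain word-count dictionaries, and the L2 normalization is our choice, since the text only states that the vectors are normalized before the scalar product.

```python
import numpy as np

def tfidf_vector(word_counts, n_features, doc_freq, n_images, vocab_size):
    """word_counts[j] = number of occurrences of visual word j in the image;
    doc_freq[j] = number of database images containing word j (assumed >= 1)."""
    v = np.zeros(vocab_size)
    for j, n_j in word_counts.items():
        v[j] = (n_j / n_features) * np.log(n_images / doc_freq[j])
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def relevance_score(q_counts, n_q, d_counts, n_d, doc_freq, n_images, vocab_size):
    """Eq. (3): scalar product of the normalized query and database vectors."""
    q = tfidf_vector(q_counts, n_q, doc_freq, n_images, vocab_size)
    d = tfidf_vector(d_counts, n_d, doc_freq, n_images, vocab_size)
    return float(np.dot(q, d))
```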
4 Experimental Results and Analysis

To evaluate the performance of the proposed algorithm, we use two challenging image databases. The first is the object recognition benchmark provided by Nistér et al. [2]. It contains 10,200 images in groups of four that belong together; in each group, the same object is photographed from different positions or under varying illumination conditions. The second is the Ljubljana urban image database for location recognition provided by Omercevic et al. [12]. It consists of 612 images of an urban environment covering an area of 200×200 square meters; at each of the 34 standpoints, 18 images were captured at 6 orientations and 3 tilt angles. The widely used SIFT algorithm [9] is employed for both local feature detection and description. Our experiments address three aspects. First, we tune the parameters of our approach on the training data, a subset of the object recognition database. Second, we evaluate the retrieval quality of our method on the whole large database. Third, we run our algorithm on the Ljubljana urban database to further examine its performance in a location recognition application. All experiments are executed on a PC with a Pentium IV dual-core 2.0 GHz processor and 2 GB memory.

4.1 Parameter Setting on the Training Data Set
Our approach has three important parameters: the sub-vector dimensionality Dsub used in building the vocabulary tree, and the two thresholds tc and td of the DMPQ quantization algorithm. We choose the first 2,000 images of the object recognition database [2] both as the benchmark for parameter setting and as the offline training data set for building the vocabulary tree. From these 2,000 images, 1.5M SIFT feature vectors are generated. Retrieval quality is measured by the Average Retrieval Accuracy (ARA) computed by Eq. (4), where cnt_i denotes how many of the four most similar retrieved images are in the same group as the i-th image (including the i-th image itself), and n is the number of database images:

ARA = \frac{1}{n} \sum_{i=1}^{n} \frac{cnt_i}{4}    (4)
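A toy sketch of Eq. (4), assuming each database image belongs to a group of four and that the ranked retrieval lists are already available (all names are ours):

```python
import numpy as np

def average_retrieval_accuracy(ranked_lists, group_of):
    """ranked_lists[i]: database indices ranked by similarity to query i (query included);
    group_of[i]: group label of image i. Implements Eq. (4)."""
    cnt = [sum(group_of[j] == group_of[i] for j in ranked_lists[i][:4])
           for i in range(len(ranked_lists))]
    return float(np.mean(cnt)) / 4.0
```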
In all experiments in this section, the vocabulary trees using either whole vectors or sub-vectors are built with 10 branches and 6 levels, which results in about 10^6 visual words. In addition, the number of paths N in GNP and the maximum number of paths M in DMPQ are both set to 10. Different vocabulary trees can be built with different Dsub using Algorithm 1 described in Section 3.1. Figure 2 gives the ARA curves on the 2,000 training images for different settings of Dsub; the quantization method is the same as in the traditional vocabulary tree. We can see that the average retrieval accuracy increases with Dsub, but the gain becomes negligible when Dsub ≥ 60. That is to say, Algorithm 1 with Dsub = 60 efficiently builds a visual vocabulary that is as effective and discriminative as one built from the whole feature vectors. Apart from the setting of Dsub, we also need to examine the two thresholds tc and td in DMPQ. Figure 3 gives the retrieval performance for different settings
of tc and td, respectively. In both Figure 3(a) and (b), Dsub is set to 128. From Figure 3(a) (with td fixed to 1), we can see that as tc varies from 0 (the GNP case) to 1 (the traditional vocabulary tree case), there is a peak around tc = 0.6. The reason is that tc gives our algorithm the chance to consider multiple candidates rather than one, while at the same time preventing some of the candidates examined by GNP from being considered, so that the retained candidates are more distinctive and the results more accurate. From Figure 3(b) (with tc fixed to 0), the best result is reached when td is around 0.9, because td helps DMPQ discard the ambiguous feature vectors located near cell boundaries at the leaf level, which yields a more accurate visual word vector for an image. td should be set close to 1; otherwise the algorithm discards too many discriminative feature vectors that lie far from the boundaries. In summary, the parameter settings of our method are Dsub = 60, tc = 0.6 and td = 0.9.
Fig. 2. Average Retrieval Accuracy (ARA) versus Dsub
(a) ARA vs. tc    (b) ARA vs. td
Fig. 3. ARA versus tc and td in the DMPQ algorithm
4.2 Performance Comparison on the Whole Large-Scale Database
We have already built a vocabulary tree using the subset (the first 2,000 images) of the object recognition database. We now compare the performance of our algorithm with the traditional vocabulary tree [2] and the traditional vocabulary tree with the GNP algorithm
[3] on the whole database, which contains up to 10,200 images. The vocabulary trees of the three algorithms have the same shape, with 10 branches and 6 levels. The parameters N in GNP and M in DMPQ are both set to 10, and the three key parameters of our algorithm are set to Dsub = 60, tc = 0.6 and td = 0.9. The performance of the three algorithms as the number of images grows is shown in Figure 4. We can see that the GNP algorithm indeed improves on the traditional vocabulary tree, and that our algorithm gives the best results: even when the database size reaches 10K, our algorithm obtains about 80% average retrieval accuracy.
Fig. 4. Performance of our algorithm compared to the traditional vocabulary tree and the traditional vocabulary tree with GNP on the whole object recognition database
4.3 Performance Comparison on Ljubljana Urban Database
One may doubt whether a vocabulary tree trained on one database is applicable to another. To validate this, we apply the vocabulary tree trained on the object recognition database to the Ljubljana urban database [12]. The urban database is intended for location recognition and consists of 612 images of an urban environment covering an area of 200×200 square meters; at each of the 34 standpoints, 18 images were captured at 6 orientations and 3 tilt angles. Figure 5 gives example images taken at the first standpoint. For every standpoint, we choose the middle two images as queries (such as Q1 and Q2 in Figure 5) and define their neighbors as the retrieval ground truth; there are thus 68 query images in total. Figure 6 shows the performance of the proposed algorithm, the traditional vocabulary tree, and the traditional vocabulary tree with GNP; the parameter settings of the three algorithms are the same as in Section 4.2. The curves in Figure 6 show the average number of ground truth images among the n top-ranked retrieved images, with n ranging from 5 to 30. The average scores of all three algorithms do not exceed 4.5 because of the weak links between the database images: although the query images and their neighbors are taken at the same standpoint, their content is sometimes only weakly related. Nevertheless, all three algorithms achieve satisfactory performance on this database, returning on average more than three correct location images, and our algorithm shows the best results of the three methods. As retrieval examples, Figure 7 lists the four most similar retrieved images for three query samples using the proposed algorithm.
Fig. 5. Example images taken at one standpoint. The middle two are taken as query images. For Q1, the 9 nearest images (including itself) within the big blue rectangle are defined as its neighbors; similarly, the images within the big red dashed rectangle are Q2's neighbors. The neighbors are considered the ground truth images of the query.
Fig. 6. Retrieval performance of the three algorithms on the Ljubljana urban database
Fig. 7. Three retrieval examples on the Ljubljana urban database using the proposed algorithm. The query samples are listed in the first column and the top four similar images retrieved by our algorithm are listed in the second to fifth columns. The first example shows the perfect performance of our algorithm: all the retrieved images are neighbors of the query and are indeed highly content-related. The second example also shows retrieval quality as good as the first, but the image within the blue dashed rectangle (in fact content-related to the query) is counted as a wrong result because it was taken at a different standpoint from the query. The third example shows the worst performance of the three: the last two retrieved images (within the blue dashed rectangles) are wrong results, being neither content-related nor taken at the same standpoint as the query image.
5 Conclusion

The main contribution of this paper is a distance-based multiple paths quantization algorithm that addresses the quantization loss of the traditional vocabulary tree method. The DMPQ algorithm dynamically chooses discriminative candidates at each level according to the distance ratio and discards the ambiguous features located near cell boundaries at the leaf level of the tree. In addition, a more efficient way to build the vocabulary tree, based on sub-vectors, is introduced; it not only creates a large and discriminative visual vocabulary but also saves much memory. Experimental results have demonstrated the efficiency and effectiveness of our algorithm in image-retrieval-based applications such as object recognition and scene recognition. In future work, we plan to apply our method to internet-scale image databases, such as the photo-sharing website Flickr [13].
Acknowledgments This work is supported by National Natural Science Fund (60873085) and National Hi-Tech Development Programs under grant No. 2007AA01Z314, P. R. China. The first and second authors are also supported by China State Scholarship fund.
References
1. Sivic, J., Zisserman, A.: A Text Retrieval Approach to Object Matching in Videos. In: ICCV, vol. 2, pp. 1470–1477 (2003)
2. Nistér, D., Stewénius, H.: Scalable Recognition with a Vocabulary Tree. In: CVPR, vol. 2, pp. 2161–2168 (2006)
3. Schindler, G., Brown, M., Szeliski, R.: City-Scale Location Recognition. In: CVPR (2007)
4. Moosmann, F., Triggs, B., Jurie, F.: Randomized Clustering Forests for Building Fast and Discriminative Visual Vocabularies. In: NIPS (2006)
5. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. In: ICCV (2007)
6. Jegou, H., Harzallah, H., Schmid, C.: A Contextual Dissimilarity Measure for Accurate and Efficient Image Search. In: CVPR (2007)
7. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object Retrieval with Large Vocabularies and Fast Spatial Matching. In: CVPR (2007)
8. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. In: CVPR (2008)
9. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV 60 (2004)
10. Yang, H., Wang, Q., He, Z.: Randomized Sub-Vectors Hashing for High-Dimensional Image Feature Matching. In: ACM MM, pp. 705–708 (2008)
11. Li, Y., Chung, S.M.: Parallel Bisecting K-means with Prediction Clustering Algorithm. The Journal of Supercomputing 39, 19–37 (2007)
12. Omercevic, D., Drbohlav, O., Leonardis, A.: High-Dimensional Feature Matching: Employing the Concept of Meaningful Nearest Neighbors. In: ICCV (2007)
13. http://www.flickr.com
Image-Set Based Face Recognition Using Boosted Global and Local Principal Angles

Xi Li¹, Kazuhiro Fukui², and Nanning Zheng¹

¹ Xi'an Jiaotong University, China
[email protected], [email protected]
² University of Tsukuba, Japan
[email protected]
Abstract. Face recognition using an image-set or video sequence as input tends to be more robust, since an image-set or video sequence provides much more information than a single snapshot about the variation in the appearance of the target subject. The distribution of such an image-set usually resides approximately in a low-dimensional linear subspace, and the distance between image-set pairs can be defined via the principal angles between the corresponding subspace bases. Inspired by the work of [4,14], this paper presents a robust framework for image-set based face recognition using boosted global and local principal angles. The original multi-class classification problem is first transformed into a binary classification task in which the positive class is the principal angle based intra-class subspace “difference” and the negative one is the principal angle based inter-class subspace “difference”. The principal angles are computed not only globally for the whole pattern space but also locally for a set of partitioned sub-patterns. The discriminative power of each principal angle, for the global pattern and for each local sub-pattern, is explicitly exploited by learning a strong classifier in a boosting manner. Extensive experiments on real-life data sets show that the proposed method outperforms previous state-of-the-art algorithms in terms of classification accuracy.
1 Introduction

This paper presents a robust framework for image-set based face recognition using boosted global and local principal angles. Face recognition has been an active research field owing to its various real-life applications, such as human-computer interfaces and non-intrusive public security. Traditional methods for face recognition include Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA) and the Bayesian method [1,2,3,4]. Recently, face recognition using an image-set or video sequence has attracted increasing attention within the computer vision and pattern recognition community [5,6,7,8,9,10,11,12]. One reason is that the availability of cheap modern capture devices makes video a natural choice of input for face recognition tasks, for example in surveillance scenarios. More importantly, compared with a single snapshot, a set or a sequence of images provides much more information about the variation in the appearance of the target subject.
This variation always exists in subject recognition or visual surveillance applications, where multiple shots of the target subject under varying illumination and facial expressions, or surveillance system output over long periods of time, are available. Previous studies show that more robust recognition can be achieved by fully exploiting this kind of information [6]. It is well known that the appearance distribution of an image-set or sequence of a target subject captured under changing facial expressions and varying illumination conditions can be approximately represented by a low-dimensional linear subspace. The principal angles between subspace pairs can then be used as a distance measure between the corresponding image-set pairs. Many algorithms have been proposed for robust recognition using principal angle based subspace distances. A noteworthy work is the Mutual Subspace Method (MSM) presented by Yamaguchi et al. [6]. In MSM, each image set is represented by the linear subspace spanned by the principal components of the data, and the smallest principal angle between subspaces is used as the distance measure. The original MSM can be further improved, as in the Constrained Mutual Subspace Method (CMSM) [8]. Instead of directly applying classification with the principal angle based subspace distance, the underlying idea of CMSM is to learn a linear transformation such that, in the transformed space, the inter-class subspace distances are larger than in the original feature space; that is, the subspace bases of different classes become more orthogonal to each other. These methods were further extended to their non-linear counterparts by using the kernel trick, as in [7,9]. Kim et al. [10] borrowed the idea of Linear Discriminant Analysis and presented an alternative method that iteratively minimizes the principal angles of within-class sets while maximizing the principal angles of between-class sets.

Inspired by the work of [4,14], this paper presents a robust framework for image-set based face recognition using boosted global and local principal angles. The original multi-class classification problem is first transformed into a binary classification task in which the positive class is the principal angle based intra-class subspace “difference” and the negative one is the principal angle based inter-class subspace “difference”. The principal angles are computed not only globally for the whole pattern space but also locally for a set of partitioned sub-patterns. This scheme is robust to local variations and, to some extent, allows the inherently linear principal angle based methods to describe non-linear patterns faithfully. Furthermore, the discriminative power of each principal angle, for the global pattern and for each local sub-pattern, is explicitly exploited and appropriately aggregated by learning a strong classifier in a boosting manner. Extensive experiments on real-life data sets show that the proposed method outperforms previous state-of-the-art algorithms in terms of classification accuracy.

The rest of this paper is organized as follows. Section 2 reviews the concept of principal angles between the subspace bases of corresponding image-sets and discusses previous principal angle based image-set classification methods and their drawbacks. Section 3 describes the proposed method in detail. Section 4 presents the experimental results, and Section 5 draws the conclusion.
2 Image-Set Based Face Recognition Using Principal Angles

We first define the image-set based face recognition problem as follows. Given C subjects, we have n input image-sets X_i ∈ R^{rc×h_i}, i = 1, ..., n, with corresponding subject identities y_i ∈ {1, 2, ..., C}, i = 1, ..., n, where rc is the image vector dimension and h_i is the number of images in the i-th image set (or the length of the i-th sequence). A face image obtained from one view can be represented as a point in a high-dimensional feature space, where an r × c pixel pattern is treated as a vector x in rc-dimensional space. Each input image set has an underlying k-dimensional subspace structure, denoted U_i ∈ R^{rc×k}, i = 1, ..., n, which approximately describes the variations in appearance caused by different illuminations, varying poses, facial expressions, and so on. For a test image-set X_test, the task is to predict its corresponding identity y_test.

2.1 Principal Angles between Subspace Pairs

In the computer vision community, the concept of principal angles [13] has recently been used as a distance measure for matching two image sets or sequences, each of which can be approximated by a linear subspace [6]. If the principal angles between the two subspaces derived from two image-sets are small enough, the two sets are considered similar. Generally, let U_A, U_B represent two k-dimensional linear subspaces. The principal angles 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_k ≤ π/2 between the two subspaces are uniquely defined as [6,13]:

\cos^2(\theta_i) = \max_{\substack{u_i^A \perp u_j^A,\; j=1,\dots,i-1 \\ u_i^B \perp u_j^B,\; j=1,\dots,i-1}} \frac{(u_i^A \cdot u_i^B)^2}{\|u_i^A\|^2 \|u_i^B\|^2}    (1)

where u_i^A ∈ U_A and u_i^B ∈ U_B. Denote by Dis(X_A, X_B) the principal angle based distance between image-sets X_A and X_B with corresponding subspace bases U_A and U_B. It is a function of the principal angles, Dis(X_A, X_B) = f(θ_1, θ_2, ..., θ_k). Different image-set based recognition methods use different empirical forms of the function f, as discussed in the next subsection.
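Numerically, the principal angles of Eq. (1) are usually obtained from the singular values of U_A^T U_B when both bases have orthonormal columns. The following is a standard sketch of this computation, not the authors' code:

```python
import numpy as np

def principal_angles(UA, UB):
    """UA, UB: (D, k) bases with orthonormal columns. Returns theta_1 <= ... <= theta_k."""
    cosines = np.linalg.svd(UA.T @ UB, compute_uv=False)   # descending cos(theta_i)
    cosines = np.clip(cosines, -1.0, 1.0)                  # guard against round-off
    return np.arccos(cosines)
```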
2.2 Image-Set Based Recognition Methods Using Principal Angles

The original baseline Mutual Subspace Method (MSM) [6] performs a nearest neighbor search in the original feature space without a feature extraction procedure. The distance between the subspace bases of an image-set pair is defined as the minimum principal angle, Dis(X_A, X_B) = θ_1. The Constrained Mutual Subspace Method (CMSM) [8] projects the original features onto the first d eigenvectors of G = \sum_{i=1}^{C} P_i = \sum_{i=1}^{C} U_i U_i^T, the sum of the projection matrices of all classes. The distance between the subspace bases of the transformed image-set pairs is defined as the mean of the t smallest principal angles, Dis(X_A, X_B) = \frac{1}{t}\sum_{i=1}^{t} \theta_i. Not only the dimension of the
generalized difference subspace but also the number t has a great effect on the recognition rate, and selecting an appropriate value is quite empirical and case-dependent. The Discriminative Analysis of Canonical Correlations (DCC) [10] is an alternative method that iteratively minimizes the principal angles of within-class sets and maximizes the principal angles of between-class sets. Generally, for a pair of subspace bases U_A, U_B of rank k, there exist k principal angles. None of the above methods considers the different discriminative power of each principal angle: they use either the minimum principal angle, or the mean or weighted mean of a truncated set of principal angles, as the similarity measure. The main drawback of these empirical schemes is that each principal angle based feature carries its own discriminative power; we should not only exploit as much of the discriminative information of each of them as possible, but also aggregate them in a more principled way than a simple mean or weighted mean. As an example, Figure 1 plots the effect of each individual principal angle on the recognition rate.
Fig. 1. The recognition performance when each individual principal angle is used as the distance measure, for a typical image-set based face recognition task
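For reference, the fixed aggregation rules used by the baselines discussed above amount to one-line functions of the principal angles (a sketch with our own function names); the paper's point is precisely that such fixed rules ignore the per-angle discriminative power visible in Fig. 1:

```python
import numpy as np

def msm_distance(thetas):
    """MSM: the smallest principal angle."""
    return float(np.min(thetas))

def cmsm_distance(thetas, t):
    """CMSM-style: the mean of the t smallest principal angles."""
    return float(np.mean(np.sort(thetas)[:t]))
```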
Since the original principal angles are computed directly on the global pattern and are linear in nature, they have the following inherent shortcomings. 1) They cannot describe non-linear real-life data faithfully. Real-life high-dimensional data, such as vectorized image data, is often inherently non-linear rather than simply normally distributed [18]. Wolf and Shashua [7] extended MSM to its non-linear counterpart by using the kernel method: an input pattern is first mapped into a higher-dimensional feature space via a non-linear map, and MSM is then applied to the linear subspaces generated from the mapped patterns. The performance depends on the form of the kernel function and on the parameters to be tuned. Fukui et al. further extended CMSM to the non-linear case [9]; Kernel-CMSM outperforms CMSM greatly, but at the expense of a prohibitive computational load that makes real-life applications of Kernel-CMSM quite difficult. 2) The original principal angle based methods operate on whole global patterns and are less robust to
local variations. Each local region also carries its own principal angle based discriminative power, which should be aggregated in a reasonable way.
3 Image-Set Based Face Recognition Using Boosted Global and Local Principal Angles

Based on the above analysis, and inspired by the work of [4,14], this paper presents a robust framework for image-set based face recognition using boosted global and local principal angles. The original multi-class classification problem is first transformed into a binary classification task in which the positive class is the principal angle based intra-class subspace “difference” and the negative one is the principal angle based inter-class subspace “difference”. Here the principal angles between subspace base pairs play the role of the facial image differences in [4]. The principal angles are computed not only globally for the whole pattern space but also locally for a set of partitioned sub-patterns. The discriminative power of each principal angle for the global and each local sub-pattern is explicitly exploited and appropriately aggregated by learning a strong classifier in a boosting manner; the different contributions that different local sub-patterns of the whole global face pattern make to the principal angle based classification are also weighted in the boosting manner.

3.1 The Principal Angle Based Intra-Class and Inter-Class Subspace Difference

Besides the traditional PCA, ICA and LDA methods [1,2,3], the Bayesian method proposed by Moghaddam et al. [4] is one of the most successful algorithms for face recognition. In their method, the original multi-class face recognition problem is first converted into a two-class problem. Based on a Gaussian distribution assumption, intra-personal and inter-personal subspaces are learned to describe the variation in difference images of the same individual and of different individuals, respectively, and the similarity score is evaluated as a posterior probability using the Bayesian rule; for more detail refer to [4]. Analogously, we denote by Θ_I the principal angle based intra-class subspace “difference”, which describes the principal angles between image-set pairs of the same subject, and by Θ_E the inter-class subspace “difference”, which describes the principal angles between image-set pairs of different subjects. More specifically, given a subspace pair U_i, U_j with principal angles PA(U_i, U_j) = [θ_1, θ_2, ..., θ_k],

PA(U_i, U_j) ∈ Θ_I, if C_i = C_j;    PA(U_i, U_j) ∈ Θ_E, if C_i ≠ C_j.

Due to their inherently linear nature, principal angles computed on the global image patterns cannot describe non-linear real-life face patterns faithfully and are not robust to local variations. We therefore partition the original face image into a set of equally sized sub-images as in [14]. All the sub-patterns sharing the same original components are then collected from the training set to compose the corresponding sub-pattern's
Fig. 2. The flowchart of the proposed method for image-set based face recognition using boosted global and local principal angles. The upper-left part illustrates how local sub-patterns are extracted using a sliding rectangular window, as in [14].
training set. For each sub-pattern's training set, the intra-class and inter-class subspace “differences” are computed in the same way as for the global counterpart. The global and local principal angles are concatenated to form the principal angle features of the intra-class and inter-class sets, respectively, which are provided as inputs to the boosting procedure described in the next subsection. To extract each sub-pattern, we slide a rectangular window over the whole image pattern in left-to-right and top-to-bottom order, as in [14]. Figure 2 illustrates the flowchart of the proposed method.
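As an illustration, the intra-class and inter-class principal-angle samples could be assembled from per-image-set subspace bases as in the sketch below; `principal_angles` is the SVD-based sketch given earlier, and the concatenation of global and local sub-pattern angles is left out for brevity.

```python
import numpy as np
from itertools import combinations

def build_angle_samples(subspaces, labels):
    """subspaces[i]: orthonormal basis of the i-th image-set; labels[i]: its subject id.
    Returns principal-angle feature vectors and +1/-1 labels (intra-/inter-class)."""
    X, y = [], []
    for i, j in combinations(range(len(subspaces)), 2):
        X.append(principal_angles(subspaces[i], subspaces[j]))
        y.append(+1 if labels[i] == labels[j] else -1)
    return np.array(X), np.array(y)
```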
3.2 Principal Angle Boosting

Boosting is a classifier ensembling method with successful applications such as face/object detection. In this paper, we treat each principal angle as an input feature for the inter-class (Θ_E) versus intra-class (Θ_I) classification problem and learn a strong classifier by boosting, combining the discriminative power of each principal angle. We use the RealBoost algorithm proposed in [15], an extension of the discrete AdaBoost algorithm. RealBoost uses confidence-rated weak classifiers with real-valued outputs, so the confidence of the strong classifier can be easily evaluated; here the confidence plays the role of the intra-personal and inter-personal likelihoods in [4]. The boosted learning procedure is as follows.

Training:
Input: the training data {Ω_i ∈ Γ}, i = 1, ..., n, and the corresponding labels y_i ∈ {−1, +1}, i = 1, ..., n, where Ω = {θ_g, θ_l1, θ_l2, ..., θ_lm}, θ_g is the principal angle feature of the global pattern and θ_lj, j = 1, ..., m, is the principal angle feature of the j-th local sub-pattern; the maximum number of boosting steps T.
Procedure:
1) Initialize the sample weights W_1(i) = 1/n, i = 1, ..., n.
2) Repeat for t = 1, ..., T:
   2.1) For each principal angle feature based weak classifier:
        a) Partition the space Γ into disjoint blocks Γ_1, Γ_2, ..., Γ_z.
        b) Under the current weighting W_t, calculate
           W_l^j = Pr(Ω_i ∈ Γ_j, y_i = l) = \sum_{i: Ω_i ∈ Γ_j, y_i = l} W_t(i), where l ∈ {−1, +1}.
        c) For all Ω ∈ Γ_j, set h(Ω) = \frac{1}{2} \ln\left(\frac{W_{+1}^j + ε}{W_{-1}^j + ε}\right), where ε is a small positive constant.
        d) Calculate the normalization factor Z = 2 \sum_j \sqrt{W_{+1}^j W_{-1}^j}.
   2.2) Select the weak learner with the minimum Z and set the corresponding function h as h_t.
   2.3) Update the sample weights as W_{t+1}(i) = W_t(i) exp(−y_i h_t(Ω_i)) and re-normalize so that the elements of W sum to 1.
3) The final strong classifier is H(Ω) = sign[\sum_{t=1}^{T} h_t(Ω)], and the confidence of its output is defined as Confidence(Ω) = |\sum_{t=1}^{T} h_t(Ω)|.

Testing: For a given test image-set, its principal angle based “difference” with each training image-set is computed and classified using the learned strong classifier. The label of the training image-set with the highest intra-class confidence score is taken as the output label.
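A compact Python sketch of this confidence-rated boosting loop is given below. Each weak learner partitions one principal-angle feature into z equal-width bins and outputs the log-ratio of the weighted class masses per bin; the equal-width binning and all names are our own simplifications.

```python
import numpy as np

EPS = 1e-6

def train_realboost(X, y, T=100, z=16):
    """X: (n, F) concatenated global/local principal-angle features; y: (n,) in {-1, +1}."""
    n, F = X.shape
    w = np.full(n, 1.0 / n)
    learners = []
    for _ in range(T):
        best = None
        for f in range(F):
            edges = np.linspace(X[:, f].min(), X[:, f].max(), z + 1)
            bins = np.digitize(X[:, f], edges[1:-1])          # block index in 0..z-1
            Wp = np.array([w[(bins == b) & (y == +1)].sum() for b in range(z)])
            Wm = np.array([w[(bins == b) & (y == -1)].sum() for b in range(z)])
            Z = 2.0 * np.sqrt(Wp * Wm).sum()
            if best is None or Z < best[0]:
                h = 0.5 * np.log((Wp + EPS) / (Wm + EPS))     # confidence per block
                best = (Z, f, edges, h)
        _, f, edges, h = best
        bins = np.digitize(X[:, f], edges[1:-1])
        w *= np.exp(-y * h[bins])                             # re-weight the samples
        w /= w.sum()
        learners.append((f, edges, h))
    return learners

def strong_classify(x, learners):
    """Return the predicted label and the confidence |sum_t h_t(x)|."""
    s = sum(h[np.digitize(x[f], edges[1:-1])] for f, edges, h in learners)
    return (1 if s >= 0 else -1), abs(float(s))
```

At test time, the confidence returned for the intra-class hypothesis is what the nearest-training-set decision described above is based on.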
Fig. 3. Classification rate comparison between MSM, KMSM, CMSM, DCC, the proposed method with global principal angle boosting only (PABM-G), and the proposed method with both global and local principal angle boosting (PABM-GL), for (a) the CMU-PIE dataset, (b) the YaleB dataset, and (c) a self-collected face video dataset.
4 Experimental Results

In this section we test the proposed method on real-life facial image databases: the CMU-PIE database [16], the YaleB face database [17] and a self-collected face video database. We compare the performance of the following algorithms: 1) the baseline Mutual Subspace Method (MSM) [6]; 2) the Kernel Mutual Subspace Method (KMSM) [7]; 3) the Constrained Mutual Subspace Method (CMSM) [8]; 4) the
Discriminative Analysis of Canonical Correlations (DCC) [10]; 5) the proposed Principal Angle Boosting Method using only global patterns (PABM-G); and 6) the proposed Principal Angle Boosting Method using both global and local patterns (PABM-GL). The linear subspace of each image-set was learned using principal component analysis, with the dimension chosen to retain 98% of the data energy. For CMSM, the dimension of the generalized difference subspace was empirically set to 95% of the full image dimension. For KMSM, a six-degree polynomial kernel was used as in [7]. For PABM-GL, we extract the local sub-patterns using a sliding rectangular window of size 8 × 8 pixels with a step of 4 pixels in both the vertical and horizontal directions.
• For the YaleB face database we used images of 38 subjects. First, 80 near-frontal images under different illuminations were selected per subject. The face regions were then cropped and resized to 24 × 21 pixels. The 80 images were divided into 10 image-sets of 8 images each.
• For the CMU-PIE face database we used images of 45 subjects; for each subject, 160 near-frontal images covering variations in facial expression and illumination were selected. The face regions were cropped and resized to 24 × 24 pixels. The 160 images were divided into 16 image-sets of 10 images each, with different illumination conditions.
• To further illustrate the performance of the proposed method, we collected a face video database of 20 subjects; for each subject, 10 separate sequences were recorded with large variations in illumination and facial expression. Each sequence contains about 180 frames, and the face region was extracted automatically using a face detector. The face regions were histogram-equalized and resized to 25 × 25 pixels.
For all three datasets, each face vector was normalized to unit norm. We then randomly selected 3, 4, 5, and 6 image-sets from each class for training and used the rest for testing. For each number of training sets, the random partition procedure was repeated 10 times and the average classification results were computed. The recognition rates for these experiments are shown in Figure 3. It can be seen that the proposed method using boosted global principal angles consistently outperforms MSM, KMSM, CMSM and DCC, and that the performance is further improved by integrating local principal angle boosting.
5 Conclusion

This paper presents a robust framework for image-set based face recognition using boosted global and local principal angles. The original multi-class classification problem is first transformed into a binary classification task in which the positive class is the principal angle based intra-class subspace “difference” and the negative one is the principal angle based inter-class subspace “difference”. The principal angles are computed not only globally for the whole pattern space but also locally for a set of partitioned sub-patterns. The discriminative power of each principal angle for the global and each local sub-pattern is explicitly exploited by learning a strong classifier in a boosting manner. Experiments on real-life data sets demonstrate the superior performance of the proposed method over previous state-of-the-art algorithms in terms of classification accuracy.
References
1. Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
2. Etemad, K., Chellappa, R.: Discriminant Analysis for Recognition of Human Face Images. Journal of the Optical Society of America A 14(8), 1724–1733 (1997)
3. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face Recognition by Independent Component Analysis. IEEE Trans. on Neural Networks 13(6), 1450–1464 (2002)
4. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian Face Recognition. Pattern Recognition 33(11), 1771–1782 (2000)
5. Shakhnarovich, G., Fisher, J.W., Darrell, T.: Face Recognition from Long-Term Observations. In: Proceedings of European Conference on Computer Vision, pp. 851–865 (2002)
6. Yamaguchi, O., Fukui, K., Maeda, K.: Face Recognition Using Temporal Image Sequence. In: International Conference on Automatic Face and Gesture Recognition, pp. 318–323 (1998)
7. Wolf, L., Shashua, A.: Learning over Sets Using Kernel Principal Angles. Journal of Machine Learning Research 4(10), 913–931 (2003)
8. Fukui, K., Yamaguchi, O.: Face Recognition Using Multi-Viewpoint Patterns for Robot Vision. In: 11th International Symposium of Robotics Research, pp. 192–201 (2003)
9. Fukui, K., Stenger, B., Yamaguchi, O.: A Framework for 3D Object Recognition Using the Kernel Constrained Mutual Subspace Method. In: Asian Conference on Computer Vision, Part I, pp. 315–324 (2006)
10. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6), 1005–1018 (2007)
11. Fan, W., Yeung, D.Y.: Locally Linear Models on Face Appearance Manifolds with Application to Dual-Subspace Based Classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2006)
12. Wang, R.P., Shan, S.G., Chen, X.L., Gao, W.: Manifold-Manifold Distance with Application to Face Recognition Based on Image Set. In: IEEE Conference on Computer Vision and Pattern Recognition (2008)
13. Hotelling, H.: Relations between Two Sets of Variates. Biometrika 28, 321–372 (1936)
14. Chen, S.C., Zhu, Y.L.: Subpattern-Based Principal Component Analysis. Pattern Recognition 37(1), 1081–1083 (2004)
15. Schapire, R.E., Singer, Y.: Improved Boosting Algorithms Using Confidence-Rated Predictions. Machine Learning 37, 297–336 (1999)
16. Sim, T., Baker, S., Bsat, M.: The CMU Pose, Illumination, and Expression Database. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12), 1615–1618 (2003)
17. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 643–660 (2001)
18. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002)
Incorporating Spatial Correlogram into Bag-of-Features Model for Scene Categorization

Yingbin Zheng, Hong Lu, Cheng Jin, and Xiangyang Xue

Shanghai Key Laboratory of Intelligent Information Processing,
School of Computer Science, Fudan University, China
{ybzh,honglu,jc,xyxue}@fudan.edu.cn
Abstract. This paper presents a novel approach to representing the codebook vocabulary in the bag-of-features model for scene categorization. The traditional bag-of-features model describes an image as a histogram of the occurrence rates of the codebook vocabulary. In our approach, the spatial correlogram between codewords is incorporated to approximate local geometric information. This works by augmenting the traditional vocabulary histogram with the distance distribution of pairwise interest regions. We also combine this correlogram representation with spatial pyramid matching to describe both local and global geometric correspondences. Experimental results show that the correlogram representation can outperform the histogram scheme for the bag-of-features model, and that the combination with spatial pyramid matching further improves categorization effectiveness.
1 Introduction
In this paper, we consider the representation of visual information, a fundamental problem in automatic scene categorization. A good representation should be discriminative with respect to inter-category variations and robust to intra-category variations. Given a collection of images, a common approach is to represent images by color, edge, texture or shape features, and then classify them into predefined categories, e.g., street, suburb, bedroom, etc. These representations are also useful in content-based multimedia analysis and retrieval applications [1][2][3]. A popular recognition scheme is to represent and match scenes with the bag-of-features model [4][5]. An image is represented as an orderless collection of local features. The SIFT feature [6] is widely used for its good performance in classification/categorization tasks. Once local features are obtained, an unsupervised clustering algorithm, typically k-means [7], is applied to quantize the continuous high-dimensional local feature space into a discrete codebook vocabulary. The bag-of-features model then corresponds to a histogram of the occurrence rates of particular vocabulary words in the given image. One drawback of the traditional bag-of-features model is its limited ability to describe spatial information, because an orderless feature collection contains neither the spatial layout nor the geometric correspondence of the features. Recent work on the bag-of-features model with spatial layout shows that the use of spatial information
can improve classification/categorization quality [8][9][10][11][12][13]. Specifically, a sequence of grids at different resolutions can be constructed for a given image, with each grid cell represented as a histogram of vocabulary words; these histograms, together with the pyramid match kernel [14], are used as features in the spatial matching scheme. Approaches based on spatial pyramid matching achieve significantly improved performance on scene categorization tasks [8]. Another strategy to overcome this drawback is to augment the traditional bag-of-features model with geometric information such as the pairwise relationships between neighboring local features [15][16][17]. However, the computational cost of these methods is high.

We propose a new spatial correlogram approach to represent geometric information and improve the bag-of-features model for visual categorization. Informally, the spatial correlogram of two codewords v_i and v_j in the vocabulary captures the distance distribution of the positions of the interest regions encoded as v_i and v_j, respectively. The highlights of this representation are: (i) it contains local geometric correspondence; (ii) with similar storage and computational cost, the categorization accuracy of the correlogram representation outperforms that of the histogram in the bag-of-features model.

The remainder of this paper is organized as follows. Section 2 reviews the related literature on the bag-of-features model. Section 3 presents our proposed spatial correlogram approach for scene categorization. Section 4 describes the experimental setup. In Sect. 5 we present our experimental results. Finally, we conclude our work in Sect. 6.
2 Related Work
The bag-of-features model has been studied for video retrieval [4] and image categorization [5]. Local features of an image are extracted and each is represented as a codeword according to the codebook vocabulary; the histogram of the vocabulary is then generated to describe the image. There have been many attempts to incorporate spatial layout or geometric correspondence into the basic bag-of-features model [13][8]. Specifically, Fei-Fei and Perona [13] used a Bayesian hierarchical model to learn natural scene categories; their algorithm was adapted from the Latent Dirichlet Allocation (LDA) model [18] to represent the codebook distribution of each category. Another improvement to the bag-of-features model focuses on spatial layout information. Lazebnik et al. [8] introduced spatial pyramid matching for scene categorization based on global geometric correspondence: the image is partitioned into subregions and a histogram of local features is calculated inside each subregion. Their experiments showed excellent performance on three diverse datasets. Besides global geometric correspondence, a more intuitive and efficient approach is to add pairwise relations between neighboring local features [4][16]. Sivic and Zisserman [4] considered spatial consistency, measured by requiring that neighboring matches in the query region lie in a surrounding area in the retrieved frame. In [16], “doublets”, a second vocabulary built on the spatial co-occurrences
of word pairs, are included to encode spatially local co-occurring regions for image segmentation. However, these methods incur large computational costs. The focus of this paper is a new representation that contains sufficient spatial co-occurrence information of the codebook vocabulary and can be computed efficiently.
3 Spatial Correlogram Approach
The histogram is a simple and powerful method for image description in computer vision. In the bag-of-features model, an image is first represented as an orderless collection of local features. Given a set of training images, an unsupervised clustering algorithm, typically k-means, is applied to generate the codebook vocabulary V = {v_1, v_2, ..., v_n}, where n is the size of the codebook. A codebook histogram is then obtained for each image, with each bin corresponding to a codeword v_i in V:

Hist(v_i) = \sum_{k=1}^{N} \begin{cases} 1, & \text{if } v_i = \arg\min_{v \in V} Dist(v, r_k) \\ 0, & \text{otherwise} \end{cases}    (1)
where r_k denotes the k-th interest region, N is the total number of interest regions in the image, and Dist(v, r_k) is the distance between the codeword v and the interest region r_k in the local feature space. The histogram approximates the global distribution of vocabulary frequency. One drawback of the histogram representation is its limited ability to describe spatial information. In the image indexing/retrieval area, Huang et al. [19] proposed the color correlogram feature to distil the spatial correlation of colors; this feature achieves better performance than the traditional color histogram scheme. Our strategy is inspired by the color correlogram approach and uses it to improve the bag-of-features model. We differ from their representation [19] by incorporating “neighboring” codewords in the vocabulary to reduce the dimensionality of the representation.

In a given image I, suppose p_k = (x_k, y_k) is the position of interest region r_k and v(r_k) is the corresponding codeword of r_k in the codebook. A table indexed by the codebook vocabulary and the position distance in the image is used to describe the distance distribution between interest region r_k and its neighbors:

T_k(v_j, d) = \frac{|\{r_i \in I \mid v(r_i) = v_j,\ \|p_k - p_i\| = d\}|}{|\{r_i \in I \mid v(r_i) = v_j\}|}    (2)

where r_i ∈ I means that interest region r_i is in image I, and ||p_k − p_i|| is the position distance between interest regions r_k and r_i. For convenience, the L_∞ norm is used to measure this distance, i.e., ||p_k − p_i|| = max{|x_k − x_i|, |y_k − y_i|}. To describe the overall distribution from codeword v_i to v_j over the whole image, we use:

T(v_i, v_j, d) = \sum_{v(r_k) = v_i} T_k(v_j, d)    (3)
Fig. 1. Toy example of the Corr scheme in an image with a codebook vocabulary V of two codewords and step s = 2. At the top, the Corr table indexed by vocabulary and step is calculated for each interest region. The tables of interest regions mapped to the same codeword are then accumulated to generate the representation.
Compared with the histogram representation (1), T(v_i, v_j, d) includes information about the distance distribution of the codeword positions. However, its storage and computational requirements are larger than for the histogram. Assume the image size is w × h, so the maximum possible position distance is D = max(w, h), and let the vocabulary size be n. Then O(n²D) space is required, whereas O(n) suffices for the histogram representation. Two measures are taken to reduce the storage and computational cost.

Instead of using the position distance directly to describe the pairwise relations, we divide the range from 0 to D into s “steps”, each represented by a distance interval I = [a, b). For all examples in this paper, we segment the range [0, D) as I_1 = [0, L), I_2 = [L, 2L), ..., I_s = [(s−1)L, D), where L = min(w, h)/s. Equation (3) can then be rewritten as:

Corr(v_i, v_j, I) = \sum_{d \in I} T(v_i, v_j, d) = \sum_{d \in I} \sum_{v(r_k) = v_i} T_k(v_j, d)    (4)
To distinguish this correlogram scheme from its variants below, we refer to Eq. (4) as the “Corr” scheme, while “correlogram approach” denotes both Corr and its variants. The storage space of Corr(v_i, v_j, I) is reduced to O(n²s). Moreover, Eq. (4) describes the spatial correlogram over a flexible range and avoids overfitting. An illustration of the Corr scheme is shown in Fig. 1. Furthermore, we exploit the discriminative ability of a single pairwise spatial correlogram. A simple and intuitive way is to capture spatial correlation between
identical codewords only. This self-correlogram approach (Self-Corr) is defined by the following formula:

SelfCorr(v_i, I) = Corr(v_i, v_i, I) = \sum_{d \in I} \sum_{v(r_k) = v_i} T_k(v_i, d)    (5)
The corresponding space of Self-Corr is O(ns), depending on the number of steps s; in fact, Self-Corr degenerates to the histogram representation when s = 1. Besides the Self-Corr approach, we augment the correlation by examining each codeword and its neighborhood in the original local feature space: for each codeword v_i, a subset V_i of neighboring codewords in the codebook is found using the k-nearest neighbor (kNN) algorithm, and only the codewords v_j ∈ V_i ∪ {v_i} are used to compute the spatial correlogram Corr(v_i, v_j, I) of Eq. (4); we call this kNN-Corr. In our experiments, we choose k = 1 or 3. This strategy is a trade-off between the Corr approach (4) and the Self-Corr approach (5): its storage and computational cost is lower than that of the Corr approach if k is not large, and it represents more spatial information about the distance distribution than the Self-Corr approach. A comparison of these approaches is given in Sect. 5.
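To make Eqs. (1)-(5) concrete, the following Python sketch hard-assigns regions to codewords and computes the full Corr tensor with a brute-force double loop; its diagonal gives Self-Corr, and restricting the second index to a codeword's kNN set gives kNN-Corr. Whether a region is paired with itself is a modeling choice of ours, and the helper names are not from the paper.

```python
import numpy as np

def assign_words(descriptors, codebook):
    """Hard-assign each interest region to its nearest codeword, as in Eq. (1)."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def spatial_correlogram(positions, words, n_words, img_size, s=2):
    """positions: (N, 2) pixel coordinates; words: (N,) codeword index per region.
    Returns the Corr tensor of Eq. (4); its diagonal gives Self-Corr (Eq. (5))."""
    w, h = img_size
    L = min(w, h) / s
    edges = [i * L for i in range(s)] + [max(w, h)]          # intervals I_1 .. I_s
    word_count = np.bincount(words, minlength=n_words).astype(float)
    corr = np.zeros((n_words, n_words, s))
    for k in range(len(positions)):
        dist = np.abs(positions - positions[k]).max(axis=1)  # L_inf distance
        step = np.clip(np.searchsorted(edges, dist, side="right") - 1, 0, s - 1)
        for i in range(len(positions)):
            if i == k:
                continue        # pairing a region with itself is excluded (our choice)
            corr[words[k], words[i], step[i]] += 1.0 / max(word_count[words[i]], 1.0)
    return corr
```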
4 Experimental Setup
We evaluate the correlogram approach on the scene dataset of Lazebnik et al. [8]. There are 15 natural scene categories with 200 to 400 images per category. For each category, 100 images are randomly selected for training and the rest for testing. We repeat this random selection 10 times and generate 10 pairs of training/testing subsets. Sample images of the dataset are shown in Fig. 2. The codebook implementation follows the general scheme for scene categorization based on the bag-of-features model. For each image in the dataset, a set of local features is first extracted; we follow the image feature settings of [8][10]. Dense sampling over image grids is applied because of its reported advantage for learning scene categories [13]. In our experiment, the SIFT descriptor [6] is used to describe the interest regions, given its good performance for object and scene recognition [21][22]. To generate the codebook vocabulary, k-means clustering is selected, being simple and widely used in previous work [4][5][8][13][22]; the performance difference between k-means and other clustering algorithms is not the focus of this paper. The vocabulary sizes we use are {16, 25, 50, 100, 200, 400, 800, 1600}. We refer to representations with at most 100 codewords as “weak features” and to the others as “strong features”, reflecting their different discriminative power. To obtain the categorization accuracy, a support vector machine (SVM) with a histogram intersection kernel is used; specifically, libSVM [23] is trained with the one-versus-one rule for multi-class classification. The categorization rate reported in the following section is the average categorization accuracy over the ten randomly selected testing subsets.
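The classification step could look like the following sketch. The paper uses libSVM directly; here we stand in scikit-learn's SVC (which wraps libsvm and uses one-vs-one multi-class training internally) with a precomputed histogram intersection kernel.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """A: (n, d), B: (m, d) histogram/correlogram features -> (n, m) kernel matrix."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def categorize(X_train, y_train, X_test):
    K_train = intersection_kernel(X_train, X_train)
    K_test = intersection_kernel(X_test, X_train)
    clf = SVC(kernel="precomputed")   # libsvm backend; multi-class via one-vs-one
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```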
Fig. 2. Sample images of the 15 natural scene categories. Eight of these (a-h) were originally collected by Oliva and Torralba [20], five (i-m) by Fei-Fei and Perona [13], and two (n-o) by Lazebnik et al. [8].
5 Experiments and Results
In this section, we compare the correlogram approach with the traditional histogram representation and report results on the 15 natural scene categories.

5.1 Experiment 1: Comparison of Correlogram and Histogram
We start with a comparison of the correlogram and histogram representations. As introduced in Sect. 3, we employ five representation strategies for weak features: histogram (1), Self-Corr (5), 1NN-Corr, 3NN-Corr, and Corr (4). For the four correlogram representations, the position distance is divided into 2 steps, i.e., two distance intervals per image. The storage and computational cost increase from the histogram to the Corr scheme. The experimental results are shown in Fig. 3. For weak features, the Corr scheme performs best among all the representations: relative to the baseline histogram scheme, the average categorization accuracy improves by 5.5%–7.4% for the Self-Corr approach and by 9.8%–15.3% for the Corr approach. In particular, the Corr approach with a 100-word vocabulary achieves an accuracy of 71.42%, which is comparable to the histogram scheme with a 1600-word vocabulary. However, the storage and computational cost of the Corr representation are extremely high. For strong features, we only investigate the Self-Corr, 1NN-Corr, and 3NN-Corr schemes, for efficiency. Although the gap between the histogram and Self-Corr results decreases as the codebook size increases, Self-Corr
Fig. 3. Evaluation of the bag-of-features model with histogram and correlogram representation strategies for both weak and strong features. The codebook size |V| ∈ {16, 25, 50, 100, 200, 400, 800, 1600}.

Table 1. Histogram and Self-Corr schemes with the same representation dimension

Repr. dim.   50    100   200   400   800   1600
Histogram    57.4  61.6  65.4  68.7  70.6  71.3
Self-Corr    59.7  63.9  67.1  69.0  70.6  71.4
can still outperform the histogram method with the same codebook size. The 1NN-Corr and 3NN-Corr schemes achieve better categorization accuracy than Self-Corr until the codebook size reaches 800; for codebook sizes larger than 800, their performance is not as good as Self-Corr. It is also interesting to compare the representations at the same storage cost. With a vocabulary of size n, the histogram and Self-Corr need O(n) and O(2n) space in this experiment, respectively. We therefore compare the histogram and Self-Corr schemes at the same representation dimension, e.g., histogram with n = 50 versus Self-Corr with n = 25, histogram with n = 100 versus Self-Corr with n = 50, and so on. In general, a larger codebook yields higher classification accuracy. Nevertheless, Table 1 shows that, at the same representation dimension, the Self-Corr representations perform as well as or better than the histogram; that is, incorporating spatial correlogram information into the bag-of-features model is comparable with the histogram scheme even though the vocabulary size for the correlogram method is half that of the histogram.

5.2 Experiment 2: Step in Correlogram Representation
In this subsection we focus on the impact of the step value, an important factor in our correlogram representation. To explore its role, we experiment with different steps s ∈ {1, 2, 4, 8, 12, 16} for the Self-Corr and kNN-Corr (k = 3) schemes, based on a weak feature (|V| = 16) and a strong feature (|V| = 200). Note that step s = 1 corresponds to the baseline histogram representation. The results in Fig. 4 show that increasing the step
Fig. 4. Correlogram schemes with different steps: (a) Self-Corr, (b) 3NN-Corr

Table 2. Categorization results of the bag-of-features model and the spatial pyramid with histogram and correlogram representations. For completeness, both single-level (the "Single" columns) and multi-level (the "Pyramid" columns) pyramids are listed in the table.

                  Weak feature (|V| = 16)                 Strong feature (|V| = 200)
            Histogram          Correlogram          Histogram          Correlogram
Level     Single  Pyramid    Single  Pyramid      Single  Pyramid    Single  Pyramid
  0        47.4     -         54.8     -           65.4     -         69.0     -
  1        56.7    58.5       61.8    66.0         71.5    71.9       71.7    73.3
  2        60.2    62.5       62.6    67.4         72.5    72.8       71.3    74.0
clearly improves the correlogram representation for both weak features and strong features. Specifically, the categorization accuracy increases quickly as the step value grows from 2 to 8. Note that increasing the step from s = 8 to s = 16 yields only a small performance gain for all the curves in Fig. 4, probably because the distance intervals become too small to capture the general distance distribution of the neighborhood.
5.3 Experiment 3: Incorporating the Spatial Pyramid
In this experiment we consider combining the correlogram approach with the spatial pyramid matching scheme [8]. Images are partitioned into subregions, and the histogram and Self-Corr (with step s = 2) representations over the codebook are computed inside each subregion. The maximum level in our experiment is 2, i.e., images are represented on 1 × 1, 2 × 2 and 4 × 4 grids. From the results shown in Table 2, we observe that multi-level spatial pyramid matching performs better than the single-level bag-of-features model for both the correlogram and histogram representations. For weak features, the histogram representation improves from single-level to multi-level pyramid by 1.8% at level 1 and 2.3% at level 2, while for the Self-Corr representation the improvements are 4.2% and 4.8%, respectively,
which is larger than for the histogram representation. This shows the effectiveness of combining Self-Corr and spatial pyramid matching for weak features. For strong features, the Self-Corr scheme fails to improve the single-level categorization rate when the level grows from 1 to 2, probably because images are over-subdivided under 4 × 4 grids for the correlogram. However, the multi-level accuracy still increases from 73.3% to 74.0%, and the same multi-level pattern as for weak features is observed: the correlogram performs better than the histogram at levels 1 and 2. Another observation is that, with the spatial pyramid, Self-Corr at level 1 achieves better performance than the histogram at level 2, for both weak and strong features. Both spatial pyramid matching and the spatial correlogram approximate the geometric correspondence of an image: spatial pyramid matching represents the spatial layout at global or subregion scale, while our spatial correlogram represents pairwise relations between codewords, or alternatively between neighboring local features in the original local feature space. A framework combining these two methods captures both global and local geometric correspondence and provides a powerful description of spatial information.
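The combination can be pictured with a short sketch of spatial pyramid pooling in the spirit of [8] (an editorial illustration, not the authors' implementation; the per-cell descriptor function is left abstract so that either a histogram or a Self-Corr vector can be plugged in).

    import numpy as np

    def spatial_pyramid(labels, positions, width, height, cell_descriptor, max_level=2):
        # Concatenate per-cell descriptors over 1x1, 2x2 and 4x4 grids (levels 0..2),
        # using the standard pyramid weights 1/4, 1/4, 1/2 from Lazebnik et al. [8].
        parts = []
        for level in range(max_level + 1):
            cells = 2 ** level
            weight = (1.0 / 2 ** max_level if level == 0
                      else 1.0 / 2 ** (max_level - level + 1))
            for i in range(cells):
                for j in range(cells):
                    in_cell = ((positions[:, 0] >= j * width / cells) &
                               (positions[:, 0] <  (j + 1) * width / cells) &
                               (positions[:, 1] >= i * height / cells) &
                               (positions[:, 1] <  (i + 1) * height / cells))
                    parts.append(weight * cell_descriptor(labels[in_cell],
                                                          positions[in_cell]))
        return np.concatenate(parts)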
6 Conclusion
In this paper, we propose the spatial correlogram representation to incorporate geometric information and improve the bag-of-features model for scene categorization. The histogram representation of the bag-of-features model is replaced by pairwise relationships between codebook vocabulary entries, which are explicitly extracted from the distance distribution of pairwise interest regions. Two variants are considered in order to reduce the storage and computational cost: the self-correlogram scheme captures spatial correlation between identical codewords only, and the kNN-correlogram scheme exploits the discriminative ability of the correlation between a codeword and its neighbors in the original local feature space. Experiments on 15 natural scene categories show that the correlogram representation can outperform the traditional histogram scheme with the same codebook size or the same representation dimension in the bag-of-features model. In addition, we examine the combination of the correlogram representation with spatial pyramid matching. These methods have different but complementary abilities to express the spatial information of an image, and the experiments show that their combination achieves a powerful description of spatial information for scene categorization.
Acknowledgements This work is supported by the Natural Science Foundation of China (No. 60873178 and 60875003), the National Science and Technology Pillar Program of China (No. 2007BAH09B03) and the Shanghai Municipal R&D Foundation (No. 08dz1500109).
References 1. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40(2), 1–60 (2008) 2. Vogel, J., Schiele, B.: Semantic modeling of natural scenes for content-based image retrieval. IJCV 72(2), 133–157 (2007) 3. Li, J., Wang, J.Z.: Real-time computerized annotation of pictures. IEEE Trans. PAMI 30(6), 985–1002 (2008) 4. Sivic, J., Zisserman, A.: Video google: a text retrieval approach to object matching in videos. In: ICCV, pp. 1470–1477 (2003) 5. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22 (2004) 6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004) 7. Hartigan, J.A., Wong, M.A.: A K-means clustering algorithm. Applied Statistics 28, 100–108 (1979) 8. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, pp. 2169–2178 (2006) 9. van de Sande, K.E., Gevers, T., Snoek, C.G.: A comparison of color features for visual concept classification. In: CIVR, pp. 141–150 (2008) 10. van Gemert, J., Geusebroek, J.M., Veenman, C.J., Smeulders, A.W.M.: Kernel codebooks for scene categorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 696–709. Springer, Heidelberg (2008) 11. Liu, X., Wang, D., Li, J., Zhang, B.: The feature and spatial covariant kernel: adding implicit spatial constraints to histogram. In: CIVR, pp. 565–572 (2007) 12. Battiato, S., Farinella, G., Gallo, G., Ravi, D.: Scene categorization using bag of textons on spatial hierarchy. In: ICIP, pp. 2536–2539 (2008) 13. Li, F.F., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: CVPR, pp. 524–531 (2005) 14. Grauman, K., Darrell, T.: The pyramid match kernel: Efficient learning with sets of features. JMLR 8, 725–760 (2007) 15. Berg, A.C., Berg, T.L., Malik, J.: Shape matching and object recognition using low distortion correspondences. In: CVPR, pp. 26–33 (2005) 16. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their location in images. In: ICCV, pp. 370–377 (2005) 17. Lazebnik, S., Schmid, C., Ponce, J.: A maximum entropy framework for part-based texture and object recognition. In: ICCV, pp. 832–838 (2005) 18. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. JMLR 3, 993–1022 (2003) 19. Huang, J., Kumar, S.R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. In: CVPR, pp. 762–768 (1997) 20. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42(3), 145–175 (2001) 21. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. PAMI 27(10), 1615–1630 (2005) 22. van de Sande, K., Gevers, T., Snoek, C.: Evaluation of color descriptors for object and scene recognition. In: CVPR, pp. 1–8 (2008) 23. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~ cjlin/libsvm
Human Action Recognition under Log-Euclidean Riemannian Metric

Chunfeng Yuan1, Weiming Hu1, Xi Li1, Stephen Maybank2, Guan Luo1

1 National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China
{cfyuan,wmhu,lixi,gluo}@nlpr.ia.ac.cn
2 School of Computer Science and Information Systems, Birkbeck College, London, UK
[email protected]
Abstract. This paper presents a new action recognition approach based on local spatio-temporal features. The main contributions of our approach are twofold. First, a new local spatio-temporal feature is proposed to represent the cuboids detected in video sequences. Specifically, the descriptor utilizes the covariance matrix to capture the self-correlation information of the low-level features within each cuboid. Since covariance matrices do not lie in a Euclidean space, the Log-Euclidean Riemannian metric is used to measure distances between them. Second, the Earth Mover's Distance (EMD) is used to match pairs of video sequences. In contrast to the widely used Euclidean distance, the EMD is more robust when matching histograms/distributions of different sizes. Experimental results on two datasets demonstrate the effectiveness of the proposed approach. Keywords: Action recognition, Spatio-temporal descriptor, Log-Euclidean Riemannian metric, EMD
1 Introduction Human action recognition is an important but challenging task in computer vision. It has many potential applications, such as intelligent surveillance, video indexing and browsing, and human-computer interfaces. However, there are many difficulties in human action recognition, including geometric variations between intra-class objects or actions, as well as changes in scale, rotation, viewpoint, illumination and occlusion. In recent years, a number of approaches have been proposed to fulfill the action recognition task. Among them, bag of visual words (BOVW) approaches are very popular, due to their simple implementation, low cost, and good reliability. By fully exploiting local spatio-temporal features, BOVW approaches are more robust to noise, occlusion, and geometric variation than other approaches. In this paper, we propose a BOVW based framework for human action recognition. The framework makes the following two contributions. First, a novel local spatio-temporal descriptor under the Log-Euclidean Riemannian metric [1, 10] is proposed for human action recognition; to the best of our knowledge, this metric is applied to human action recognition for the first time. Compared with several popular descriptors in action recognition, our descriptor has the advantages of high discrimination and low
computational cost. Second, we employ the EMD [13] to match pairs of video sequences. Several desirable properties of the EMD ensure that it is more suitable for action recognition than many of the other histogram matching measures. The remainder of the paper is organized as follows. Section 2 gives a review of BOVW approaches. Section 3 discusses the proposed framework for human action recognition, including the new descriptor based on the Log-Euclidean Riemannian metric and classification based on Earth Mover’s Distance. Section 4 reports experimental results on two human action datasets. Section 5 concludes the paper.
2 Related Work Inspired by the bag of words (BOW) approaches used in text retrieval, several state-of-the-art approaches [2, 3, 4, 5, 17, 19] adopt the BOVW strategy for action recognition. Typically, BOVW based approaches proceed through the following steps: patch extraction, patch description, histogram computation, and histogram classification. For each step, many algorithms have been proposed. The following is a brief introduction to these four steps. In the patch extraction step, Laptev [2] first extends the notion of Harris spatial interest points into the spatio-temporal domain. In [3], Laptev detects more interest points at multiple spatio-temporal scales. Niebles et al. [4] use separable linear filters to extract features. Dollár et al. [7] improve the 3D Harris detector and apply Gabor filtering to the temporal domain. Wong et al. [8] propose a detector based on global information and run experiments with four different detectors on the KTH dataset. Their results indicate that Dollár et al.'s detector achieves better recognition accuracy, Laptev's detector gives too few interest points, and the saliency detector [11] produces many points that are not discriminative enough. Consequently, our framework employs Dollár et al.'s detector. Patch description plays a fundamental role in that it directly determines the recognition performance. Usually, the patch is represented as a cuboid which, in contrast to a 2-D block, includes spatio-temporal information. In fact, several local spatial descriptors used for images [9] have been extended into the spatio-temporal domain to form cuboid descriptors, which extract feature vectors from a given cuboid. Typically, there are three kinds of feature extraction methods for the cuboid. (1) The simplest method is to concatenate all points in the cuboid in turn [4, 8]; however, this is sensitive to small perturbations inside the cuboid and is usually too high-dimensional to be used directly. (2) Compute a histogram of features in the cuboid (e.g., the histogram of gradient values (HOG) [3]); such methods are robust to perturbations but ignore all positional information. (3) The cuboid is divided into several sub-cuboids, and a histogram is computed for each sub-cuboid separately [7, 14]; this local histogram, such as SIFT in [14], is a tradeoff between the former two kinds of methods. In comparison, our descriptor is significantly different from the above three kinds of methods, for it utilizes the statistical properties of the cuboids under the Log-Euclidean Riemannian metric. Several classifiers are used in the last step, histogram matching and classification. Niebles et al. [4] use latent topic models such as the Probabilistic Latent Semantic Analysis (PLSA) model and the Latent Dirichlet Allocation (LDA) model. Histogram
features of the training or testing samples are concatenated to form a co-occurrence matrix as the input of PLSA and LDA. Schuldt et al. [2] use the Support Vector Machine (SVM) for classification. Lucena et al. [6] and Dollár et al. [7] use a Nearest Neighbor Classifier (NNC) to classify videos.
Fig. 1. Flowchart of the proposed framework
3 Our Action Recognition Framework 3.1 Overview The proposed action recognition framework directly handles unsegmented input image sequences to recognize low-level actions such as walking, running, or hand clapping. Note that no preprocessing is needed in our recognition system. In contrast, the methods in [12, 18, 20] share the limitation that a figure-centric spatio-temporal volume or silhouette of each person must be obtained and adjusted to a fixed size beforehand, and object segmentation and tracking are hard tasks in computer vision. Fig. 1 shows the flowchart of the framework. First, we employ Dollár et al.'s detector [7] to detect cuboids at each frame. Subsequently, a new descriptor is proposed to extract effective features from the cuboids. The features from the training videos, represented under the Log-Euclidean Riemannian metric, are then quantized to form an appearance codebook (i.e., the BOVW) using the k-means clustering method. Each video sample is eventually represented as a histogram over the BOVW. In the testing phase, the test video is also represented as a histogram over the BOVW and then classified according to histogram matching between the test video and the training videos. Specifically, the EMD is employed to match each video pair instead of the Euclidean distance, and the test video is classified according to the nearest neighbor criterion. Sections 3.2 and 3.3 describe our descriptor and the EMD based histogram matching in detail. 3.2 The Descriptor A good descriptor for action recognition should satisfy several requirements: (a) scale invariance; (b) camera viewpoint invariance; (c) rotation invariance; (d) robustness to
partial occlusion; (e) insensitivity to illumination change; (f) tolerance to large geometric variations between intra-class samples. Motivated by this, our novel descriptor is based on the Log-Euclidean Riemannian metric and differs greatly from previous methods; it provides a new fusion mechanism for the low-level features in the cuboid. The construction of this descriptor involves two steps: computing the covariance matrix of low-level image features and applying the Log-Euclidean mapping to the covariance matrix. Thanks to the covariance matrix, our descriptor has properties (c) and (e). In addition, it is also scale invariant to a certain extent, since the cuboid is obtained from the video according to its scale. A discussion of requirement (f) is given in Section 4. 3.2.1 Covariance Matrix Computation The low-level features are first extracted from the cuboids. Let s be a pixel in a cuboid; all the points in the cuboid form a point set S = {s1, s2, …, sN}, where N is the number of points. Three sorts of low-level information are extracted at each point si in the cuboid, so pixel si is represented by an 8-D feature vector li = (x, y, t, fx, fy, ft, vx, vy), where (x, y, t) is the positional vector, (fx, fy, ft) is the gradient vector, and (vx, vy) is the optical flow vector. As a result, the cuboid is represented by a set of 8-D feature vectors L = {l1, l2, …, lN}, with total dimension 8×N (usually N is several thousand). Because of the high dimension of the cuboid feature L, it is necessary to transform L into a more compact form. We utilize the covariance matrix to characterize L, so the cuboid is represented as an 8×8 covariance matrix:
C = \frac{1}{N-1} \sum_{i=1}^{N} (l_i - u)(l_i - u)^T    (1)

where u is the mean of the vectors in L. The dimension of the cuboid feature is therefore reduced from 8×N to 8×8. Besides, the covariance matrix can be computed easily and quickly. The covariance matrix (1) reflects the second-order statistical properties of the elements of the vectors li. Moreover, the covariance matrix does not retain any information regarding the ordering or the number of its vectors, which leads to a certain scale and rotation invariance. Further, it is proved in [16] that large rotations and illumination changes are also absorbed by the covariance matrix. 3.2.2 Riemannian Geometry for Covariance Matrices The covariance matrix is a symmetric nonnegative definite matrix; in our case, it is usually a symmetric positive definite (SPD) matrix. However, it does not lie in a Euclidean space, so it is necessary to find a proper metric for measuring the distance between two covariance matrices. Recently, the Log-Euclidean Riemannian metric [1] was proposed for SPD matrices. Under this metric, distances between SPD matrices take a very simple form, and we therefore employ it to measure the distance between two covariance matrices. The following is a brief introduction to the Log-Euclidean Riemannian metric. Given an n×n covariance matrix C, the singular value decomposition (SVD) of C is denoted as UΣU^T, where Σ = diag(λ1, λ2, …, λn) is the diagonal matrix of the
eigenvalues, and U is an orthonormal matrix. The matrix logarithm log(C) is then defined as

\log(C) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} (C - I_n)^k = U \cdot \mathrm{diag}(\log(\lambda_1), \log(\lambda_2), \ldots, \log(\lambda_n)) \cdot U^T    (2)

where In is the n×n identity matrix. Under the Log-Euclidean Riemannian metric, the distance between two covariance matrices A and B can be easily calculated as ||log(A) − log(B)||. Compared with the widely used affine-invariant Riemannian metric, the Log-Euclidean Riemannian metric has a much simpler distance measure; moreover, the Log-Euclidean mean can be computed approximately 20 times faster than the affine-invariant Riemannian mean. More details of these two metrics can be found in [1, 10]. Therefore, following the above two steps, each cuboid is represented as a low-level feature covariance matrix under the Log-Euclidean Riemannian metric. It has the advantages of low computational complexity and high discrimination.
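To make the two steps of Sections 3.2.1 and 3.2.2 concrete, the sketch below (an editorial illustration in Python/NumPy, not the authors' code; the gradient and optical-flow fields are assumed to be supplied by a standard estimator) builds the 8×8 covariance descriptor of a cuboid and measures the distance between two descriptors under the Log-Euclidean metric.

    import numpy as np

    def cuboid_covariance(coords, grads, flows):
        # coords: N x 3 (x, y, t); grads: N x 3 (fx, fy, ft); flows: N x 2 (vx, vy).
        L = np.hstack([coords, grads, flows]).astype(float)   # N x 8 feature vectors
        u = L.mean(axis=0)
        diff = L - u
        return diff.T @ diff / (len(L) - 1)                   # 8 x 8 covariance, Eq. (1)

    def log_spd(C, eps=1e-8):
        # Matrix logarithm of an SPD matrix via eigendecomposition, as in Eq. (2).
        w, U = np.linalg.eigh(C)
        return U @ np.diag(np.log(np.maximum(w, eps))) @ U.T

    def log_euclidean_distance(A, B):
        # Distance between two covariance descriptors under the Log-Euclidean metric.
        return np.linalg.norm(log_spd(A) - log_spd(B), ord="fro")

In practice each descriptor can be vectorized as log(C) before k-means clustering, since the Log-Euclidean distance then reduces to an ordinary Euclidean distance between the vectorized matrices.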
3.3 Video Classification Based on EMD

It has been reported that the Earth Mover's Distance (EMD) achieves better performance for image retrieval than several common histogram dissimilarity measures [15]. Following these observations, we employ the EMD to match pairs of video sequences in our action recognition framework. The EMD, proposed by Rubner et al. [13], is the minimal amount of work that must be performed to transform one distribution into the other by moving "distribution mass" around. Here the distribution is
called a signature. Let P = {(p1, ωp1), …, (pm, ωpm)} be the first signature with m clusters, where pi is a cluster prototype and ωpi is the weight of the cluster; let Q = {(q1, ωq1), …, (qn, ωqn)} be the second signature with n clusters; and let D = (dij)m×n denote the ground distance matrix, where dij is the ground distance between clusters pi and qj. We want to find a flow F = (fij)m×n that minimizes the overall cost

\mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}    (3)

where fij (1 ≤ i ≤ m, 1 ≤ j ≤ n) is the flow between pi and qj. Once the transportation problem is solved and the optimal flow F has been found, the Earth Mover's Distance is defined as the work normalized by the total flow:

\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}    (4)
In our recognition framework, the specific procedure of the EMD based histogram matching is listed in Table 1. We first calculate a ground distance matrix Dall of all visual words, in order to avoid computing the ground distance between visual words repeatedly. The L2-norm distance is used as the ground distance, and therefore the ground distance matrix is the covariance matrix of all the visual words.
Table 1. The procedure of the EMD based histogram matching

Input: the visual-word histograms of the testing and training videos, Htest and Htrain, and the ground distance matrix of all visual words, Dall.
Output: the action classes of the testing videos.
Algorithm:
1. Look up Dall to form the ground distance matrix D between the testing and training videos.
2. Obtain the weight vector of each video by computing the percentage of each of its words relative to its total number of words.
3. Solve the program (3) to obtain the optimal flow F.
4. Compute the EMD between the testing and training videos by (4).
5. Classify the testing video by the nearest neighbor criterion.
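The procedure in Table 1 can be sketched as follows (an illustrative Python/SciPy implementation written for this note, not the authors' code; it solves the transportation problem of Eq. (3) as a linear program and returns the normalized cost of Eq. (4), assuming both weight vectors have been normalized to sum to one as in step 2).

    import numpy as np
    from scipy.optimize import linprog

    def emd(w_p, w_q, D):
        # w_p: m cluster weights of the first signature, w_q: n weights of the second,
        # D: m x n ground distance matrix between cluster prototypes (NumPy arrays).
        m, n = D.shape
        c = D.flatten()                                   # cost of each flow variable f_ij
        A_eq, b_eq = [], []
        for i in range(m):                                # sum_j f_ij = w_p[i]
            row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
            A_eq.append(row); b_eq.append(w_p[i])
        for j in range(n):                                # sum_i f_ij = w_q[j]
            row = np.zeros(m * n); row[j::n] = 1
            A_eq.append(row); b_eq.append(w_q[j])
        res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=(0, None), method="highs")
        flow = res.x
        return float(c @ flow) / float(flow.sum())        # Eq. (4): work / total flow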
Theoretically, the EMD has several good properties that make it more robust for action recognition than other histogram matching techniques. First, it tolerates a certain amount of feature deformation in the feature space, and there are no quantization problems caused by rigid binning. Small changes in the total number of clustered visual words do not affect the result drastically, so it is unnecessary to cluster the visual words very accurately. Second, it allows partial matching. Since the detector operates directly on the input videos in our framework, it is inevitable that some of the cuboids come from the background, and partial matching is able to handle this disturbance. Third, it can be applied to distributions/signatures of different sizes, leading to better storage utilization. The numbers of visual words occurring in different video samples vary widely; for example, the number of visual words occurring in a 'skip' action video is 50, but in a 'bend' action video it is 10. In summary, the EMD can improve the classification performance for action recognition due to its robustness. In comparison, bin-to-bin histogram dissimilarity measures, such as the Euclidean distance, are very sensitive to the bin size and the positions of the bin boundaries.
4 Experiments As illustrated in Fig. 2, we test our approach on two multi-pose and multi-scale human action datasets: the Weizmann dataset and the KTH dataset, which have been used by many authors [3, 4, 12] recently. We perform leave-one-out cross-validation for the performance evaluation. Moreover, there is no overlap between the training set and the testing set. The Weizmann human action dataset contains 10 different actions; there are 93 samples in total, performed by 9 subjects. All the experiments on this dataset use the videos of the first five persons to produce the bag of visual words. In each run, 8 actors' videos are used as the training set and the remaining person's videos as the testing set, so the results are the average over 9 runs. The KTH video database contains six types of human actions performed by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. There are 599 sequences in total. The videos of the first two persons are used to produce the bag of visual words. In each run,
Fig. 2. Representative frames from videos in the two datasets: Row 1 is sampled from the Weizmann dataset, Row 2 from the KTH dataset
24 actors' videos are used as the training set and the remaining person's videos as the testing set. The results are the average over 25 runs. We divide the experiments into two groups. The first group compares our descriptor with three other typical descriptors: Laptev's spatio-temporal jets [2], PCA-SIFT [7, 14], and the histogram of oriented gradients (HoG) [3]. The second group of experiments aims to validate that the EMD is more robust than other histogram dissimilarity measures. 4.1 Descriptor Comparison
We compare our descriptor with three other typical descriptors. Specifically, Laptev's spatio-temporal jet is a 34-dimensional gradient vector l = (Lx, Ly, Lt, Lxx, …, Ltttt), where L is the convolution of the original frame with an anisotropic Gaussian kernel with independent spatial and temporal variances. The PCA-SIFT descriptor applies Principal Component Analysis (PCA) to the normalized gradient vector formed by flattening the horizontal and vertical gradients of all the points in the cuboid. HoG is obtained by computing the normalized gradient histogram of all the points in the cuboid. The same experimental configuration is adopted for all descriptors, and the EMD based classifier is employed.
Fig. 3. Comparisons between four descriptors for each action on the two datasets. The left is on the Weizmann dataset, and the right is on the KTH dataset.
Fig. 3 (a) and (b) show the classification results of the four descriptors on the two datasets. It is clear that the average recognition accuracy of our descriptor is more than 10% higher than that of the others on both datasets. More specifically, it achieves the best recognition performance for nine actions on the Weizmann dataset and four actions on the KTH dataset. 4.2 EMD Based Classification vs. the Euclidean Distance Based Classification
The second group of experiments aims to demonstrate the robustness of the EMD versus other histogram dissimilarity measures.

Table 2. Comparison between the Earth Mover's Distance based classification and the Euclidean distance based classification on the Weizmann and KTH databases. The average recognition accuracies are shown.

                     ST jets   PCA-SIFT   HoG      Our
Weizmann   ED        0.5000    0.6000     0.6000   0.6778
           EMD       0.7111    0.7667     0.7444   0.9000
KTH        ED        0.6424    0.6094     0.6354   0.6840
           EMD       0.6753    0.6424     0.6076   0.7969
Table 3. The confusion matrices of our approach. The top one is on the KTH action dataset; the bottom one is on the Weizmann dataset.

            box    handclap   handwave   jog    run    Walk
box         .76    .06        .13        .00    .05    .00
handclap    .05    .75        .19        .00    .01    .00
handwave    .14    .21        .63        .01    .01    .00
jog         .00    .00        .02        .90    .05    .03
run         .02    .00        .00        .05    .83    .10
Walk        .00    .00        .01        .05    .02    .92

          bend   Jack   Jump   Pjump   run    side   Skip   Walk   Wave1   Wave2
bend      1.00   .00    .00    .00     .00    .00    .00    .00    .00     .00
Jack      .00    1.00   .00    .00     .00    .00    .00    .00    .00     .00
Jump      .00    .00    .67    .00     .08    .00    .25    .00    .00     .00
Pjump     .00    .00    .00    1.00    .00    .00    .00    .00    .00     .00
run       .00    .00    .00    .00     .80    .00    .20    .00    .00     .00
side      .00    .00    .00    .00     .00    1.00   .00    .00    .00     .00
Skip      .00    .00    .17    .00     .00    .00    .67    .17    .00     .00
Walk      .00    .00    .00    .00     .00    .11    .00    .89    .00     .00
Wave1     .00    .00    .00    .00     .00    .00    .00    .00    1.00    .00
Wave2     .00    .00    .00    .00     .00    .00    .00    .00    .00     1.00
Fig. 4. Recognition accuracy obtained by the proposed framework vs. vocabulary size
Lucena et al. [6] and Dollár et al. [7] measure the dissimilarity between videos using the Euclidean distance between histograms, and then assign the test sequence the class label of the nearest training video. We compare the EMD based approach with this Euclidean distance based approach [6, 7]; in this group of experiments, the experimental configuration for the Euclidean distance based approach is the same as ours. Table 2 reports the results of the two classification approaches on the two datasets. For the three baseline descriptors, the recognition accuracies of the EMD based approach generally exceed those of the Euclidean distance based approach. For the KTH dataset, the EMD based approach is on average 3.8% higher than the Euclidean distance based approach. For the Weizmann dataset, the average recognition accuracy of our descriptor is about 10% higher than that of the other descriptors. More importantly, our EMD based descriptor achieves the best recognition accuracies on both datasets. Table 3 shows the confusion matrices of our approach on the Weizmann and KTH datasets. From the confusion matrix of the Weizmann dataset, it can be seen that our approach works much better on actions with large movements, but is somewhat confused between actions with small differences. The recognition accuracies for the actions with large movements reach 100%, such as "bend", "Jack", "Pjump", "side", "wave1", and "wave2". From the confusion matrix of the KTH dataset, we can see that the "hand" related actions ("boxing", "handclapping", and "handwaving") are a little confused with each other. One possible reason is that our cuboids are insufficient for representing the details of the action: in our approach, about 40 cuboids are extracted in each video and the BOVW consists of 300 visual words, whereas in [21] 200 cuboids are extracted in each video and the BOVW contains 1000 visual words. Finally, we evaluate the influence of the number of visual words in the BOVW on the recognition accuracy using the Weizmann dataset, as illustrated in Fig. 4. When the number of visual words is more than 300, the recognition accuracy fluctuates between 80% and 90%, so the dependency of the recognition accuracy on the vocabulary size is not very strong.
5 Conclusion In this paper, we have developed a framework for recognizing low-level actions from input video sequences. In our recognition framework, the covariance matrix of the low-level features in each cuboid, taken under the Log-Euclidean Riemannian metric, is used to represent the video sequence. The descriptor is compact, distinctive, and has low computational complexity. Moreover, we have employed the EMD to measure the dissimilarity between videos instead of the traditional Euclidean distance between histograms. Experiments on two datasets have demonstrated the effectiveness and robustness of the proposed framework.
Acknowledgment This work is partly supported by NSFC (Grant No. 60825204, 60672040, 60705003) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453, 2009AA01Z318).
References 1. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric Means in a Novel Vector Space Structure on Symmetric Positive-Definite Matrices. SIAM J. Matrix Anal. Appl., 328–347 (2007) 2. Schuldt, C., Laptev, I., Caputo, B.: Recognizing Human Actions: A Local SVM Approach. In: ICPR, pp. 32–36 (2004) 3. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008) 4. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised Learning of Human Action Categories Using Spatial Temporal Words. In: IJCV, pp. 299–318 (2008) 5. Yan, K., Sukthankar, R., Hebert, M.: Efficient Visual Event Detection using Volumetric Features. In: ICCV, pp. 166–173 (2005) 6. Lucena, M.J., Fuertes, J.M., Blanca, N.P.: Human Motion Characterization Using Spatio-temporal Features. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds.) IbPRIA 2007. LNCS, vol. 4477, pp. 72–79. Springer, Heidelberg (2007) 7. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition Via Sparse spatiotemporal Features. In: 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005) 8. Wong, S., Cipolla, R.: Extracting Spatiotemporal Interest Points using Global Information. In: ICCV, pp. 1–8 (2007) 9. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. PAMI 27(10), 615–1630 (2005) 10. Li, X., Hu, W., Zhang, Z., Zhang, X., Zhu, M., Cheng, J.: Visual Tracking Via Incremental Log-Euclidean Riemannian Subspace Learning. In: CVPR (2008) 11. Kadir, T., Zisserman, A., Brady, M.: An Affine Invariant Salient Region Detector. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 228–241. Springer, Heidelberg (2004) 12. Fathi, A., Mori, G.: Action Recognition by Learning Mid-level Motion Features. In: CVPR (2008)
13. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: ICCV, pp. 59–66 (1998) 14. Yan, K., Sukthankar, R.: PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In: CVPR, pp. 506–513 (2004) 15. Rubner, Y., Tomasi, C., Guibas, L.J.: The Earth Mover’s Distance as a Metric for Image Retrieval. IJCV 40(2), 99–121 (2000) 16. Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection and Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006) 17. Liu, J., Ali, S., Shah, M.: Recognizing Human Actions Using Multiple Features. In: CVPR (2008) 18. Jia, K., Yeung, D.: Human Action Recognition using Local Spatio-Temporal Discriminant Embedding. In: CVPR (2008) 19. Perronnin, F.: Universal and Adapted Vocabularies for Generic Visual Categorization. PAMI 30(7), 1243–1256 (2008) 20. Wang, L., Suter, D.: Recognizing Human Activities from Silhouettes: Motion Subspace and Factorial Discriminative Graphical Model. In: CVPR (2007) 21. Liu, J., Shah, M.: Learning Human Actions via Information Maximazation. In: CVPR (2008)
Clustering-Based Descriptors for Fingerprint Indexing and Fast Retrieval

Shihua He1, Chao Zhang1, and Pengwei Hao1,2

1 Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing 100871, China
2 Dept. of Computer Science, Queen Mary University of London, London E1 4NS, UK
[email protected]
Abstract. This paper addresses the problem of fast fingerprint retrieval in a large database using clustering-based descriptors. Most current fingerprint indexing frameworks utilize global textures and minutiae structures. Extending existing feature extraction methods, previous work focusing on SIFT features has yielded high performance. In our work, other local descriptors such as SURF and DAISY are studied and their performance is compared. A clustering method is used to partition the descriptors into groups to speed up retrieval, and PCA is used to reduce the dimensionality of the cluster prototypes before selecting the prototype closest to an input descriptor. In the index construction phase, locality-sensitive hashing (LSH) is implemented for each descriptor cluster so that similarity queries can be answered efficiently within a small fraction of the cluster. Experiments on public fingerprint databases show that the performance suffers little while the retrieval speed is improved considerably using clustering-based SURF descriptors. Keywords: fingerprint indexing; fingerprint retrieval; local descriptors; clustering.
1 Introduction In the fingerprint identification mode, the user inputs a fingerprint and the system identifies the potential corresponding ones in the database. Fingerprint identification can therefore be performed by searching the database for matches at a coarse level and then checking the candidates one by one. Many methods have been developed to narrow the search space without much loss of accuracy, i.e., to trade off speed against accuracy. They fall into two categories: classification and indexing [1]. A number of classification approaches assign fingerprints to the five human-predefined Henry classes [2]. Since fingerprints are unevenly distributed among these five exclusive classes, such classification is neither effective nor efficient for an identification system. Fingerprint indexing algorithms, which are closely related to continuous classification, perform better than exclusive classification by selecting the most probable candidates and sorting them by their similarity to the input [3][4][5]. Continuous classification was proposed to deal with the problem of exclusive classification by representing fingerprints with feature vectors [6]. The search is performed by computing the distances between
the query and each template and outputting the closest ones. These algorithms are based on two types of features: global features and local features. Global features represent the global pattern of ridges and valleys and the orientation field; such methods usually detect a core point and then estimate features in its neighborhood. The methods in [3] and [4] belong to this category. Yet such approaches are not good at handling partialness and non-linear distortion, two major problems in the representation of fingerprint features, because they do not differentiate local textures. In contrast, local features are believed to be robust to partial and distorted prints because they represent stable local structures. In [5], triplets of minutiae and accessory information are used in the indexing procedure. Indeed, local features are usually represented by a description of the neighborhood texture around interest points, and two conditions have to be met [7][8]. First, interest points should be stably detected under different viewing conditions, especially scaling and rotation. Second, a distinctive local descriptor should be computed independently of the interest point detection and be appropriate for characterizing the regions. However, the retrieval performance may suffer from the small number of minutiae points in partial fingerprints. Recently proposed local descriptors extend the existing technology of feature extraction, including the scale invariant feature transform (SIFT) [11], the speeded-up robust features (SURF) [8] and DAISY [9]. For fingerprint retrieval, Shuai et al. [10] have incorporated SIFT into the indexing scheme and obtained better performance than with other features. They use composite local features to overcome the poor performance caused by partial prints [12], and the SIFT features are pruned since a large number of keypoints can reduce an efficient index structure to a sequential search [13]. Different from their method, our work considers interest point detectors and descriptors independently to extend the feature extraction framework. SURF and DAISY features are substituted for SIFT features, and SURF-128 proves to be the most suitable one for our task. However, the point detectors of SIFT and SURF usually generate hundreds of interest points. Even after pruning to a small set [13], the quantity and dimensionality of the descriptors over the whole database are still large and reduce the search speed significantly. The locality-sensitive hashing (LSH) algorithm [14] is used to address the curse of dimensionality by searching for an approximate answer instead of an exact one. As proposed in [10], all descriptors in the database are hashed, which is still not fast enough. In our work, descriptors are clustered to further narrow the search space; this technique is very much like the bag-of-words model applied to visual categorization [15][16]. The K-means clustering method is used, after which the LSH approach is applied to each cluster to index its descriptors. Furthermore, principal component analysis (PCA) is applied to the cluster prototypes to generate a subspace. In the on-line mode, the dimensionality of an input descriptor is reduced by projecting it onto this subspace before selecting the prototype closest to it. In this way the search speed is improved greatly, giving a good tradeoff between speed and accuracy. The remainder of this paper is organized as follows. Section 2 presents composite sets of reduced SURF and DAISY features as an extension to SIFT features.
Section 3 describes the LSH scheme used for index construction. Section 4 details the K-means algorithm for descriptor clustering and the PCA trick for cluster prototypes. Section 5 provides the implementation detail of the proposed method. Section 6 shows experimental results and Section 7 ends this paper with conclusions and future work.
2 Composite Set of Reduced SURF and DAISY Features SURF excels at fast detection of interest points and robust description of local regions [8]. For the detector, first, integral images are computed in advance, making the calculation time independent of the filter sizes. Second, box filters are used to approximate the Hessian matrix, which saves time without much loss of repeatability. Third, the scale space is analyzed by up-scaling the filter size rather than iteratively reducing the image size, which improves computational efficiency without aliasing. For the descriptor, the sums of Haar wavelet responses around a keypoint are computed. SURF outperforms SIFT in most cases because SURF integrates the gradient information within a subpatch, while SIFT depends on the orientations of the individual gradients. SURF descriptors can therefore better describe the general characteristics of a local region of a fingerprint and are robust to local distortions, since fingerprint images are largely composed of parallel ridges and valleys, for which integral information along the horizontal and vertical axes matters more than the distribution of individual directions. Table 1 lists the computational time needed for SIFT and SURF features. SURF costs much less time, which is important both for off-line construction of the database index and for on-line feature extraction. In our work, SURF-64 and SURF-128 are used for the performance comparison. The DAISY feature was originally proposed for dense wide-baseline matching [9]. SIFT is powerful because of its use of gradient orientation histograms, which are relatively robust to distortions. The SURF descriptor approximates them by using integral images to compute the histogram bins, which is computationally effective but does not include SIFT's spatial weighting scheme. DAISY attempts to strike a balance between computational efficiency and performance by convolving orientation maps to compute the bin values; it replaces the weighted sums of gradient norms by convolutions of the original image with several oriented derivatives of Gaussian filters. In our scheme, the difference-of-Gaussian function from SIFT is applied for keypoint detection to generate stable points, and DAISY descriptors are used to describe the local patches. SURF detectors can generate hundreds of keypoints per fingerprint, including minutiae points as well as other distinctive points [8]. Furthermore, local descriptors can characterize the region around keypoints for accurate and efficient matching. These two properties ensure rich information about a fingerprint image; we believe this is because SURF partly represents Level 3 features [17]. For the task of fingerprint indexing and retrieval, the full set of detected keypoints is redundant, and a small subset suffices. Therefore, the scale factor is limited to be below a threshold, and the detection thresholds, especially the low-contrast threshold, are varied to discard candidate local peaks. Finally, the top N most significant keypoints ranked by contrast value are selected; about 100 points are enough for our task. Table 1. Computational time comparison (in seconds) on a Pentium 4 3 GHz in MATLAB
Image Size 256 × 364 388 × 374 288 × 384
SIFT 16.69 19.68 23.60
SURF 0.42 1.36 0.65
Two major problems occur in registering fingerprint images: one is partialness, since multiple impressions of the same finger acquired by sensors may have only a small region of overlap; the other is non-linear plastic distortion, due to the effect of pressing a convex elastic surface (the finger) onto a solid flat surface (the sensor) and the non-uniform pressure applied by subjects. It has been demonstrated that a composite template built from multiple impressions [10][12] can deal with these problems. The composite set of local features is constructed by integrating the reduced SURF and DAISY features extracted from M randomly selected impressions of the same finger, where M is determined by a tradeoff between the performance and the size of the database. Here M = 3 is preferred.
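A sketch of how such a reduced composite set could be assembled with OpenCV's SURF implementation is given below (an editorial, hedged example rather than the authors' code; it requires the opencv-contrib build, uses the keypoint response as a stand-in for the contrast value mentioned above, and the values N = 100 and M = 3 follow the text).

    import cv2
    import numpy as np

    def reduced_surf128(image, n_keep=100, hessian_threshold=400):
        # SURF-128 ("extended") descriptors; keep only the n_keep strongest keypoints.
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold, extended=True)
        keypoints, descriptors = surf.detectAndCompute(image, None)
        if descriptors is None:
            return np.empty((0, 128), dtype=np.float32)
        order = np.argsort([-kp.response for kp in keypoints])[:n_keep]
        return descriptors[order]

    def composite_set(impressions, m=3):
        # Merge the reduced descriptors of m impressions of the same finger.
        return np.vstack([reduced_surf128(img) for img in impressions[:m]])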
3 LSH for Index Construction The locality-sensitive hashing (LSH) is used to index the descriptors [14][18]. LSH is an approximate similarity search technique that works efficiently even for high-dimensional data. It solves the approximate nearest neighbor search problem, termed ε-NNS, in sub-linear time. Given a set P of points in a normed d-dimensional space, we want to preprocess P so as to efficiently return a point p in P for any given query point q, such that d(q, p) ≤ (1+ε)d(q, P), where d(q, P) is the distance of q to its closest point in P. This is accomplished using a set of locality-sensitive hash functions. Given a set of points P and a similarity function sim(x, y), a locality-sensitive hash function family F operates on P such that, for any x, y ∈ P, the probability
\mathrm{Prob}_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x, y).

The algorithm transforms the point space P into a Hamming space in which the L1 distance between points in the original space is preserved. It then builds a set of l hash functions, each of which selects k bits from the bit string; these k bits are hashed to index into the buckets of a hash table. The two parameters k and l enable us to select an appropriate tradeoff between accuracy and running time. In our experiments, we use k = 200 and l = 20.
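The following toy sketch (an editorial illustration, not the authors' code; descriptor coordinates are assumed scaled to [0, 1]) shows the bit-sampling flavour of LSH described above: coordinates are quantized and unary-coded into a bit string whose Hamming distance approximates the L1 distance, and each of the l hash tables keys its buckets on k randomly sampled bit positions.

    import numpy as np
    from collections import defaultdict

    class BitSamplingLSH:
        def __init__(self, dim, levels=32, k=200, l=20, seed=0):
            rng = np.random.default_rng(seed)
            self.levels = levels                              # quantization levels per coordinate
            n_bits = dim * levels                             # length of the unary bit string
            self.samples = [rng.choice(n_bits, size=k, replace=False) for _ in range(l)]
            self.tables = [defaultdict(list) for _ in range(l)]

        def _bits(self, x):
            # Unary embedding: a quantized value q maps to q ones followed by zeros,
            # so Hamming distance equals L1 distance between quantized vectors.
            q = np.clip((x * self.levels).astype(int), 0, self.levels)
            return np.concatenate([np.arange(self.levels) < qi for qi in q])

        def insert(self, x, item):
            b = self._bits(x)
            for table, sample in zip(self.tables, self.samples):
                table[b[sample].tobytes()].append(item)

        def query(self, x):
            b = self._bits(x)
            candidates = []
            for table, sample in zip(self.tables, self.samples):
                candidates.extend(table[b[sample].tobytes()])
            return candidates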
4 Clustering of Local Descriptors for Fast Retrieval In our scheme, SURF-128 is selected as the feature descriptor, because experiments on the FVC databases show that it performs better than current features for fingerprint retrieval. Fig. 1 shows the experimental results. From Fig. 1(a), we can see that SURF-128 performs the best: the retrieval accuracy reaches above 90% at a penetration rate of 10%, which is acceptable for fingerprint indexing and retrieval. Fig. 1(b) provides further evidence of the high performance of SURF-128, where the accuracy reaches around 99% at a penetration rate of 10%.
Fig. 1. Performance evaluation. (a) Retrieval results on FVC2002 DB1. (b) Retrieval results of SURF and DAISY compared with other approaches on FVC2000 DB2.
In spite of the excellent retrieval performance, the search time for a single fingerprint is unacceptably long, since the sheer number of descriptors in the database limits the speed. A simple solution is to use a divide-and-conquer technique, and clustering these descriptors is a case in point. The whole set of descriptors is partitioned into a number of non-overlapping clusters, on each of which the LSH tables are constructed. This approach is very much like the bag-of-words model for visual categorization [15][16]. To represent an image using the bag-of-words model, the image can be treated as a document; similarly, "words" in images need to be defined. This usually involves three steps: feature detection, feature description and codebook generation. The bag-of-words model can then be defined as a histogram representation based on independent features.
The first two steps are already part of our scheme. The final step converts descriptors to codewords, which also produces a codebook. A codeword can be considered a representative of several similar local regions. Codewords are defined as the centers of the learned clusters, i.e., the cluster prototypes, and the number of clusters is the codebook size. Thus, each descriptor is mapped to a certain codeword through the clustering process, and the image can be represented by the resulting histogram. We partition all the descriptors into non-overlapping clusters and calculate the prototypes so that similar descriptors fall into the same cluster and different ones into different clusters. One simple way to partition the descriptors is to perform K-means clustering over all of them. This algorithm proceeds by iteratively assigning points to their closest cluster centers and recomputing the cluster centers. Two difficulties are that the K-means algorithm converges to a local optimum of the criterion function and that it does not determine the parameter K. In our case we do not know the density or compactness of the clusters; moreover, we are more interested in the speed of retrieval than in a clustering that is correct in the sense of the feature distribution. K-means is therefore run several times with different sets of initial cluster centers; we set K = 150 and run the algorithm 3 times. The high dimension of SURF-128 accounts for the time cost of selecting the cluster prototype closest to an input descriptor. To further improve the retrieval speed, PCA is applied to create a more representative and concise subspace onto which the prototypes are projected. In the on-line retrieval mode, an input descriptor is projected onto this subspace so that it has the same dimensionality as the projected prototypes. PCA is not used before the K-means clustering for two reasons: first, we try to preserve as much information as SURF-128 can represent; second, the clustering procedure is off-line, so its time consumption can be tolerated. These considerations again reflect the tradeoff between accuracy and speed.
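A minimal sketch of this clustering and prototype-projection step with scikit-learn is given below (an editorial illustration; K = 150 and the 3 restarts follow the text, while the PCA dimension of 32 is an assumption not stated in the paper).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def build_clusters(all_descriptors, k=150, n_restarts=3, pca_dim=32, seed=0):
        # Partition all database descriptors into k groups (3 random restarts),
        # then fit PCA on the k cluster prototypes to obtain a compact subspace.
        km = KMeans(n_clusters=k, n_init=n_restarts, random_state=seed).fit(all_descriptors)
        pca = PCA(n_components=pca_dim).fit(km.cluster_centers_)
        reduced_centers = pca.transform(km.cluster_centers_)
        return km, pca, reduced_centers

    def closest_prototype(descriptor, pca, reduced_centers):
        # On-line: project the query descriptor onto the subspace and pick the nearest prototype.
        z = pca.transform(descriptor.reshape(1, -1))
        return int(np.argmin(np.linalg.norm(reduced_centers - z, axis=1)))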
5 Implementation of the Algorithms This section describes the implementation details of our scheme based on SURF-128 descriptors. There are two main phases: an off-line indexing phase and an on-line query phase. First, the composite sets of reduced descriptors of each fingerprint in the database are clustered into different groups and indexed via LSH. Second, the user can perform queries to find the corresponding fingerprint. Table 2. Speed comparison (measured by the average search time in seconds on a Pentium 4 3 GHz in MATLAB) between the method using descriptor clustering and PCA for selecting the prototype closest to an input descriptor and the method without this technique
              No clustering,   Cluster for    Cluster for   Cluster for   Cluster for   Cluster for
              100 fingers      100 fingers    80 fingers    50 fingers    20 fingers    10 fingers
2000 DB2      116.92           34.65          32.91         29.79         26.24         24.83
2002 DB1      88.27            29.81          32.57         29.86         26.16         24.81
Indexing: First, 3 fingerprints are randomly selected from the same finger, and an enhancement algorithm is applied to remove noise. Then the reduced local features are extracted. Next, the K-means clustering method partitions the whole set of descriptors into 150 groups, and PCA of the resulting prototypes is used to generate a subspace. Finally, an LSH indexing structure is created for each cluster.

Query: When the user queries a fingerprint image, enhancement and extraction of the reduced SURF-128 features are performed. Then, for each query descriptor, PCA is used to reduce its dimensionality and the closest cluster prototype is selected. Next, within the cluster that this prototype represents, we compute the bucket ids of each keypoint using the LSH functions and collect all the keypoints in those buckets to form a candidate list. These candidate keypoints are checked by the L2 norm to find the nearest keypoints of each query keypoint. Finally, the retrieved nearest keypoints vote for the fingers they belong to, which indicates the similarity between the query and the fingerprints in the database.
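Putting the query steps together, the on-line voting stage might look schematically as follows (an editorial sketch reusing the hypothetical helpers introduced above — closest_prototype and the per-cluster LSH indexes — rather than the authors' implementation; LSH items are assumed to be (finger id, descriptor) pairs).

    import numpy as np
    from collections import Counter

    def query_fingerprint(query_image, pca, reduced_centers, cluster_lsh, extract):
        # cluster_lsh[c] is the LSH index of cluster c; extract() returns the
        # reduced SURF-128 descriptors of the (enhanced) query image.
        votes = Counter()
        for d in extract(query_image):
            c = closest_prototype(d, pca, reduced_centers)   # nearest prototype after PCA
            candidates = cluster_lsh[c].query(d)             # approximate neighbours of d
            if not candidates:
                continue
            dists = [np.linalg.norm(d - desc) for _, desc in candidates]
            finger_id, _ = candidates[int(np.argmin(dists))]
            votes[finger_id] += 1                            # each query keypoint casts one vote
        return [fid for fid, _ in votes.most_common()]       # candidates ordered by similarity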
6 Experimental Results The performance comparison is evaluated on FVC2000 DB2 and FVC2002 DB1. Both databases contain images of 100 different fingers with 8 impressions per finger at 500 dpi resolution; the image sizes are 256x364 and 388x374, respectively. Retrieval performance and speed are our main focus. For the retrieval performance, both accuracy and efficiency are considered. In our scheme, accuracy is the percentage of input fingerprints whose corresponding fingerprints in the database are correctly retrieved. The retrieval efficiency is indicated by the so-called penetration rate, which is the average percentage of the database retrieved per input fingerprint. For the retrieval speed, we use the average time to retrieve a fingerprint from the database. Fig. 2 shows the performance when the descriptors are clustered for off-line indexing, PCA is applied to generate a subspace spanned by the cluster prototypes, and dimensionality reduction is used for selecting the prototype closest to an input descriptor during the on-line retrieval mode. From Fig. 2, we can see that the performance suffers little after these techniques are used, which shows that they are effective and applicable. For example, in Fig. 2(a), the difference in accuracy is less than 2% at a penetration rate of 10%; in Fig. 2(a), the performance even improves, which indicates the effectiveness of the clustering-based method. Table 2 shows the improvement in retrieval speed using the technique proposed in Section 4. From the first row, we can see that the speed improves by about a factor of 3 when the database contains 100 fingers, each with 3 prints. From the third column to the last, we find that the time cost exhibits an almost linear dependence on the data size. Since LSH solves the approximate nearest neighbor search problem in sub-linear time, our technique adds no extra cost to the order of growth of the running time. Experiments on FVC2002 DB1 also support our technique. The inverse variation of the retrieval time with the size of the database, visible from the 3rd to the 5th column, can be explained by the non-uniform distribution of data due to unstable clustering.
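The two evaluation measures can be written out explicitly. The short sketch below (an editorial illustration of the definitions in the text) computes the accuracy-versus-penetration curve of Fig. 2 from ranked candidate lists, where accuracy at a given penetration is the fraction of queries whose correct finger appears within that fraction of the database.

    import numpy as np

    def accuracy_vs_penetration(ranked_lists, true_ids, db_size,
                                penetrations=np.arange(0.01, 1.01, 0.01)):
        # ranked_lists[i]: candidate finger ids for query i, ordered by similarity;
        # true_ids[i]: the correct finger id; db_size: number of fingers in the database.
        accs = []
        for p in penetrations:
            top = max(1, int(round(p * db_size)))
            hits = sum(t in r[:top] for r, t in zip(ranked_lists, true_ids))
            accs.append(100.0 * hits / len(true_ids))
        return penetrations * 100.0, np.array(accs)        # both in percent, as in Fig. 2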
Fig. 2. Performance comparison of the method using descriptor clustering for indexing and PCA for dimensionality reduction against the method without these techniques. (a) Retrieval results on FVC2000DB2. (b) Retrieval results on FVC2002DB1.
On the whole, the experiments demonstrate that our method using clustering-based descriptors offers a good tradeoff between retrieval performance and speed, and that it is both effective and efficient.
7 Conclusions and Future Work
We have shown that SURF descriptors can be utilized for fingerprint feature extraction and indexing, and that SURF-128 is the most powerful descriptor for our task. The clustering-based composite sets of reduced SURF-128 features used for off-line index construction speed up fingerprint retrieval significantly. PCA, used for reducing the dimensionality of the cluster prototypes and of an input descriptor, cuts the time cost of selecting the closest prototype to the input descriptor in the on-line mode. Furthermore, LSH proves effective for similarity search in the high-dimensional space of SURF-128 vectors. It is possible to further improve the performance of these local descriptors if proper processing is performed, which requires reducing noise in such a way as to preserve the inherent texture information, including Level 3 features [17]. Similar descriptors from different fingerprints of the same finger are currently treated as different ones, which enlarges the size of an LSH table. If they could be merged into one representative descriptor to reduce redundancy, the speed would depend little on the number of prints used to construct a composite set, and it could be increased further. Our future work will focus on improving the performance of SURF-128 by applying an appropriate preprocessing stage, a better description of each keypoint, a better clustering-based method for search-space narrowing, and better indexing schemes.
Acknowledgments. This work was supported by research funds of NSFC No. 60572043 and the NKBRPC No. 2004CB318005.
References
1. Maltoni, D., Miao, D., Jain, A.K., Prabhakar, A.: Handbook of Fingerprint Recognition. Springer, New York (2003)
2. Wilson, C.L., Candela, G.T., Watson, C.I.: Neural Network Fingerprint Classification. J. Artif. Neural Netw. 1(2), 241–246 (1993)
3. Liu, M., Jiang, X., Kot, A.C.: Fingerprint Retrieval for Identification. IEEE Trans. Pattern Recognition 40(6), 1793–1803 (2007)
4. Boer, J.D., Bazen, A.M., Gerez, S.H.: Indexing Fingerprint Databases Based on Multiple Features. In: Proc. ProRISC, 12th Annual Workshop on Circuits, Systems and Signal Processing (2001)
5. Bhanu, B., Tan, X.: Fingerprint Indexing Based on Novel Features of Minutiae Triplets. IEEE Trans. Pattern Analysis and Machine Intelligence 25(5), 616–622 (2003)
6. Cappelli, R., Lumini, A., Miao, D., Maltoni, D.: Fingerprint Classification by Directional Image Partitioning. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 402–421 (1999)
7. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Trans. Pattern Analysis and Machine Intelligence 110(3), 1615–1630 (2005)
8. Bay, H., Ess, A., Tuytelaars, T., Gool, L.C.: Speed-up Robust Features (SURF). Computer Vision and Image Understanding 110(3), 346–359 (2007)
9. Tola, E., Lepetit, V., Fua, P.: A Fast Local Descriptor for Dense Matching. In: IEEE Proc. Computer Vision and Pattern Recognition (2008)
10. Shuai, X., Zhang, C., Hao, P.: Fingerprint Indexing Based on Composite Set of Reduced SIFT Features. In: IEEE Int. Conf. on Pattern Recognition (2008)
11. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. Int. Journal of Computer Vision 60(2), 91–110 (2004)
12. Jain, A.K., Ross, A.: Fingerprint Mosaicking. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (2002)
13. Foo, J.J., Sinha, R.: Pruning SIFT for Scalable Near-Duplicate Detection and Sub-Image Retrieval. In: Proc. 18th Australian Database Conf., pp. 63–71 (2007)
14. Gionis, A., Indyk, P., Motwani, R.: Similarity Search in High Dimensions via Hashing. In: Proc. 25th VLDB Conf., pp. 518–529 (1999)
15. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual Categorization with Bags of Keypoints. In: Proc. of ECCV International Workshop on Statistical Learning in Computer Vision (2004)
16. Leung, T., Malik, J.: Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision 43(1), 29–44 (2001)
17. Jain, A.K., Chen, Y., Demirkus, M.: Pores and Ridges: High-Resolution Fingerprint Matching Using Level 3 Features. IEEE Trans. Pattern Analysis and Machine Intelligence 29(1), 15–27 (2007)
18. Ke, Y., Sukthankar, R., Huston, L.: Efficient Near-Duplicate Detection and Sub-Image Retrieval. In: Proc. ACM Multimedia Conf., pp. 869–876 (2004)
Temporal-Spatial Local Gaussian Process Experts for Human Pose Estimation
Xu Zhao1, Yun Fu2, and Yuncai Liu1
1 Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
2 BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA
Abstract. Within a discriminative framework for human pose estimation, modeling the mapping from feature space to pose space is challenging as we are required to handle the multimodal conditional distribution in a high-dimensional space. However, to build the mapping, current techniques usually involve a large set of training samples in the learning process but are limited in their capability to deal with multimodality. In this work, we propose a novel online sparse Gaussian Process (GP) regression model combining both temporal and spatial information. We exploit the fact that for a given test input, its output is mainly determined by the training samples potentially residing in its neighbor domain in the input-output unified space. This leads to a local mixture GP experts system, where the GP experts are defined in the local neighborhoods with the variational covariance function adapting to the specific regions. For the nonlinear human motion series, we integrate the temporal and spatial experts into a seamless system to handle multimodality. All the local experts are defined online within very small neighborhoods, so learning and inference are extremely efficient. We conduct extensive experiments on the real HumanEva database to verify the efficacy of the proposed model, obtaining significant improvements over previous models.
1 Introduction
Recovering human pose from visual signals is a fundamental yet extremely challenging problem in computer vision research. A wide spectrum of real-world applications [1] in control, human computer interaction, multimedia communication, and surveillance scenarios motivate the endeavors to find robust and effective solutions to this problem. Among the large amount of studies on pose estimation, discriminative approaches [2] have recently seen a revival due to their flexible frameworks adapting to different learning methods and the ability of fast inference in real-world databases. Discriminative approaches for human pose estimation [3,4,5,6,7] aim to model the direct mapping from visual observations to pose configurations. The methods range from nearest-neighbor retrieval [8,9] and manifold learning [4] to regression [10,7] and probabilistic mixture of predictors [2,5]. However, all of the discriminative approaches have to face the difficult problem of how to effectively model a multimodal conditional distribution in a high-dimensional space with small size training data. Current techniques to deal with multimodality are mainly in the category of mixture of models. In [2,5], the conditional Bayesian Mixture of Experts (BME) was used to
represent the multimodal image-to-pose mapping. This model is flexible in modeling multimodality by introducing an input-sensitive gate function. However, this parametric model is prone to fail if the input dimension is too high. Moreover, the estimation accuracy of the BME model heavily depends on the distribution of training samples in ambiguous regions, so it is hard to obtain satisfactory results on small-size datasets. Recently, a few attempts have been made to estimate human pose by using Gaussian Process (GP) [11] algorithms, within both discriminative [12,7] and generative [13] frameworks. GP regression has proven to be a powerful tool in many applications. In the discriminative models, pose estimation is mainly built on the basis of GP regression. The model defines a prior probability distribution over an infinite function space. This leads to a non-linear probabilistic regression framework working along with the kernelized covariance function. The flexibility in kernel selection and the non-parametric nature of the GP model are advantageous for finding efficient solutions to pose estimation on small-size databases [13,7]. However, full GP regression suffers from two inevitable limitations: relatively expensive computational cost and the incapability to handle multimodality. To tackle the computational limitations, a lot of effort has been made on sparse approximations of the full GP [14,11]. These methods use only a subset of training inputs [15] or a set of inducing variables [16] to approximate the covariance matrix. Although the computational expenses are reduced by such approximations, the models still work within the global voting framework and might lack effective mechanisms to avoid the averaging effect. Another kind of method proposed to handle the above two limitations is the mixture of Gaussian process experts [17,18,19]. Similar to the mixture of experts architecture [20], in these models the input space is divided into different regions by a gating network, each of which is dominated by a specific GP expert. In this model, the cubic computing cost on the entire dataset is reduced to that on part of the data. In the meantime, the covariance functions are localized to adapt to different regions. However, learning the mixture of GP experts is intimately coupled with the gating network, and the determination of the gating network is another complex problem. In this paper, we propose a novel mixture of local GP experts model, utilizing both temporal and spatial information. Our method is inspired by the recent work on human pose inference using sparse GP regression [12]. In their model, the local experts are trained offline and the local regressors are defined online for each test point. Derived from the neighborhood of the test point in the appearance space, each local GP is defined to be consistent in the pose space. Unlike the mixture of GP experts, this model avoids the tremendous effort of computing the gating network. We generalize the localization strategy in [12] and design the local GP experts model with three contributions: (1) We propose to define the local GP experts in the unified input-output space, so that each GP expert is composed of samples that are localized in both the input and output spaces. This strategy is different from that proposed in [12], where the neighborhood is defined separately in the input and output spaces. Such a scheme is prone to fail for many-to-one mappings, because the neighborhood relationship in the output space would be changed in the input space.
In comparison, our model can flexibly handle the two-way multimodality. (2) We introduce the temporal local GP experts. In the unified space, we integrate the temporal and spatial experts into a whole to make prediction and handle multimodality.
(3) We evaluate the proposed Temporal-Spatial Local (TSL) GP model on the public real HumanEva database [21] and achieve significant improvements over both the full GP model and the local sparse GP model.
2 Local Gaussian Process Experts Model

2.1 Gaussian Process Regression

Gaussian process is the generalization of Gaussian distributions defined over infinite index sets [11]. Suppose we have a training dataset D = {(x_i, y_i), i = 1, ..., N}, composed of inputs x_i and noisy outputs y_i. We consider a regression model defined in terms of the function f(x) so that y_i = f(x_i) + ε_i, where ε_i ∼ N(0, β^{-1}) is a random noise variable and the hyperparameter β represents the precision of the noise. From the Gaussian assumption of the prior distribution over functions f(x), the joint distribution of the outputs Y = [y_1, ..., y_N]^T conditioned on the input values X = [x_1, ..., x_N]^T is given by

$p(Y|X) = \int p(Y|f, X)\, p(f|X)\, df = N(Y|0, K),$   (1)

where f = [f_1, ..., f_N]^T, f_i = f(x_i), and the covariance matrix K has elements

$K_{i,j} = k(x_i, x_j) + \beta^{-1} \delta_{ij},$   (2)

where δ_{ij} is the Kronecker delta function. In this paper, we use a kernel function k which is the sum of an isotropic exponential covariance function, a noise term and a bias term, all with hyperparameters θ̄. For a new test input x_*, the conditional distribution p(y_*|X, Y, x_*) = N(μ, σ) is a Gaussian distribution with mean and covariance given by

$\mu(x_*) = k_{*,\zeta} K_{\zeta,\zeta}^{-1} Y_{\zeta}, \qquad \sigma(x_*) = k_{*,*} - k_{*,\zeta} K_{\zeta,\zeta}^{-1} k_{\zeta,*},$   (3)

where ζ's are the indices of the N training inputs, K_{ζ,ζ} is the covariance matrix with elements given by (2) for i, j = 1, ..., N, the vector k_{ζ,*} = k_{*,ζ}^T is the cross-covariance of the test input and the N training inputs, and the scalar k_{*,*} = k(x_*, x_*) + β^{-1} is the covariance of the test input. Note that the mean (3) can be viewed as a weighted voting from the N training outputs:

$\mu(x_*) = \sum_{n=1}^{N} w_n y_n,$   (4)
where w_n is the nth component of $k_{*,\zeta} K_{\zeta,\zeta}^{-1}$. With this insight, we can view the GP regression as a voting process, where each training output has a weighted vote to determine what the test output should be.

2.2 Local Mixture of GP Experts

To reduce the computing cost and handle multimodality, we need to sparsify the full GP regression model. Current GP sparse techniques [11,14] mainly focus on globally sparsifying the full training dataset based on some selection criteria such as online learning
[22], greedy posterior maximization [23], maximum information gain [24], and matching pursuit [25]. By using this kind of method, the computational complexity of the full GP, O(N^3), is reduced to O(m^3) or O(Nm^2), where N and m are the sizes of the full training dataset and the selected subset, respectively. However, for very large databases, the reduction is not enough. Moreover, these ideas still work within the global voting framework. That means that for every test input, no matter which local distribution mode it belongs to, the selection of the training samples and the covariance function are global. Actually, for a specific test input, the training samples in its neighborhood usually have more impact on the prediction than those far from it. In the voting view, the weights of the local voters are bigger than the others (see (4)). In the GP model, the kernel function provides a metric to measure the similarity between the inputs. Ideally, this metric should be adjusted dynamically to adapt to different local regions. Motivated by the above considerations, we develop the local mixture of GP experts. Like the model proposed in [12], for a given test input, we select different local GP experts in its neighborhood. The training samples of each expert are also selected locally. These local experts build up a local mixture of GP experts system to make the prediction. To this end, in our model, the mean prediction for a given test input x_* is given by

$\mu(x_*) = \sum_{i=1}^{T} \pi_i\, k_{*,\zeta_i} K_{\zeta_i,\zeta_i}^{-1} Y_{\zeta_i} = \sum_{i=1}^{T} \sum_{j=1}^{S} \pi_i\, w_{ij}\, y_{ij},$   (5)

where T is the number of local experts, S is the size of each expert, ζ_i is the index set of samples for the i-th expert, π_i is the prediction weight of the i-th expert, y_{ij} is the j-th training output belonging to the i-th expert, and w_{ij} is its weight. Both T and S are parameters of our model. In practice, small values are sufficient for accurate predictions. Each π_i is set to be a function of the inverse variance of the expert's prediction. Different from the localization strategy in [12], our model defines the neighborhood in the input-output unified space U, where the data points are the concatenation of the input and output vectors. The advantages of our strategy are twofold: (1) The neighborhood relationship is closer to the real distribution in U than in the single input or output space. For example, in pose estimation, two image feature points which are very close in feature space might be quite different in pose space, and vice versa. In U, this kind of ambiguity can be avoided to a large extent. (2) Our strategy can deal with two-way multimodal distributions. For a many-to-one input-to-output mapping, the data points would be scattered in the input space when just using the neighborhood definition in the output space. But in U, this situation can be avoided. In implementation, the unified data space U is divided into R different local regions with a clustering algorithm. Each region is dominated by a local GP expert trained offline. Given a test input, starting from its neighborhood in the input space, we find its local neighbors in U to build the local mixture of GP experts model. The algorithm is summarized in Algorithm 1, where the data set in U is represented as D = [d_1, ..., d_N] with d_i = (x_i, y_i). The function findNN(X, x, S) finds the S nearest neighbors of x in X. The function kmeans(D, R) performs k-means clustering on data set D and returns the R centers C_R and clusters D_R.
Algorithm 1. Local mixture of GP experts: learning and inference
1. OFFLINE: training of the local experts
2. R: number of local GP experts; (C_R, D_R) = kmeans(D, R)
3. for i = 1 ... R do
4.     θ̄_i ⇐ min(− ln p(Y_{R_i} | X_{R_i}, θ̄_i))
5. end for
6. ONLINE: inference for test point x_*
7. T: number of experts; S: size of each expert
8. η = findNN(X, x_*, T)
9. for j = 1 ... T do
10.     ζ = findNN(D, d_{η_j}, S)
11.     t = findNN(C_R, d_{η_j}, 1)
12.     θ̄ = θ̄_t
13.     μ_j = k_{*,ζ} K_{ζ,ζ}^{-1} Y_ζ;   σ_j = k_{*,*} − k_{*,ζ} K_{ζ,ζ}^{-1} k_{ζ,*}
14. end for
15. p(y_* | X, Y) ≈ Σ_{i=1}^{T} π_i N(μ_i, σ_i^2)
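As a rough illustration of Algorithm 1, the following self-contained numpy sketch builds fixed-size local GP experts around a test point and averages their predictions. It is only a simplified reading of the algorithm under stated assumptions: a single shared RBF kernel with hand-set hyperparameters replaces the per-cluster hyperparameters θ̄_i learned offline, the input space stands in for the unified space U, and the expert weights π_i are taken proportional to the inverse predictive variance.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Isotropic RBF (exponential) covariance between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def local_gp_predict(X, Y, x_star, T=10, S=50, beta=100.0,
                     lengthscale=1.0, variance=1.0):
    """Local mixture of GP experts for one test input (simplified Algorithm 1).

    X: (N, dx) training inputs, Y: (N,) training outputs, x_star: (dx,) test input.
    T experts are anchored at the T nearest training inputs; each expert uses its
    S nearest samples (here in input space, standing in for the unified space U)."""
    D = X                                                     # stand-in for U
    eta = np.argsort(np.linalg.norm(X - x_star, axis=1))[:T]  # expert anchors
    mus, sigmas = [], []
    for j in eta:
        zeta = np.argsort(np.linalg.norm(D - D[j], axis=1))[:S]  # expert's samples
        Xz, Yz = X[zeta], Y[zeta]
        K = rbf_kernel(Xz, Xz, lengthscale, variance) + np.eye(len(zeta)) / beta
        k_star = rbf_kernel(x_star[None, :], Xz, lengthscale, variance)[0]
        K_inv = np.linalg.inv(K)
        mus.append(k_star @ K_inv @ Yz)
        sigmas.append(variance + 1.0 / beta - k_star @ K_inv @ k_star)
    mus, sigmas = np.array(mus), np.maximum(np.array(sigmas), 1e-9)
    pis = (1.0 / sigmas) / (1.0 / sigmas).sum()   # weights ~ inverse variance
    return float((pis * mus).sum())

# toy usage: 1-D regression
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.normal(size=200)
print(local_gp_predict(X, Y, np.array([0.3])))
```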
3 Temporal-Spatial Local GP Experts

In order to handle multimodality more effectively, on the basis of the spatial experts we introduce the temporal experts to construct the temporal-spatial combined mixture of GP experts model. In this model, the spatial local experts learn the relationship between the input space and the output space, while the temporal local experts explore the underlying context of the output space. Suppose we work with sequential data. By adding the temporal constraint, the regression models can be formulated as

$y_t = f(x_t) + \epsilon_{x,t} \quad \text{and} \quad y_t = g(y_{t-1}) + \epsilon_{y,t},$   (6)

where t is the temporal tag, and $\epsilon_{x,t} \sim N(0, \beta_x^{-1})$ and $\epsilon_{y,t} \sim N(0, \beta_y^{-1})$ are noise processes. We use a first-order Markov dynamical model to account for the dependence in the output space. For (6), considering the dynamic mapping on the data set Y = [y_1, ..., y_N]^T in the output space, the joint distribution of Y is given by

$p(Y) = p(y_1) \int \prod_{t=2}^{N} p(y_t | y_{t-1}, g)\, p(g)\, dg,$   (7)

where g = [g_1, ..., g_{N-1}]^T, g_i = g(y_i). In view of the nonlinear dynamical nature of human motion, we use an RBF plus linear kernel

$k(y_i, y_j) = \theta_0 \exp\!\left(-\frac{\theta_1}{2} \|y_i - y_j\|^2\right) + \theta_2 + \theta_3\, y_i^T y_j.$   (8)

To build the local temporal experts model, we use a localization strategy similar to that described in Algorithm 1. Once the local temporal experts give the prediction ŷ, we proceed to
Algorithm 2. Online inference with temporal-spatial local GP experts
Require: x*_t; ŷ_t: the output predicted at the last time instant
1. COMBINATION of the two classes of local experts
2. T_1: number of spatial experts; T_2: number of temporal experts; S: size of each expert
3. η^(s) = findNN(X, x*_t, T_1)
4. d̂_t = (x*_t, ŷ_t)
5. η^(t) = findNN(D, d̂_t, T_2)
6. η = η^(s) ∪ η^(t)
7. ONLINE inference
8. T = T_1 + T_2: total number of experts
9. for j = 1 ... T do
10.     ζ = findNN(D, d_{η_j}, S)
11.     t = findNN(C_R, d_{η_j}, 1)
12.     θ̄ = θ̄_t
13.     μ_j = k_{*,ζ} K_{ζ,ζ}^{-1} Y_ζ;   σ_j = k_{*,*} − k_{*,ζ} K_{ζ,ζ}^{-1} k_{ζ,*}
14. end for
15. p(y*_t | X, Y) ≈ Σ_{i=1}^{T} π_i N(μ_i, σ_i^2)
make the prediction supported by the local spatial experts in the unified space U. Formally, this process is described by

$p(y_t | y_{t-1}, x_t) = \int p(y_t | \hat{y}_t, x_t)\, p(\hat{y}_t | y_{t-1})\, d\hat{y}_t.$

In summary, we can build up the temporal-spatial combined local GP model as follows. Given the training data set X = [x_1, ..., x_N]^T and Y = [y_1, ..., y_N]^T, we first learn a set of hyperparameters {θ̄_i} for the local spatial GP experts following the process described in the offline part of Algorithm 1. Then, the local temporal models are built up in the same way using the training data Y_1 = [y_1, ..., y_{N-1}]^T and Y_2 = [y_2, ..., y_N]^T. At time instant t − 1, one obtains the prediction ŷ_t from the local temporal experts model. Then, at time instant t, we import x*_t and ŷ_t into our temporal-spatial combined local experts model to get the final prediction y*_t. The algorithm is described in Algorithm 2.

Computational complexity. We compare the computational complexity of our models with that of the full GP in Table 1. Note that for both learning and inference, our models are linear in N, stemming from the operators of finding nearest neighbors (O(RN)) and k-means clustering (O(RdN)). The complexity of inverting the local GP is not a function of the number of examples, since the local GP experts are of fixed size.

Table 1. Computational complexity: both local models are linear in N for both learning and inference, where d is the dimension of the data points. In experiments, T, S, R ≪ N.

             Full GP   Local Sparse GP Experts   Temporal-Spatial Local GP Experts
Learning     O(N^3)    O(RS^3 + R(d+1)N)         O(2RS^3 + 2R(d+1)N)
Inference    O(N^3)    O(TS^3 + TN)              O(TS^3 + TN)

When N ≫ S,
the computational cost is significantly reduced. Moreover, R is in general also small compared to N; therefore the complexity of our model is much smaller than that of the full GP. This is computationally beneficial when dealing with very large databases.
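To make the temporal-spatial combination concrete, the sketch below carries out one time step: a temporal prediction ŷ_t is assumed to have been formed from y_{t−1}, the point (x*_t, ŷ_t) is used to gather neighbors in the unified space, and the spatial and temporal experts are pooled. It is only a schematic reading of Algorithm 2 under simplifying assumptions (a single shared RBF kernel, inverse-variance expert weights), not the authors' implementation.

```python
import numpy as np

def nearest_indices(A, a, k):
    """Indices of the k rows of A nearest to vector a (Euclidean distance)."""
    return np.argsort(np.linalg.norm(A - a, axis=1))[:k]

def tsl_gp_step(X, Y, x_star_t, y_hat_t, T1=5, T2=5, S=50, beta=100.0, ell=1.0):
    """One on-line step of a temporal-spatial local GP mixture (schematic).

    X: (N, dx) training features, Y: (N, dy) training poses.
    x_star_t: (dx,) current feature, y_hat_t: (dy,) temporal prediction from y_{t-1}.
    Experts are anchored at T1 spatial neighbors of x_star_t and T2 neighbors of
    the unified point (x_star_t, y_hat_t); each expert is a small GP of size S."""
    U = np.hstack([X, Y])                      # unified input-output space
    d_hat = np.concatenate([x_star_t, y_hat_t])
    anchors = np.union1d(nearest_indices(X, x_star_t, T1),
                         nearest_indices(U, d_hat, T2))
    mus, inv_vars = [], []
    for a in anchors:
        zeta = nearest_indices(U, U[a], S)     # expert's local samples
        Xz, Yz = X[zeta], Y[zeta]
        d2 = ((Xz[:, None] - Xz[None, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * d2 / ell**2) + np.eye(len(zeta)) / beta
        ks = np.exp(-0.5 * ((Xz - x_star_t) ** 2).sum(-1) / ell**2)
        Kinv = np.linalg.inv(K)
        mus.append(ks @ Kinv @ Yz)             # (dy,) mean of this expert
        var = max(1.0 + 1.0 / beta - ks @ Kinv @ ks, 1e-9)
        inv_vars.append(1.0 / var)
    w = np.array(inv_vars); w /= w.sum()       # weights ~ inverse predictive variance
    return (w[:, None] * np.array(mus)).sum(0) # pooled pose prediction y*_t
```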
4 Experiments

4.1 Regression on the Multimodal Functions

In these experiments, the full GP, the local sparse (LS) GP (Algorithm 1), and the Temporal-Spatial Local (TSL) GP (Algorithm 2) are tested on two sets of toy data (see the caption of Fig. 1 for a detailed description of the data sets). The regression results are shown in Fig. 1. We find that for the multimodal function (first row of Fig. 1), the full GP simply averages the outputs of the different modes globally. The local sparse GP can partly handle the multimodality and avoid the averaging effect, but the outputs frequently skip between different modes in the multimodal regions (see Fig. 1(b)); therefore it is hard to get a smooth prediction. This problem is fixed in the TSL GP model thanks to the utilization of temporal information: notice that in Fig. 1(c) the skips are eliminated and the prediction is smooth. The other data set provides a unimodal input-to-output mapping. The regression results are illustrated in Fig. 1(d-f). In this situation, the full GP
[Fig. 1 plots: output y versus input x for the two toy data sets; columns titled Full GP, Local online GP, TSC local online GP; legends: training samples, test samples; panels (a)-(f).]
Fig. 1. Model comparisons between full GP, Local Sparse GP, and TSL GP on two sets of illustrative data. The first data set consists of about 200 training pairs (x, y), where y is generated uniformly in (0, 1) and evaluated as x = y + 0.3 sin(2πy) + ε, with ε drawn from a zero-mean Gaussian with standard deviation 0.05. Notice that here p(y|x) is multimodal. Test points (N_t = 200) are sampled uniformly from (0, 1). The second data set is obtained by sampling (N = 100) a GP with a covariance matrix obtained from an RBF. About 200 test inputs are sampled uniformly in (−7.5, 7.5). The regression results are shown in: (a,d) Full GP. (b,e) Local Sparse GP. (c,f) TSL GP. For better viewing, please see the enlarged color pdf file.
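The first toy data set in Fig. 1 is easy to reproduce; the following short snippet (an illustrative sketch, not the authors' code) generates the multimodal training pairs in the way the caption describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Multimodal toy set: y uniform in (0, 1), x = y + 0.3*sin(2*pi*y) + noise,
# so that the inverse mapping p(y | x) has several branches (modes).
N = 200
y = rng.uniform(0.0, 1.0, N)
x = y + 0.3 * np.sin(2.0 * np.pi * y) + rng.normal(0.0, 0.05, N)

# Test inputs are sampled uniformly from (0, 1), as in the caption.
x_test = rng.uniform(0.0, 1.0, 200)
```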
Table 2. Average RMS error (in degree) over all joint angles for walking, box, and jog actions of the three subjects. The performances of four regression models are evaluated.

            S1: Walking  Box     Jog     | S2: Walking  Box     Jog     | S3: Walking  Box     Jog
Full GP     6.4155       5.9226  6.3579  | 5.6821       5.0510  3.5510  | 7.0424       7.2785  2.5060
LS-GP(S)    6.6130       5.6815  6.2040  | 5.4859       4.9035  3.4183  | 6.9563       7.0344  2.5219
LS-GP(U)    6.3567       5.5951  6.1352  | 5.4498       4.6334  3.2458  | 6.7356       6.9226  2.3725
TSL-GP      5.5846       5.2913  5.0348  | 4.7816       4.4119  2.5085  | 6.0349       6.2152  2.0682
gives perfect results because the global voting mechanism can deal with the unimodal mapping very well. Here, the local sparse GP also gives good results, although some jitters still exist. The prediction of the TSL GP is smoother than that of the LS GP model.

4.2 Results on the HumanEva Database

We also validate our models on the HumanEva database [21]. The database provides synchronized video and motion capture streams. It contains multiple subjects performing a set of predefined actions with repetitions. The database was originally partitioned into training, validation, and testing sub-sets. We use sequences in the original training sub-set for training and those in the original validation sub-set for testing. A total of 2,932 frames for the walking motion, 2,050 frames for the jog motion, and 1,889 frames for the box motion are used. The pose is represented by Euler angles, and the dimension of the pose space (output space) is 26. We use patch-based image features described by the SIFT descriptor on dense interest points with position information. The dimension of the feature space is 100. All the images we used are captured by camera C1. We report the mean RMS absolute difference errors between the true and estimated joint angles, in degrees. The performance of four models is evaluated: the full GP, LS-GP(U) defined in the unified space, LS-GP(S) defined in the separate spaces (proposed in [12]), and TSL-GP. In the experiments, we take the values of R, T, S as 100, 10, 50, respectively. The results are reported in Table 2. It is obvious that the TSL-GP model outperforms the other models with significant improvements. The other two local GP models are slightly better than the full GP. We also find that in the unified space the local GP gains some performance improvement, although it is not very distinct. Figs. 2-3 show the performance comparisons among three models, full GP, LS-GP(U), and TSL-GP, in terms of relative errors (normalized by the range of variation of each joint angle), where the errors are averaged over all the subjects but reported separately for the three actions. For most joint angles, the TSL-GP model gets the best performance. The performance of LS-GP(U) is better than that of the full GP model. In Fig. 4, the estimation results and the ground truth of two joint angles over the whole sequence of the walking and jog actions are plotted. We compare the results of the full GP and TSL-GP. It can be observed that, by using the temporal information, the curves of the TSL-GP model are smoother and closer to the ground truth than those of the full GP model.
[Fig. 2 plot: relative error versus joint angle index; curves: Full-GP, LS(U)-GP, TSL-GP.]
Fig. 2. Performance comparison of three models for the walking action
[Fig. 3 plots: relative error versus joint angle index, left and right panels; curves: Full-GP, LS(U)-GP, TSL-GP.]
Fig. 3. Performance comparisons of three models for the jog (left) and box (right) actions
[Fig. 4 plots: joint angle (degrees) versus frame index; curves: Full-GP, TSL-GP, ground truth; panels (a) and (b).]
Fig. 4. Curve comparisons of joint angles: ground truth and estimations with TSL-GP and Full GP regression. (a) Left shoulder (x-axis) of subject S2 in walking action. (b) right hip (x-axis) of subject S3 in jog action.
5 Conclusions In this paper, we presented a novel temporal-spatial combined local GP experts model for efficient estimation of 3D human pose from monocular images. The proposed model is essentially a kind of mixture of GP experts in which we integrate both spatial and temporal information into a seamless system to handle multimodality. The local experts are trained in the local neighborhood. Different from previous work, the neighbor relationship is defined in the unified input-output space in this model. Therefore we can flexibly handle two-way multimodality. Both spatial and temporal local experts are defined online within very small neighborhoods, so learning and inference are extremely efficient. We conducted the experiments on the real HumanEva database to validate the efficacy
of the proposed model and achieved accurate results. The model is general purpose; therefore, its adaptation to other problems is straightforward.
References
1. Moeslund, T., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. CVIU 104(2-3), 90–126 (2006)
2. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3d human motion estimation. In: CVPR (2005)
3. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. PAMI 28(1), 44–58 (2006)
4. Elgammal, A., Lee, C.: Inferring 3D body pose from silhouettes using activity manifold learning. In: CVPR (2004)
5. Ning, H., Wei, X., Gong, Y., Huang, T.: Discriminative learning of visual words for 3D human pose estimation. In: CVPR (2008)
6. Bissacco, A., Yang, M.H., Soatto, S.: Fast human pose estimation using appearance and motion via multi-dimensional boosting regression. In: CVPR (2007)
7. Zhao, X., Ning, H., Liu, Y., Huang, T.: Discriminative estimation of 3D human pose using Gaussian processes. In: ICPR (2008)
8. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parameter-sensitive hashing. In: ICCV (2003)
9. Tomasi, C., Petrov, S., Sastry, A.: 3d tracking = classification + interpolation. In: ICCV (2003)
10. Agarwal, A., Triggs, B.: 3D human pose from silhouettes by relevance vector regression. In: CVPR (2004)
11. Rasmussen, C., Williams, C.: Gaussian processes for machine learning (2006)
12. Urtasun, R., Darrell, T.: Local probabilistic regression for activity-independent human pose inference. In: CVPR (2008)
13. Urtasun, R., Fleet, D., Hertzmann, A., Fua, P.: Priors for people tracking from small training sets. In: ICCV (2005)
14. Quiñonero-Candela, J., Rasmussen, C.: A unifying view of sparse approximate Gaussian process regression. JMLR 6, 1939–1959 (2005)
15. Lawrence, N., Seeger, M., Herbrich, R.: Fast sparse Gaussian process methods: The informative vector machine. In: NIPS (2003)
16. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: NIPS (2006)
17. Rasmussen, C., Ghahramani, Z.: Infinite mixtures of Gaussian process experts. In: NIPS (2002)
18. Tresp, V.: Mixtures of Gaussian processes. In: NIPS (2000)
19. Meeds, E., Osindero, S.: An alternative infinite mixture of Gaussian process experts. In: NIPS (2001)
20. Jacobs, R., Jordan, M., Nowlan, S., Hinton, G.: Adaptive mixtures of local experts. Neural Computation 3(1), 79–87 (1991)
21. Sigal, L., Black, M.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Tech. Report CS-06-08, Brown University (2006)
22. Csato, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computation 14(3), 641–668 (2002)
23. Smola, A., Bartlett, P.: Sparse greedy Gaussian process regression. In: NIPS (2001)
24. Seeger, M., Williams, C., Lawrence, N.: Fast forward selection to speed up sparse Gaussian process regression. In: Proc. of the Ninth Int'l Workshop on AI and Statistics (2003)
25. Keerthi, S., Chu, W.: A matching pursuit approach to sparse Gaussian process regression. In: NIPS (2006)
Finger-Vein Recognition Based on a Bank of Gabor Filters Jinfeng Yang, Yihua Shi, and Jinli Yang Tianjin Key Lab for Advanced Signal Processing Civil Aviation University of China, P.O. Box 9, Tianjin, China, 300300
[email protected]
Abstract. This paper presents a new finger-vein based method of personal identification. A reliable finger-vein region suitable for recognition is first acquired using our homemade imaging system. To exploit the finger-vein characteristics with high randomicity, a bank of Gabor filters specific to finger-vein analysis is then designed. Based on the spatial filtered images, finger-vein feature vectors are constructed for describing finger-vein characteristics in two filter scales. Finally, a fusion scheme in decision level is adopted to improve the reliability of identification. Experimental results are given to show the effectiveness of the proposed method in personal identification. Keywords: Biometrics, finger-vein recognition, Gabor filter.
1 Introduction
Finger veins are subcutaneous structures that randomly develop into a network and spread along a finger [1]. This physiological property makes the finger-vein characteristic very suitable for biometric applications, so exploiting finger-vein features for personal identification is becoming a new hot topic in the biometrics-based identification research community. Compared with other traditional biometric characteristics (such as face, iris, fingerprints, etc.), finger veins exhibit some excellent advantages in application. For instance, apart from uniqueness, universality, active liveness, permanence and measurability, finger-vein based personal identification systems are immune to counterfeit fingers and noninvasive to users. Hence, finger-vein recognition is widely considered the most promising biometric technology for the future. As finger veins are internal, visible light is incapable of imaging them. To visualize the veins in a finger, near-infrared (NIR) light (760-850 nm) is often used in finger-vein image acquisition systems, since it can penetrate a finger and is absorbed greatly by the deoxyhemoglobin in veins [2], [3]. The homemade finger-vein image acquisition system used in our application is shown in Fig. 1. The luminaire contains the main NIR light-emitting diodes (LEDs) and two additional LEDs at 760 nm wavelength, and a CCD sensor is placed under the finger. To reduce the variations of imaging poses, a position sensor is set to light
[Fig. 1 schematic: finger illuminated by the main and additional NIR LEDs, with the CCD sensor and position sensor below and the output window W.]
Fig. 1. The proposed principle of a homemade finger-vein imaging system
an indicator lamp when a finger is available. In practice, the finger-vein images captured in the NIR-light imaging mode have low contrast due to light attenuation arising from absorption by other tissues in the finger. Nowadays, much work has been done on finger-vein-based personal identification [4], [5], [6], [7], [8]. In previous approaches, segmenting the finger-vein network is a common way to carry out the finger-vein feature extraction task. However, the segmentation results are often unsatisfying and highly sensitive to noise due to the low quality of finger-vein images. Considering the high performance of Gabor filters in face [9], iris [10], fingerprint [11] and palmprint [12] recognition, a feature extraction method based on Gabor filters specific to finger-vein images is proposed in this paper. Instead of trying to segment the finger-vein network laboriously, we use a bank of Gabor filters to analyze the underlying finger-vein characteristics, and both local and global finger-vein features are extracted conveniently. To improve the reliability of identification, the finger-vein features are exploited at two scales of Gabor filters, and a fusion scheme at the decision level is adopted accordingly. Experimental results show that the proposed method performs well in personal identification.
2 Finger-Vein Image Preprocessing
Image preprocessing is always a key step for feature extraction. In this section, only some basic techniques in image normalization, enhancement and noise reduction are discussed, considering that finger-vein segmentation is not our aim.

2.1 Finger-Vein Region Localization
In the proposed finger-vein image acquisition system, it is convenient to crop a finger-vein image region from the CCD image plane using a preset window (denoted by W1 in Fig. 2). In our experiment, we find that only a part of the whole region cropped by W1 provides valuable information for recognition (see the middle of Fig. 2). Near the finger tip, most of the vein vessels usually vanish gradually. So, a reliable subimage for finger-vein recognition is cropped using a location window (denoted by W2 in Fig. 2). W2 has a fixed height and an adjustable
Fig. 2. Finger-vein image acquisition. Left: An image captured by a CCD sensor. Middle: Finger-vein image cropped by W1 . Right: Finger-vein region cropped by W2 .
width in order to locate a reliable finger-vein region correctly. According to the variation of pixel values, it is effortless to locate two points (denoted by p1 and p2, respectively) when the two sides of W2 (denoted by the two dashed lines) move bilaterally along the width of W1, as shown in the middle of Fig. 2.

2.2 Finger-Vein Image Normalization and Contrast Enhancement
Fingers vary greatly in shape, not only across different people but also for an identical individual. The subimages cropped by W2 are therefore various in size. To achieve more accurate identification results, all subimages are resized to 180 × 100 pixels. Generally, a finger-vein image has low contrast and non-uniform illumination due to the finger-vein imaging manner and the variation in finger thickness, which is not helpful for feature extraction. To improve the contrast of the normalized image as well as to compensate for the non-uniform illumination in an automatic manner, the nonlinear method proposed in [13] is modified here to its opposite direction and used to correct pixels adaptively. To reduce the noise generated by these image operations, a median filter with a 3 × 3 mask is applied accordingly. Fig. 3 shows that the finger veins appear clearer than before preprocessing.
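A rough preprocessing pipeline along these lines can be sketched as follows. It is only an assumption-laden illustration: the global gamma-style correction stands in for the modified adaptive nonlinear correction of [13], whose exact form is not reproduced here, and OpenCV is used purely for convenience.

```python
import cv2
import numpy as np

def preprocess_finger_vein(roi, gamma=0.8):
    """Illustrative preprocessing of a cropped finger-vein region.

    roi: 8-bit grayscale region cropped by the location window W2.
    The adaptive nonlinear correction of [13] is approximated here by a simple
    power-law (gamma) correction; this is an assumption, not the paper's method."""
    # 1. normalize the size of the subimage
    img = cv2.resize(roi, (180, 100), interpolation=cv2.INTER_LINEAR)

    # 2. nonlinear intensity correction to lift the low-contrast vein ridges
    x = img.astype(np.float32) / 255.0
    corrected = np.power(x, gamma)
    img = np.clip(corrected * 255.0, 0, 255).astype(np.uint8)

    # 3. remove noise introduced by the operations above with a 3x3 median filter
    return cv2.medianBlur(img, 3)
```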
Fig. 3. Contrast enhancement and noise reduction of a normalized image
3 Finger-Vein Feature Extraction
In the spatial domain, Gabor filters have been widely used for content-based image analysis and have been demonstrated to be powerful in capturing specific texture characteristics in images. A bank of Gabor filters is therefore designed to acquire the finger-vein features in the spatial domain.
3.1 Gabor Filter
A two-dimensional Gabor filter is a composite function with two components: a Gaussian-shaped function and a complex plane wave [14]. It is defined as follows:

$G(x, y, \gamma, \sigma, f_0, \theta) = \frac{\gamma}{2\pi\sigma^2} \exp\!\left(-\frac{x_\theta^2 + \gamma^2 y_\theta^2}{2\sigma^2}\right) \exp(\hat{j}\, 2\pi f_0 x_\theta),$   (1)

where

$\begin{pmatrix} x_\theta \\ y_\theta \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix},$   (2)

$\hat{j} = \sqrt{-1}$, θ is the orientation of the Gabor filter, f_0 denotes the filter center frequency, σ and γ respectively represent the standard deviation (often called the scale) and the aspect ratio of the elliptical Gaussian envelope, and x_θ and y_θ are rotated versions of the coordinates (x, y) of the Gabor filter. Determining the values of the four parameters f_0, σ, γ and θ usually plays an important role in making Gabor filters suitable for specific applications. Using the Euler formula, a Gabor filter can be decomposed into a real part and an imaginary part. The real part, usually called the even-symmetric Gabor filter (denoted by $G^e_\bullet(\bullet)$ in this paper), is suitable for ridge detection in an image, while the imaginary part, usually called the odd-symmetric Gabor filter, is beneficial for edge detection. Since the finger veins appear as dark ridges in the image plane, even-symmetric Gabor filters are used here to exploit the underlying features of the finger-vein network. Certainly, the orientations and diameters of the finger veins vary in practice. A bank of even-symmetric Gabor filters with two scales, eight channels and eight center frequencies is therefore designed to spatially filter a normalized finger-vein image. The even-symmetric Gabor filter is represented as
$G^e_{mk}(x, y, \gamma, \sigma_m, f_{mk}, \theta_k) = \rho \exp\!\left(-\frac{x_{\theta_k}^2 + \gamma^2 y_{\theta_k}^2}{2\sigma_m^2}\right) \cos(2\pi f_{mk} x_{\theta_k}),$   (3)

where $\rho = \gamma / 2\pi\sigma_m^2$, m (= 1, 2) is the scale index, k (= 1, 2, ..., 8) is the channel index, and θ_k (= kπ/8) and f_{•k} respectively denote the orientation and the center frequency of an even-symmetric Gabor filter at the kth channel. Assume that I(x, y) denotes a normalized finger-vein image and $F_{\gamma,\sigma_m,f_{mk},\theta_k}(x, y)$ denotes the filtered I(x, y); then we obtain

$F_{\gamma,\sigma_m,f_{mk},\theta_k}(x, y) = G^e_{mk}(x, y, \gamma, \sigma_m, f_{mk}, \theta_k) * I(x, y),$   (4)
where ∗ denotes convolution in two dimensions. Thus, for a normalized finger-vein image, sixteen filtered images are generated by the bank of Gabor filters. Some results are shown in Fig. 4.
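For illustration, a minimal numpy construction of such an even-symmetric Gabor bank is sketched below. The kernel half-width, the single center frequency f0 and the placeholder σ values are assumptions chosen for readability; the parameter choices actually used in the paper are discussed in Section 5.1.

```python
import numpy as np

def even_gabor_kernel(size, sigma, gamma, f0, theta):
    """Even-symmetric (cosine) Gabor kernel, cf. Eq. (3); 'size' is the half-width."""
    y, x = np.mgrid[-size:size + 1, -size:size + 1].astype(np.float64)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    rho = gamma / (2.0 * np.pi * sigma ** 2)
    return rho * np.exp(-(x_t ** 2 + gamma ** 2 * y_t ** 2) / (2.0 * sigma ** 2)) \
               * np.cos(2.0 * np.pi * f0 * x_t)

def gabor_bank(sigmas=(5.0, 6.0), gamma=1.0, f0=0.1, n_channels=8, size=15):
    """Two scales x eight orientations; a single illustrative center frequency f0
    is used here, whereas the paper lets f_mk vary with the channel."""
    bank = []
    for sigma in sigmas:
        for k in range(1, n_channels + 1):
            theta = k * np.pi / n_channels
            bank.append(even_gabor_kernel(size, sigma, gamma, f0, theta))
    return bank   # 16 kernels to convolve with the normalized image

# filtering a normalized image I (e.g. with scipy.signal.convolve2d) yields the
# sixteen responses F referred to in Eq. (4)
```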
3.2 Finger-Vein Feature Vector
According to the above discussion, the outputs of the Gabor filters at the mth scale form an 8-dimensional vector at each point in I(x, y). Considering that the
Fig. 4. The output of the 2D convolution of a normalized finger-vein image and even-symmetric Gabor filters
discriminability of finger-vein images arises from the diversity of the finger-vein networks, an 8-dimensional vector based on the statistical information in a 10 × 10 small block of a filtered image is constructed instead of a pixel-based vector. Thus, for a normalized finger-vein image, 360 ([18 × 10] × 2) vectors can be extracted from the 180 blocks filtered at two scales. Assume that H_{18×10} represents the block matrix of a filtered image; the statistics based on a block H_{ij} (a component of H in the ith column and the jth row, where i = 1, 2, ..., 10 and j = 1, 2, ..., 18) can then be computed. Here, the average absolute deviation (AAD) $\delta_{ij}^{mk}$ of the magnitudes of $F_{\gamma,\sigma_m,f_{mk},\theta_k}(x, y)$ corresponding to H_{ij} is used and calculated as

$\delta_{ij}^{mk} = \frac{1}{K} \sum_{H_{ij}} \Big| |F_{\gamma,\sigma_m,f_{mk},\theta_k}(x, y)| - \mu_{ij}^{mk} \Big|, \qquad \mu_{ij}^{mk} = \frac{1}{K} \sum_{H_{ij}} |F_{\gamma,\sigma_m,f_{mk},\theta_k}(x, y)|,$   (5)

where K is the number of pixels in H_{ij} and $\mu_{ij}^{mk}$ is the mean of the magnitudes of $F_{\gamma,\sigma_m,f_{mk},\theta_k}(x, y)$ in H_{ij}. Some results are shown in Fig. 5.
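The block statistics of Eq. (5) amount to a simple reduction over non-overlapping 10 × 10 blocks. The snippet below is a small illustrative helper (not the authors' code) that turns one 180 × 100 filtered response into the 18 × 10 grid of AAD values.

```python
import numpy as np

def block_aad(response, block=10):
    """Average absolute deviation (Eq. 5) of |response| over non-overlapping blocks.

    response: one filtered image of shape (180, 100) -> returns an (18, 10) array."""
    mag = np.abs(response)
    H, W = mag.shape
    # reshape into (rows, block, cols, block) so each block can be reduced at once
    blocks = mag[:H - H % block, :W - W % block] \
        .reshape(H // block, block, W // block, block)
    mu = blocks.mean(axis=(1, 3), keepdims=True)          # per-block mean magnitude
    return np.abs(blocks - mu).mean(axis=(1, 3))          # per-block AAD

# stacking block_aad over the 8 channels of one scale gives the 8-dimensional
# AAD vector attached to every block H_ij
```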
Fig. 5. The average absolute deviations (AADs) in [18 × 10] × 8 blocks
Thus, the vector matrix at the mth scale of the Gabor filters can be represented by

$\mathbf{V}_m = \begin{bmatrix} \vec{v}^{\,m}_{11} & \cdots & \vec{v}^{\,m}_{1N} \\ \vdots & \vec{v}^{\,m}_{ij} & \vdots \\ \vec{v}^{\,m}_{M1} & \cdots & \vec{v}^{\,m}_{MN} \end{bmatrix}_{18 \times 10}, \qquad \text{where } \vec{v}^{\,m}_{ij} = [\delta_{ij}^{m1}, \cdots, \delta_{ij}^{mk}, \cdots, \delta_{ij}^{m8}].$   (6)
According to Eq. 6, uniting all the vectors together can form a 2880(360 × 8) dimensional vector. This is not beneficial for feature matching besides losing global
information of the veins. Hence, based on V_m, a new feature matrix is constructed as follows:

$\mathbf{U}_m = \begin{bmatrix} \|\vec{v}^{\,m}_{11}\| & \alpha^m_{1(1,2)} & \cdots & \|\vec{v}^{\,m}_{1j}\| & \alpha^m_{1(j,j+1)} & \cdots & \|\vec{v}^{\,m}_{1N}\| \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \|\vec{v}^{\,m}_{i1}\| & \alpha^m_{i(1,2)} & \cdots & \|\vec{v}^{\,m}_{ij}\| & \alpha^m_{i(j,j+1)} & \cdots & \|\vec{v}^{\,m}_{iN}\| \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \|\vec{v}^{\,m}_{M1}\| & \alpha^m_{M(1,2)} & \cdots & \|\vec{v}^{\,m}_{Mj}\| & \alpha^m_{M(j,j+1)} & \cdots & \|\vec{v}^{\,m}_{MN}\| \end{bmatrix}_{18 \times 19},$   (7)

where M = 18, N = 10, ‖·‖ denotes the Euclidean norm of a vector, and $\alpha^m_{i(j,j+1)}$ is the angle between two adjacent vectors in the ith row. Obviously, the vector norms can well represent the local features, and the angles are capable of describing the relations between two adjacent blocks. Hence, the matrix U_m is able to represent the local and global features of a finger-vein network at the mth scale. For convenience, the components of the matrix U_m are rearranged by rows to form a 1D feature vector R_m, which is called here a finger-vein feature vector (FFV):

$\mathbf{R}_m = [\|\vec{v}^{\,m}_{11}\|, \alpha^m_{1(1,2)}, \cdots, \|\vec{v}^{\,m}_{ij}\|, \alpha^m_{i(j,j+1)}, \cdots, \|\vec{v}^{\,m}_{MN}\|]^T_{1 \times 342}, \quad (i = 1, 2, \cdots, M;\; j = 1, 2, \cdots, N;\; m = 1, 2).$   (8)
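A compact way to read Eqs. (6)-(8) is as a per-row interleaving of block-vector norms and angles between neighboring blocks. The helper below is an illustrative sketch (hypothetical name, not the authors' implementation) that builds one FFV from the 18 × 10 × 8 array of AAD values at a single scale.

```python
import numpy as np

def finger_vein_feature_vector(V):
    """Build the FFV R_m of Eq. (8) from V of shape (18, 10, 8).

    For each of the 18 rows, the 10 block-vector norms are interleaved with the
    9 angles between adjacent 8-dimensional block vectors, giving 18*19 = 342 values."""
    M, N, _ = V.shape
    features = []
    for i in range(M):
        for j in range(N):
            v = V[i, j]
            features.append(np.linalg.norm(v))          # local feature: vector norm
            if j + 1 < N:                                # relation to the next block
                w = V[i, j + 1]
                cos = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-12)
                features.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.asarray(features)                          # length 342 when M=18, N=10
```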
4 Finger-Vein Recognition
Like face, iris, and fingerprint recognition, finger-vein recognition is also based on pattern classification. Hence, the discriminability of the proposed FFV determines its reliability in personal identification. To test the discriminability of the extracted FFVs at the mth scale, the nearest center classifier is adopted here for classification. The classifier is defined as

$\tau = \arg\min_{\mathbf{R}^\varsigma_m \in C_\varsigma} \varphi(\mathbf{R}_m, \mathbf{R}^\varsigma_m), \qquad \varphi(\mathbf{R}_m, \mathbf{R}^\varsigma_m) = 1 - \frac{\mathbf{R}_m^T \mathbf{R}^\varsigma_m}{\|\mathbf{R}_m\|\, \|\mathbf{R}^\varsigma_m\|},$   (9)

where $\mathbf{R}_m$ and $\mathbf{R}^\varsigma_m$ respectively denote the feature vector of an unknown sample and that of the ςth class, C_ς is the set of templates of the ςth class, ‖·‖ indicates the Euclidean norm, and $\varphi(\mathbf{R}_m, \mathbf{R}^\varsigma_m)$ is the cosine similarity measure. Using the similarity measure $\varphi(\mathbf{R}_m, \mathbf{R}^\varsigma_m)$, the feature vector R_m is classified into the τth class. According to the above section, we extract FFVs at two filter scales considering the variability of the finger-vein network. Therefore, fusion of the matching results based on R_1 and R_2 may improve the identification performance. Nowadays, many approaches have been proposed for multi-biometric fusion, such as the Bayes algorithm, the KNN classifier, the OS-Rule, the SVM classifier, the Decision Templates algorithm, and the Dempster-Shafer (D-S) algorithm. Compared to other approaches, D-S evidence theory works better in integrating multiple evidences for decision making. Aiming to weaken the degree of evidence conflict, we have proposed
an improved scheme in [15] and obtained better fusion results for fingerprint recognition. Hence, we also adopt D-S theory here to implement decision-level fusion for finger-vein identification. Details on D-S theory can be found in [16], [17].
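The matching rule of Eq. (9) is a nearest-template search under cosine distance. The following few lines give an illustrative numpy version for a single scale m (hypothetical helper names); the decision-level D-S fusion of the two scales described above is not shown.

```python
import numpy as np

def cosine_distance(a, b):
    """phi in Eq. (9): 1 minus the cosine similarity of two feature vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def classify_ffv(r, class_templates):
    """Assign FFV r to the class whose templates contain its nearest match.

    class_templates: dict mapping class label -> list of enrolled FFVs."""
    best_label, best_dist = None, np.inf
    for label, templates in class_templates.items():
        d = min(cosine_distance(r, t) for t in templates)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```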
5 Experiments
We build a finger-vein image database which contains 2100 finger-vein images from 70 individuals. Each individual contributes 30 finger-vein images from three different fingers: forefinger, middle finger and ring finger (10 images per finger) of the right hand.

5.1 Parameter Selection of Gabor Filters
For a specific application, the parameters f_0, σ, γ and θ usually govern the optimal output of a Gabor filter (see Eq. (1)). Therefore, these parameters should be determined properly to ensure the discriminability of the FVCodes in classification. Considering that both the diameters and the spread manners of the vessels are highly random, γ is set to one (i.e., the Gaussian function is isotropic) to reduce the diameter deformation arising from an elliptic Gaussian envelope, θ varies from zero to π with a π/8 interval (that is, the even-symmetric Gabor filters are embodied in eight channels), and the center frequency f_{mk} varies with the channels (orientations). To determine the relation between σ and f_{mk}, a scheme proposed in [14] is used here, which is defined as follows:

$\sigma f_{mk} = \frac{1}{\pi} \sqrt{\frac{\ln 2}{2}} \cdot \frac{2^{\phi_{mk}} + 1}{2^{\phi_{mk}} - 1},$   (10)

where $\phi_{mk}$ (∈ [0.5, 2.5]) denotes the spatial frequency bandwidth (in octaves) of a Gabor filter at the kth channel and mth scale. In our experiments, σ_1 is set to five pixels, σ_2 is set to six pixels, and the following constraint is imposed on φ_•:

$\phi_{\bullet 1} < \phi_{\bullet 5} < \phi_{\bullet 2} < \phi_{\bullet 3} < \phi_{\bullet 4}, \qquad \phi_{\bullet 2} = \phi_{\bullet 8}, \qquad \phi_{\bullet 3} = \phi_{\bullet 7}, \qquad \phi_{\bullet 4} = \phi_{\bullet 6}.$   (11)
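Eq. (10) fixes the product σ·f once a bandwidth is chosen, so the center frequency of each channel follows directly from its σ_m and φ_mk. A one-function illustration (assumed example values only) is given below.

```python
import math

def center_frequency(sigma, bandwidth_octaves):
    """Center frequency implied by Eq. (10) for a given scale sigma (in pixels)
    and spatial-frequency bandwidth phi (in octaves, typically in [0.5, 2.5])."""
    phi = bandwidth_octaves
    sigma_f = (1.0 / math.pi) * math.sqrt(math.log(2.0) / 2.0) \
              * (2.0 ** phi + 1.0) / (2.0 ** phi - 1.0)
    return sigma_f / sigma

# e.g. sigma_1 = 5 pixels with an assumed 1-octave bandwidth:
print(center_frequency(5.0, 1.0))   # cycles per pixel
```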
5.2 Experimental Results
Due to the high randomness of the finger-vein networks, the discriminability of the proposed FFV may be embodied not only across different individuals but also across different fingers of an identical individual. So, to investigate the differences among the forefinger, middle finger and ring finger, 5 finger-vein images from each finger are selected as testing samples while the rest are used for training. Based on the extracted FFVs, the classification results are listed in Table 1, where
Table 1. Finger-vein image classification
                  F-finger(350)  M-finger(350)  R-finger(350)  FAR(%)
m=1   F-finger    346 (98.86%)   3              2              0.714
      M-finger    2              343 (98.00%)   7              1.286
      R-finger    2              4              341 (97.43%)   0.857
m=2   F-finger    344 (98.29%)   3              3              0.857
      M-finger    4              342 (97.71%)   8              1.857
      R-finger    2              5              339 (96.86%)   1.000
FRR(%)      m=1   1.143          2.000          2.571
            m=2   1.714          2.286          3.143
F-finger, M-finger and R-finger respectively represent the forefinger, middle finger and ring finger, and FRR and FAR respectively represent the False Rejection Rate and the False Acceptance Rate. From Table 1, we can see that the forefingers have the best classification capability, while the middle fingers appear better than the ring fingers in CCR (Correct Classification Rate) but worse than the ring fingers in FAR. Certainly, based on the adopted cosine similarity measure, CCRs of 97.62 percent and 98.097 percent respectively are achieved using all test samples at the two scales. This shows not only that the proposed FFV exhibits significant discriminability but also that every finger is suitable for personal identification. Hence, the finger-vein images from different fingers are viewed as coming from different individuals in the subsequent experiments. Thus, the database is expanded manually to 210 subjects with 10 finger-vein images per subject. To obtain an unbiased estimate of the true recognition rate, a leave-one-out cross-validation method is used here. That is, leaving one example out sequentially and training on the rest accordingly, we conduct a classification of the omitted example. Considering that the CMS (Cumulative Match Score) proposed in [18] is more general in measuring classification performance, we therefore use it to evaluate the proposed finger-vein recognition algorithm. The CMS reports the correct match probability (CMP) corresponding to the ranked n matches, and the CCR is equivalent to the first CMP (rank = 1). Fig. 6 demonstrates the performance of the proposed method in identification and verification (for ranks up to 10). From Fig. 6, we can see that the results using the FFV at the first scale are somewhat better than those at the second. This is because the features corresponding to the thin veins are extracted effectively at the first scale but are neglected at the second scale. Furthermore, the performance of both identification and verification is further improved by the decision-level fusion, especially in terms of FAR. This demonstrates that fusion of the FFVs at the two scales can improve the reliability of identification significantly. Hence, finger-vein recognition technology deserves further attention in security applications.
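Since the CMS curve is simply the fraction of queries whose correct class appears within the top-n ranked matches, it can be computed in a few lines; the sketch below is an illustrative helper, not tied to the authors' evaluation code.

```python
import numpy as np

def cumulative_match_scores(distances, true_labels, gallery_labels, max_rank=10):
    """Cumulative match scores for ranks 1..max_rank.

    distances: (Q, G) matrix of query-to-gallery distances (e.g. Eq. (9) values);
    true_labels: (Q,) correct class of each query; gallery_labels: (G,) class of
    each gallery template. The rank-1 score equals the correct classification rate."""
    distances = np.asarray(distances)
    order = np.argsort(distances, axis=1)                 # best matches first
    ranked = np.asarray(gallery_labels)[order]            # labels in ranked order
    hits = ranked == np.asarray(true_labels)[:, None]
    first_hit = hits.argmax(axis=1)                       # rank of first correct match
    first_hit[~hits.any(axis=1)] = distances.shape[1]     # query with no correct match
    return np.array([(first_hit < r).mean() for r in range(1, max_rank + 1)])
```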
[Fig. 6 plots: (a) identification, cumulative match scores versus rank; (b) verification, FRR versus FAR; curves: fusion, m=1, m=2.]
Fig. 6. The results of identification and verification
6 Conclusion
We have presented a new method of personal identification based on finger-vein recognition in this paper. First, a stable region representing the finger-vein network in the image plane was determined. Then, a bank of Gabor filters was used to exploit the underlying finger-vein characteristics, considering the variations of finger-vein networks, and both local and global finger-vein features were extracted to form FFVs. Finally, finger-vein recognition was implemented using the nearest cosine classifier, and a fusion scheme at the decision level was adopted to improve the reliability of identification. Experimental results have shown that the proposed method performs well in personal identification.
Acknowledgements This work is jointly supported by NSFC (Grant No. 60605008), TJNSF (Grant No. 07JCYBJC13500, 07ZCKFGX03700), CAUC projects (Grant No. 05qd02q, 05yk22m, 07kys01).
References
1. Xu, M., Sun, Q.: Vasculature Development in Embryos and Its Regulatory Mechanisms. Chinese Journal of Comparative Medicine 13(1), 45–49 (2003)
2. Zharov, V., Ferguson, S., Eidt, J., Howard, P., Fink, L., Waner, M.: Infrared Imaging of Subcutaneous Veins. Lasers in Surgery and Medicine 34(1), 56–61 (2004)
3. Kono, M., Memura, S.U., Miyatake, T., Harada, K., Ito, Y., Ueki, H.: Personal Identification System. US Patent No. 6813010, Hitachi, United States (2004)
4. Miura, N., Nagasaka, A.: Feature Extraction of Finger-vein Pattern Based on Repeated Line Tracking and Its Application to Personal Identification. Machine Vision and Applications 15(4), 194–203 (2004)
5. Miura, N., Nagasaka, A., Miyatake, T.: Extraction of Finger-vein Patterns Using Maximum Curvature Points in Image Profiles. IEICE Transactions on Information and Systems, 1185–1194 (2007)
6. Lian, Z., Rui, Z., Yu, C.: Study on the Identity Authentication System on Finger Vein. In: International Conference on Bioinformatics and Biomedical Engineering, pp. 1905–1907 (2008)
7. Zhang, Z., Ma, S., Han, X.: Multiscale Feature Extraction of Finger-Vein Patterns Based on Curvelets and Local Interconnection Structure Neural Network. In: International Conference on Pattern Recognition, pp. 145–148 (2006)
8. Vlachos, M., Dermatas, E.: Vein Segmentation in Infrared Images Using Compound Enhancing and Crisp Clustering. In: Gasteratos, A., Vincze, M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 393–402. Springer, Heidelberg (2008)
9. Jie, Z., Ji, Q., Nagy, G.: A Comparative Study of Local Matching Approach for Face Recognition. IEEE Transactions on Image Processing 16(10), 2617–2628 (2007)
10. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personal Identification Based on Iris Texture Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(12), 1519–1533 (2003)
11. Jain, A.K., Chen, Y., Demirkus, M.: Pores and Ridges: High-Resolution Fingerprint Matching Using Level 3 Features. IEEE Trans. on Pattern Analysis and Machine Intelligence 29(1), 15–27 (2007)
12. Laadjel, M., Bouridane, A., Kurugollu, F., Boussakta, S.: Palmprint Recognition Using Fisher-Gabor Feature Extraction. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1709–1712 (2008)
13. Shi, Y., Yang, J., Wu, R.: Reducing Illumination Based on Nonlinear Gamma Correction. In: IEEE International Conference on Image Processing, pp. 529–532 (2007)
14. Daugman, J.G.: Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by 2D Visual Cortical Filters. Journal of the Optical Society of America 2(7), 1160–1169 (1985)
15. Ren, X., Yang, J., Li, H., Wu, R.: Multi-fingerprint Information Fusion for Personal Identification Based on Improved Dempster-Shafer Evidence Theory. In: IEEE International Conference on Electronic Computer Technology, pp. 281–285 (2009)
16. Yager, R.: On the D-S Framework and New Combination Rules. Information Sciences 41(2), 93–138 (1987)
17. Brunelli, R., Falavigna, D.: Person Identification Using Multiple Cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995)
18. Phillips, J., Moon, H., Rizvi, S., Rause, P.: The FERET Evaluation Methodology for Face Recognition Algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000)
Author Index
Abe, Toru III-611 Aganj, Ehsan II-468, III-667 Agrawal, Prachi III-266 Ahuja, Narendra I-123 Akbas, Emre I-123 Allain, Pierre II-279 An, Yaozu III-475 Arandjelovi´c, Ognjen III-203 Ariki, Yasuo II-291 Arita, Daisaku I-201 Audibert, Jean-Yves II-502
Bagarinao, Epifanio III-363 Bai, Xiang III-456 Baradarani, Aryaz III-226 Barnes, Nick II-335 Beetz, Michael II-247 Ben Ayed, Ismail III-172 Billinghurst, Mark II-1 Binder, Alexander III-351 Bischof, Horst I-281, II-477, III-655 Brefeld, Ulf III-351 Brooks, Rupert III-436 Bujnak, Martin I-13
Cai, Ling III-21 Cao, Hui II-628 Cao, Jian II-576 Cao, Tian III-130 Cao, Xiaochun II-536 Cao, Yang I-224 Cao, Yuqiang II-526 Caputo, Barbara I-269 Chai, Jinxiang I-71 Chaillou, Christophe II-120 Chang, I-Cheng II-257 Chari, Visesh II-34 Charvillat, Vincent I-1 Chauve, Anne-Laure II-502 Chen, Duowen I-113 Chen, Jianbo II-608 Chen, Jiazhou II-697 Chen, Ju-Chin II-98 Chen, Kai II-608
Chen, Mei I-303 Chen, Songcan III-1 Chen, Yen-Lin I-71 Cheung, Sen-ching S. I-37 Choi, Jin Young II-130 Chu, Chien-Hung III-85 Chu, Yu-Wu III-621 Cleju, Ioan III-426 Corpetti, Thomas II-279 Courchay, Jérôme II-11 Courty, Nicolas II-279
Da, Bangyou III-570 Da, Feipeng III-581 Dai, Yuchao II-335 Dai, Yuguo III-130 Dai, Zhenwen III-96 Derpanis, Konstantinos G. II-301 Diepold, Klaus II-44 Di, Huijun III-548 Ding, Jundi III-1 Dixit, Mandar II-140 Do, Ellen Yi-Luen I-313 Dong, Ligeng III-548 Donoser, Michael I-281, III-655
Emmanuel, Sabu III-538
Fang, Chih-Wei III-85 Fang, Chin-Hsien II-98 Fan, Ping III-118 Fan, Shufei III-436 Feng, Jufu III-591 Feng, Wei II-707 Ferrie, Frank P. III-436 Frahm, Jan-Michael I-157 Fujimura, Kikuo II-267 Fujiyoshi, Hironobu II-655 Fukui, Kazuhiro I-323 Funatomi, Takuya III-140 Furukawa, Ryo III-516 Fu, Yun I-364, III-236 Gambini, Andrea II-371 Gao, Jizhou I-37
Garcia, Vincent II-514 Geng, Yanlin III-33 Giardino, Simone II-371 Gigengack, Fabian II-438 Gong, Weiguo II-526 Grasset, Raphael II-1 Guan, Haibing II-608 Guénard, Jérôme I-1 Guo, Guodong III-236 Guo, Jun II-546, III-321 Guo, Peihong III-496 Guo, Xiaojie II-536 Gurdjos, Pierre I-1 Hamada, Chieko III-611 Hancock, Edwin R. II-23, III-373 Hao, Pengwei I-354, II-172, III-33 Hartley, Richard II-335 Hassner, Tal II-88 He, Lei III-21 He, Shihua I-354 Higashikubo, Masakatsu III-363 Hou, Xiaoyu III-311 Hsieh, Sung-Hsien III-85 Hsu, Gee-Sern III-560 Hua, Gang II-182 Huang, Jia-Bin I-48 Huang, Jianguo III-118 Huang, Kaiqi I-180, II-586 Huang, Tianci III-75 Huang, Xinyu I-37 Huang, Yongzhen I-180 Hua, Wei II-697 Hua, Xian-Sheng III-485 Hu, Fuqiao III-506 Hung, Yi-Ping III-560 Huo, Hongwen III-591 Hu, Weiming I-103, I-343, II-236, II-667, III-527 Hu, Zhanyi II-66 Ikenaga, Takeshi III-75 Ikeuchi, Katsushi I-190, I-234 Inayoshi, Hiroaki III-363 Islam, Ali III-172 Islam, Md. Monirul II-448 Iwai, Yoshio III-65 Jakkoju, Chetan II-34 Jawahar, C.V. II-34
Jeong, Yekeun I-25 Jiang, Ping II-687 Jiang, Wei II-347, III-395 Jiang, Xiaoyi II-438 Jiang, Xiaoyue III-118 Jiang, Zhiguo III-162 Jiang, Zhongding II-56 Jia, Yunde I-103, III-11 Jie, Luo I-269 Jin, Cheng I-333 Jing, Xuan III-538 Kakusho, Koh III-140 Kanade, Takeo II-655 Kawai, Norihiko II-359 Kawai, Yoshihiro III-406 Kawanabe, Motoaki III-351 Kawanishi, Yasutomo III-140 Kawasaki, Hiroshi III-516 Keriven, Renaud II-11, II-468, II-502, III-644, III-667 Kim, Junae III-299 Kim, Soo Wan II-130 Kim, Tae Hoon III-416 Kinoshita, Tetsuo III-611 Klein Gunnewiek, Rene II-381 Kluckner, Stefan II-477 Kong, Yu I-103 Kontschieder, Peter III-655 Kukelova, Zuzana I-13 Kukenys, Ignas III-331 Kurita, Takio III-363, III-384 Kweon, In So I-25 Lai, Jian-Huang III-601 Lan, Kunyan II-546 Lan, Tian II-66 Lao, Shihong III-506 Lee, Kyoung Mu III-416 Lee, Ping-Han III-560 Lee, Sang Uk III-416 Liang, Siyu II-56 Li, Chenxuan II-608 Li, Chi III-570 Li, Chun-guang III-321 Li, Chunming I-293 Lien, Jenn-Jier James II-98, III-85 Li, Heping II-556 Li, Hongdong II-335 Li, Jing II-635
Lin, Shih-Yao II-257 Lin, Tong III-33 Lin, Zhouchen III-33, III-311 Li, Ping II-381 Li, Shuo III-172 Liu, Dong C. III-130 Liu, Duanduan II-110 Liu, Hongmin III-448 Liu, Huanxi III-246 Liu, Lin III-43 Liu, Peijiang III-108 Liu, Risheng III-311 Liu, Rui III-485 Liu, Shaohui III-152 Liu, Tie II-193 Liu, Tyng-Luh III-621 Liu, Wan-quan III-601 Liu, Wentai I-258 Liu, Wenyu III-456 Liu, Xiangyang I-61 Liu, Yang II-667 Liu, Yuehu III-466 Liu, Yuncai I-364, II-214, II-313, III-246 Liu, Yushu II-576, II-645 Liu, Zhi-Qiang II-707 Liu, Zicheng II-182 Li, Wanqing III-193 Li, Wei III-256 Li, Xi I-323, I-343, II-667, III-527 Li, Xiaoli III-581 Li, Xiong III-246 Li, Xuewei II-536 Li, Yangxi III-43 Li, Yin I-246, II-313 Li, Yuan II-687 Lu, Guojun II-448 Lu, Hong I-333 Lu, Hongtao I-61 Luo, Guan I-343 Luo, Tao II-427 Lu, Shaopei I-147 Lu, Wenting II-546 Lu, Yao III-475 Lu, Zhaoyang II-635 Lv, Xiaowei III-246 Machikita, Kotaro II-359 Macione, Jim I-293 Makihara, Yasushi II-204
Maruyama, Kenichi III-406 Matsui, Sosuke I-213 Matsukawa, Tetsu III-384 Matsushita, Yasuyuki I-234 Mattoccia, Stefano II-371 Mauthner, Thomas II-477 Maybank, Stephen I-343 Maybank, Steve II-236 Ma, Yi I-135 Ma, Yong III-506 Ma, Zheng II-160 McCane, Brendan III-331 Minhas, Rashid III-226 Minoh, Michihiko III-140 Miyazaki, Daisuke I-234 Mobahi, Hossein I-135 Monasse, Pascal II-11, II-468 Mooser, Jonathan II-1 Mori, Greg II-417 Morin, Géraldine I-1 Mu, Guowang III-236 Mukaigawa, Yasuhiro III-287 Naemura, Takeshi II-489 Nagahara, Hajime III-287 Nagahashi, Tomoyuki II-655 Naito, Takashi II-628 Nakayama, Toshihiro III-516 Narayanan, P.J. III-266, III-633 Nelakanti, Anil II-34 Neumann, Ulrich II-1 Neumegen, Tim III-331 Ngo, Thanh Trung III-287 Nguyen, Duc Thanh III-193 Nguyen, Quang Anh II-224 Nielsen, Frank II-514 Nijholt, Anton II-110 Ninomiya, Yoshiki II-628 Niu, Changfeng II-645 Niu, Zhibin I-246 Nock, Richard II-514 Noguchi, Akitsugu II-458 Ogunbona, Philip III-193 Okabe, Takahiro I-213 Okutomi, Masatoshi II-347, III-395 Oliver, Patrick III-548 Orabona, Francesco I-269 Pajdla, Tomas I-13 Pan, ChunHong I-83
Pan, Chunhong II-120 Pang, HweeHwa III-1 Peng, Bo II-677 Peng, Qunsheng II-697 Petrou, Maria III-341 Pham, Viet-Quoc II-489 Poel, Mannes II-110 Pollefeys, Marc I-157 Pons, Jean-Philippe II-11, II-502, III-667 Pourreza, Hamid Reza II-325 Pu, Jian III-496 Punithakumar, Kumaradevan III-172 Qi, Kaiyue II-608 Qin, Bo II-56 Qin, Xueying II-697 Qiu, Huining III-601 Qiu, Jingbang III-75 Rao, Shankar R. I-135 Ravyse, Ilse III-118 Ren, Zhang III-277 Rezazadegan Tavakoli, Hamed II-325 Ricco, Susanna III-214 Riemenschneider, Hayko I-281 Robles-Kelly, Antonio II-224 Ross, Ian III-172 Roth, Peter M. II-477 Ruan, Qiuqi III-256 Sagawa, Ryusuke III-287 Sahli, Hichem III-118 Sang, Nong III-570 Sarkis, Michel II-44 Sastry, S. Shankar I-135 Sato, Tomokazu II-359 Sato, Yoichi I-213 Saupe, Dietmar III-426 Schnieders, Dirk III-96 Seifzadeh, Sepideh III-226 Shan, Ying II-182 Shao, Ming III-108 Shen, Chunhua III-277, III-299 Sheng, Xingdong II-193 Shen, Jialie III-1 Shen, Shuhan II-214 Shi, Boxin III-43 Shimada, Atsushi I-201 Shimano, Mihoko I-213
Shimizu, Masao II-347, III-395 Shi, Wenhuan II-214 Shi, Yihua I-374 Shi, Zhenwei III-162 Smith, William A.P. II-23 Song, Jinlong III-506 Sun, Quansen I-293 Sun, Xi II-405 Su, Zhixun III-311 Taigman, Yaniv II-88 Takahashi, Keita II-489 Takamatsu, Jun I-190 Takiguchi, Tetsuya II-291 Tanaka, Tatsuya I-201 Taniguchi, Rin-ichiro I-201 Tan, Tieniu I-180, II-586 Tao, Dacheng I-180 Tao, Hai I-258 Tao, Linmi III-548 Thorstensen, Nicolas III-644 Tomasi, Carlo III-214 Tomita, Fumiaki III-406 Tonaru, Takuya II-291 Trumpf, Jochen II-335 Tseng, Chien-Chung II-98 Venkatesh, K.S. II-140 Vineet, Vibhav III-633 von Hoyningen-Huene, Nicolai II-247
Wang, Bo III-130, III-456 Wang, Daojing II-172 Wang, Guanghui I-169, II-78 Wang, Haibo II-120 Wang, Hanzi I-103, III-527 Wang, Junqiu II-204 Wang, Lei III-299 Wang, Li I-293 Wang, LingFeng I-83 Wang, Peng III-277 Wang, Qi II-405, III-53 Wang, Qing I-313 Wang, Qiongchen III-162 Wang, Te-Hsun III-85 Wang, Yang II-417 Wang, Yuanquan I-147, III-11 Wang, Yunhong III-108 Wang, Zengfu II-405, III-53
Wang, Zhiheng III-448 Wang, Zhijie III-183 Wei, Ping III-466 Wei, Wei II-150 Welch, Greg I-157 Wildes, Richard P. II-301 With, Peter de II-381 Wolf, Lior II-88 Wong, Hau-San I-93 Wong, Kwan-Yee K. III-96 Wu, Fuchao III-448 Wu, Huai-Yu II-427 Wu, HuaiYu I-83 Wu, Jing II-23 Wu, Q.M. Jonathan I-169, II-78, III-226 Wu, Si I-93 Wu, Xiaojuan III-256 Wu, YiHong II-66 Xia, Deshen I-293 Xia, Shengping III-373 Xia, Xiaozhen II-556 Xie, Lei II-707 Xiong, Huilin II-566 Xu, Chao III-43 Xue, Jianru II-160 Xue, Xiangyang I-333 Xu, Guangyou III-548 Xu, Mai III-341 Xu, Yao III-456 Xu, Yiren III-21 Yachida, Masahiko III-287 Yagi, Yasushi II-204, III-287 Yamaguchi, Koichiro II-628 Yamaguchi, Takuma III-516 Yamamoto, Ayaka III-65 Yamashita, Takayoshi I-201 Yanai, Keiji II-458 Yang, Allen Y. I-135 Yang, Heng I-313 Yang, Hua I-157 Yang, Jian II-677 Yang, Jie I-246, I-303 Yang, Jinfeng I-374 Yang, Jingyu III-1 Yang, Jinli I-374 Yang, Junli III-162 Yang, Lei I-303 Yang, Linjun III-485
Yang, Ming-Hsuan I-48 Yang, Niqing III-256 Yang, Ruigang I-37 Yang, Weilong II-417 Yang, Wuyi II-556 Yang, Xin II-566, III-21 Yang, Xu II-566 Yang, Yang I-303, III-466 Yang, Zhi I-258 Yan, Junchi I-246, II-313 Yao, Hongxun III-152 Yi, Kwang Moo II-130 Yokoya, Naokazu II-359 Yoon, Kuk-Jin I-25 You, Suya II-1 Yuan, Chunfeng I-343, III-527 Yuan, Xiaoru III-496 Yuan, Zejian II-193 Yu, Yinan II-586 Yu, Zhiwen I-93 Zha, Hongbin II-427 Zhai, Zhengang III-475 Zhang, Cha II-182 Zhang, Chao I-354, II-172 Zhang, Daqiang I-61 Zhang, David II-618 Zhang, Dengsheng II-448 Zhang, Geng II-193 Zhang, Hong III-183 Zhang, Hong-gang III-321 Zhang, Honggang II-546 Zhang, Hua II-110 Zhang, Jianzhou II-687 Zhang, Jiawan II-536 Zhang, Jing I-113 Zhang, Junping III-496 Zhang, Lei II-56, II-618, II-677 Zhang, Lin II-618 Zhang, Liqing I-224, II-395 Zhang, Peng III-538 Zhang, Shuwu II-556 Zhang, Wei I-169 Zhang, Xiangqun II-576 Zhang, Xiaoqin I-103, II-236 Zhang, Xu II-576 Zhang, Yanning II-150, III-118, III-538 Zhang, Yu II-635 Zhang, Yuanyuan III-256 Zhang, Zhengyou II-182
Zhang, Zhiyuan II-608 Zhang, Zhongfei II-667 Zhao, Danpei III-162 Zhao, Qi I-258 Zhao, Rongchun III-118 Zhao, Xu I-364 Zhao, Yuming III-21, III-506 Zheng, Bo I-190 Zheng, Enliang II-313 Zheng, Hong III-277 Zheng, Hongwei III-426
Zheng, Nanning I-303, I-323, II-160, II-193, III-466 Zheng, Songfeng II-596 Zheng, Yingbin I-333 Zhong, Bineng III-152 Zhong, Fan II-697 Zhou, Bolei II-395 Zhou, Jun II-224 Zhou, Yue I-246 Zhu, Youding II-267 Zia, Waqar II-44